Steps for making an Excel spreadsheet available for OLAP analysis with Pentaho's Mondrian

You don't always want to spend too much money or time on a more powerful or flexible ETL tool or BI Server solution.

So imagine the situation: you have a great data exploration tool connected to your blazingly fast data warehouse, but you find yourself with a large Excel spreadsheet (or large csv file) that you or your colleagues would like to use with the workspaces and dashboards of your BI Server. Maybe it is a data set that has been lovingly scrubbed offline, or a dataset from a less sophisticated third-party partner. Below are the steps for making this spreadsheet available for analysis within your BI Server environment:

  • Extract: Open your spreadsheet
  • Transform: Spreadsheet computations/formatting
  • Load: Get this data into your data warehouse
  • Publish & Explore: Prepare the Mondrian Schema so that this data mart is available for analysis with your OLAP tool

Step 1: Open your spreadsheet

Here is an example of some data curated each day by one of our clients:

Original Spreadsheet

This spreadsheet was manually created by an intern from roughly structured data reports. We’ll soon post a more “advanced” coding example of a more robust ETL process to this blog.

Step 2: Do some preliminary cleanup

  • We removed the subtotal rows that the intern had entered manually (we sorted the data and then deleted the rows beginning with “Total” in the first column)
  • We added some computed columns such as a formatted date, prices, lbs, etc. (the same cleanup is also sketched in SQL below)

Spreadsheet cleaned up
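If you would rather stage the raw rows in the database and do this cleanup there instead of in Excel, the same two steps can be sketched in SQL. This is only a rough sketch: the transactions_staging table and the raw_* column names are assumptions, not taken from the original spreadsheet.

-- Delete the manually entered subtotal rows
-- ('location' stands in for whatever the first column actually is).
DELETE FROM transactions_staging
WHERE location LIKE 'Total%';

-- Add the computed columns (a properly typed date and a numeric weight, for example).
ALTER TABLE transactions_staging
  ADD COLUMN sale_date DATE,
  ADD COLUMN lbs DECIMAL(12,2);

UPDATE transactions_staging
SET sale_date = STR_TO_DATE(raw_date, '%m/%d/%Y'),
    lbs       = CAST(REPLACE(raw_lbs, ',', '') AS DECIMAL(12,2));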

Step 3. Load the data into your database

3a. Export worksheet to csv file

3b. Create database and table

You need to create the data mart in your data warehouse. I tried to generate the table using a csv import tool, but my tool had problems creating decimal columns:

Spreadsheet in Database

There are many tools for this, of course, and you can also write a SQL DDL file by hand. We often just quickly fire up Sequel Pro for this, but we also like to use DbWrench.
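For reference, here is what a hand-written DDL for this fact table could look like (MySQL syntax assumed; the column names mirror the ones used in the Mondrian schema later in the post, and the types are guesses):

CREATE TABLE transactions (
  `date`     DATE,           -- reserved SQL keyword; renaming it is a good idea (see the note below)
  location   VARCHAR(100),
  species    VARCHAR(100),
  size       VARCHAR(50),
  lbs        DECIMAL(12,2),  -- the decimal columns my csv import tool choked on
  price_high DECIMAL(10,2),
  price_low  DECIMAL(10,2)
);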

3c. Import table from csv

Select the file…

Screenshot of file selection in Sequel Pro

Map the columns…

Screenshot of column mapping in Sequel Pro

We now have our data in the table (as a single fact table):

Screenshot of created SQL table

(Note: at this point I should have renamed the ‘date’ column, since it is a reserved SQL keyword.)
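If you would rather skip the GUI import wizard, MySQL can load the exported csv directly. A rough sketch follows; the file path, the delimiters, and the sale_date rename are illustrative assumptions, not part of the original walkthrough.

-- Load the exported csv straight into the fact table.
LOAD DATA LOCAL INFILE '/path/to/transactions.csv'
INTO TABLE transactions
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES
(`date`, location, species, size, lbs, price_high, price_low);

-- Fix the reserved-word column mentioned in the note above; `sale_date` is just an example name.
ALTER TABLE transactions CHANGE `date` sale_date DATE;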

Step 4. Create a Mondrian Schema for publishing to your BI Server

You can create your Mondrian schema file by hand in XML. There have been many tools for this step over the years (many of them quite buggy). Anyway, the Modeler tool now available in Spoon is good enough for this task.

4a. Download, install, and run the latest Pentaho Data Integration tools (i.e., Kettle, Spoon, etc.)

(You may first need to add your database driver to your installation; in my case, to this location: ~/Tools/data-integration/libext/)

4b. Create new model

Screenshot of creating a model in Spoon

4c. Define new data source

Screenshot of defining a data source

4d. Choose table

Go to the Analysis tab and hit ‘refresh’ and ‘auto populate’. Do the same for the ‘Reporting’ tab.

Screenshot of choosing a table

4e. Explore the data to see if the “automatic” module is good enough

Click on the “Visualize” perspective within Spoon, or test the model in your other OLAP tools. We tested in our Rubbi server with our default query builder:

Visualized data in Spoon
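Before hand-editing the schema, it can also help to sanity-check the cube’s numbers directly against the fact table. A minimal query of the kind the model should reproduce (using the max/min price aggregates we switch to in the schema edits below):

SELECT species,
       size,
       SUM(lbs)        AS total_lbs,
       MAX(price_high) AS max_price_high,
       MIN(price_low)  AS min_price_low
FROM transactions
GROUP BY species, size
ORDER BY species, size;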

4f. Manually Edit Your Mondrian schema

OK, you’ll notice the “automatic” option is not quite good enough. Let’s make some tweaks. Export the model from Spoon as a Mondrian schema XML file (‘File’ > ‘Export’ > ‘Mondrian Schema’). Then remove the columns that we want to be measures from the dimensions (Pentaho’s auto-mapping added these columns as measures AND dimensions). We remove these lines:

<DimensionUsage name="Lbs" source="Lbs"/>
<DimensionUsage name="Price high" source="Price high"/>
<DimensionUsage name="Price low" source="Price low"/>

To end up with this simple cube:

<Cube name="transactions">
  <Table name="transactions"/>
  <DimensionUsage name="Date" source="Date"/>
  <DimensionUsage name="Location" source="Location"/>
  <DimensionUsage name="Size" source="Size"/>
  <DimensionUsage name="Species" source="Species"/>
  <Measure name="Lbs" column="lbs" aggregator="sum" formatString="#"/>
  <Measure name="Price high" column="price_high" aggregator="sum" formatString="#"/>
  <Measure name="Price low" column="price_low" aggregator="sum" formatString="#"/>
</Cube>

And let’s add more meaningful default aggregators for the max and min price measures:

<Measure name="Price high" column="price_high" aggregator="max" formatString="#"/>
<Measure name="Price low" column="price_low" aggregator="min" formatString="#"/>

Finally, we’ll join the ‘species’ and ‘size’ dimensions into a single hierarchy (removing the separate Size dimension and nesting Size under Species):

<Dimension name="Species">
  <Hierarchy hasAll="true">
    <Table name="transactions"/>
    <Level name="Species" column="species" type="String" uniqueMembers="false"/>
    <Level name="Size" column="size" type="String" uniqueMembers="false">
    </Level>
  </Hierarchy>
</Dimension>

Re-import the schema into your OLAP tool or publish it to your BI Server. (Note that Spoon doesn’t support importing existing Mondrian schema files and “exploring” them very well.)

4g. Last step: adding a date dimension

Check out our post on creating a date dimension for adding a “dim_date” dimension table to our cube and linking it appropriately to our fact table. Let’s assume we have already created a date dimension called “dim_date” and have added a foreign-key column to our fact table (transaction_date int). You will then remove the autogenerated date dimension and add the new dimension to the schema:

<Dimension name="Time" type="TimeDimension">
  <Hierarchy hasAll="true" primaryKey="date_key">
    <Table name="dim_date"/>
    <Level name="Year" column="year" uniqueMembers="true" levelType="TimeYears" type="Numeric"/>
    <Level name="Month" column="month" uniqueMembers="false"  ordinalColumn="month" nameColumn="month_name"
            levelType="TimeMonths" type="Numeric"/>
    <Level name="Day" column="day_of_month" uniqueMembers="false" ordinalColumn="day_of_month"
            levelType="TimeDays" type="Numeric"/>
  </Hierarchy>
</Dimension>

Now connect this dimension to the cube:

<Cube name="transactions">
  ...
  <DimensionUsage name="Time" source="Time" foreignKey="transaction_date"/>
</Cube>
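On the database side, the only remaining work is adding that foreign-key column and back-filling it from the fact table’s original date values. A sketch, assuming dim_date carries a date_key (as in the hierarchy above) plus an assumed full_date calendar column:

-- Add the column referenced by foreignKey="transaction_date" above.
ALTER TABLE transactions ADD COLUMN transaction_date INT;

-- Back-fill it by matching each fact row's date against the date dimension.
-- full_date is an assumed column on dim_date holding the calendar date;
-- use your renamed date column here if you changed it earlier.
UPDATE transactions t
JOIN dim_date d ON d.full_date = t.`date`
SET t.transaction_date = d.date_key;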

Conclusion

That’s it! Test your schema in your BI Server or OLAP tool again. We hope these steps are useful for getting a “spreadsheet” of data into your BI Server as quickly as possible. If this isn’t a one-off task, you probably want to take some time to set up and/or learn to use a more “powerful” and repeatable solution, such as:

  • Coding some scripts using a parsing library,
  • Using an ETL tool such as Pentaho Data Integration or Talend, or
  • Moving to a more full-featured hosted BI Server solution, like GoodData or EazyBI, which can ingest your Excel spreadsheet for you.

These steps also highlight why you might want to purchase a BI Server with built-in capabilities to load a spreadsheet into your data mart for analysis.

This need comes up all the time, which is why the hosted BI Server vendors provide tools for this step… and we will soon with Rubbi (our lightweight BI server written in Ruby). We’ll also soon share some examples of more sophisticated ETL and code-level ways of automating this task.
