Sunday, May 25, 2008

Agile Data Warehouse: Intro

I've (successfully I think :) ) led a medium sized data warehouse project using Scrum. As I haven't found that much written about using agile methods for data warehouse projects (there are a few articles though) I'll write about our experiences in a series of postings.

So, can't you just use an agile method like Scrum in your data warehouse project? Yes, you can! But to be able to do that you first need to figure out how to create a decent product backlog for your DW project. And second, you need to use a number of agile practices, to have any chance what so ever to actually make your project agile. Third, operations tend to be complicated, if you don't take operations into account you will not get much agility in the end.

For application development most developers knows about TDD, refactoring, continuous integration etc., there are excellent examples and tutorials about how to do it and there are a lot of tool support for it. Database development in general and more specifically DW development are 5-10 years behind application development in this area. There are well thought out agile data base practices, but most DW developers don't know about them yet, and the tooling support is still quite primitive (at least Visual Studio have some support). As your customer probably pay you for delivering a DW rather than to further the state of art of agile DW practices, you have to chose agile practices that are reasonably simple to implement.

Operations for a DW can be complicated. It often integrates data from a number of different sources. The sources are often legacy systems and their reliability and data quality can vary. Timeliness for DW/BI systems is everything as the data often is used as the base for various important decisions. To succeed in making a DW project agile the operations must be involved and part of the project from the very start.

Before getting into the details about how to apply agile methods on DW projects I'll present a part of a DW that could be used as base for sales analysis for a retailer. It will be used to make the agile methods more concrete in the sequel.

When analyzing sales you want to know things like how much was sold; in total, during a particular month, of a certain product, to a specific customer segment etc. A convenient way to build a DB model that supports a wide range of such queries is to use dimensional modeling (aka a star schema).

For this example the finest granularity that we care about are receipt lines. Each receipt line contains a couple of values; amount, VAT, number of items etc. The sales transaction described by the receipt line took place at a certain date, in a particular store, a certain product was bought by a particular customer. A dimensional model for that could look like:
The ReceiptLine table in the middle is called a fact table. There will one row for each receipt line. The rows contains the various (typically numerical) values of the receipt line, e.g. amount. Values in fact tables are preferably chosen so that they or summable
(or combinable in some other manner like using average or count). The various attributes that can be used for indexing or describing the fact are put into dimension tables. In this case Product, Customer, Date and Store. Some attributes in a dimension can form a hierarchy, for the Date dimension a year consists of months that consists of dates as an example. In a real DW there will be a couple of star schema that capture different parts of the business process, there sill be more dimensions and numerous attributes in the dimensions.

The dimension model can be seen as a hypercube that contains facts that are spanned by the dimensions. The receipt line star is a four dimensional hypercube. The typical analysis queries will sum the fact values of a sub cube constrained by subsets of the dimension. There are special purpose databases called OLAP cubes that gives high performance for such "sub cube" queries by pre calculating sums for sub cubes at different aggregation levels along the hierarchies of the dimensions.

The data needed for building a DW often resides in various operational systems like customer registries, product registries, transaction DBs for the recepies. There might also be several sources with partly overlapping contents that needs to be merged. The major part of the work in a DW project is typically to get the data from the various sources to the dimension models. This is done with some ETL (Extract, Transform and Load) framework.

While loading the data from the sources to the dimensional model might seem simple, it can be complicated enough in practice. The source systems can be old, having data in obscure non documented formats, been "improved" continuously the last 15 years, have varying data quality and so on. The sources might come from different parts of the organization and there can be all kinds of politics involved. Also there might be a need to include all sorts of "manual data" contained in Excel sheets and other formats. Add to this that the amount of data can be huge (we have more than one billion rows in the largest fact table) and that updates typically are daily and that the customer depends on having correct data fast.

A typical ETL architecture can look like:
Here we can see a number of input sources that the data is extracted from to a raw DB, from that in turn data is transformed to fit the dimension model. Some middle results of the transformation might be stored in a staging DB. As dimension models storing transactions with fine granularity can be huge, often OLAP cubes and/or data marts are needed for getting good performance. Reporting, analysis and more specialized operational applications are connected to the DW (through dimensional models, OLAP cubes and data marts) to make the data available to users.

In the next posts I'll get into how to use agile methods for building a DW.


Fábio de Salles said...

I am now looking for agile methods to develop DW and I have found your post. I've enjoyed it and I am eager to read the next ones. So far, congratulations on your achievments. I'm just dumbfounded as how you've got no post until now!!

BIInterested said...

Awesome description!! Every detail is is mentioned in as simple way as possible

albina N muro said...

Agile Data Warehousing Project Management will give you a thorough introduction to the method as you would practice it in the project room to build a serious. survival warehouse location

John Barness said...

Today, many companies need virtual cloud-based data repositories. I think managing data warehouses is a task that has the highest priority for interested parties. In my experience, I have dealt with virtual data rooms many times and have to say it is a good solution.

Unknown said...

Thank y John, your link is amazing, my company now are using your service in all deals.
security online