Tuesday, June 10, 2008

Bright developer

I was on Computer Sweden's list of Sweden's brightest developers :)

Monday, June 2, 2008

Tuesday, May 27, 2008

Agile Data Warehouse: The product backlog

The largest hurdle for us to be able start using Scrum in our DW project was to create a reasonable product backlog. Our product owner created an excellent list of use cases together with some of the expert users and gave some rough order of priority.

Unfortunately there was a major problem (from product backlog point of view) with the use case list. Most of the use cases required that we already had loaded all of the dimension models with data. And the use cases with highest priority contained some tricky calculated measures that would furthermore require use to load and finish some complicated data marts. To make things even worse, the expert users made it clear that they needed full scale loading of real data to the DW to be able to give feedback.

To be able to deliver the most important use case we estimated that we would need to work a couple of months and build most of the DW and after that the project would be close to finished. User stories in a backlog should preferably have the size of a couple of days rather than a couple of months. So we where basically back into a waterfall project. Why was that?

The main work in many DW projects is to implement the ETL process. And this in turns means data flow programming, the data from the sources is refined, merged and transformed in a number of steps to the basic dimensional models. Then more refined measures and marts can in turn be calculated through a new set of transforms to more refined dimension models and marts. We can see that the very nature of DWs introduce dependencies between the different parts. No wonder that most DW project use waterfall or prototype + waterfall as methodology.

After our naive attempt to follow the use cases directly failed we focused on breaking down the (too) large user stories into small enough user stories. Our requirements on the user stories is that they still shall deliver some (small) user value and that they should be possible to demonstrate. There will be dependencies between some of the small user stories due to the nature of DWs. But for many of them the order and scope is possible to steer. So with the new list of small user stories the product owner have the possibility to decide about priorities again (with some constraints).

So how can one break down the large use cases to small enough user stories? First a DW consists typically of several star schema, some of which are constructed directly from basic source data and other are transformations and aggregations from other star schema. Here one can start by implementing the most important of the basic star schema. In our case that was the receipt line star.

A star might still be to much work, then one can chose the most important dimension (in our case the customer dimension) and start implementing that. While a single dimension might not be that exciting it is still possible to demonstrate and get customer feedback on. If even a single dimension is to much work to be accomplished on a couple of days, one can chose the most important attributes of a dimension and load them. That is still possible to demo.

If the fact table of the star is considered more important than the dimensions we have the problem that there are no dimensions to look up as they are not created yet. A prototype fact table can still be loaded in a staging table with the natural keys of the dimensions rather than the surrogate keys. On such a prototype fact table it is possible to demonstrate that sums of groupings are right.

Loading a single dimension can still be complicated if one use daily incremental loading with some history management. In such cases one can start with doing a initial load and add incremental loads and history management in later user stories. If the amount of data is large one can start with loading a small subset. If it is complicated to get data from the real source it might be possible to get at least part of the needed data in some simpler way (some existing reports, ad hoc SQL queries, synthetic data etc.)

In our team most members initially doubted that one could find fine grained user stories in DW projects, but it become soon quite natural.

Now given a product backlog with small enough user stories one can start to build the DW breadth first or depth first. With breadth first one choses one or few important stars and use every possible simplification (like the one discussed above) to fill it with some data. An advantage with this strategy is that we can get early user feedback on large parts of the design and also that we can start early to develop dependent systems like other dependent stars, data marts, OLAP cubes, reports and use the data as input to analysis tools or other applications. A drawback is that we don't deliver anything that is actually shippable.

With a depth first strategy one might start with a single dimension and refine it and its loading scheme through a sequence of user stories all the way to something that can be put into production. While putting a single dimension into production might seem rather pointless one might actually have some use for it. If we start with the customer dimension and construct a fact dimension from it in the OLAP cube, we now have the possibility to easily get answers on how many customer there are with different properties using a cube browser. And we also start to test our infrastructure, operations and logistics at an early stage, which means that we will get feedback on if that works early as well.

In our project we started breadth first with a "prototype sprint" and continued with depth first for some parts and breadth first for other parts. This depended on priority and what was actually possible considering various dependencies and constraints.

While it took some mental effort to leave the waterfall thinking with big deliverables, we quite soon learned to see all kinds of possibilities to cut the big deliveries down to small user stories that could be implemented in a few days. And as a result the project have been much more transparent both for the customer and the developer team. We have been able to involve stakeholders earlier. The project have been more steerable. And we are quite certain on that we have delivered much faster.

Sunday, May 25, 2008

Agile Data Warehouse: Intro

I've (successfully I think :) ) led a medium sized data warehouse project using Scrum. As I haven't found that much written about using agile methods for data warehouse projects (there are a few articles though) I'll write about our experiences in a series of postings.

So, can't you just use an agile method like Scrum in your data warehouse project? Yes, you can! But to be able to do that you first need to figure out how to create a decent product backlog for your DW project. And second, you need to use a number of agile practices, to have any chance what so ever to actually make your project agile. Third, operations tend to be complicated, if you don't take operations into account you will not get much agility in the end.

For application development most developers knows about TDD, refactoring, continuous integration etc., there are excellent examples and tutorials about how to do it and there are a lot of tool support for it. Database development in general and more specifically DW development are 5-10 years behind application development in this area. There are well thought out agile data base practices, but most DW developers don't know about them yet, and the tooling support is still quite primitive (at least Visual Studio have some support). As your customer probably pay you for delivering a DW rather than to further the state of art of agile DW practices, you have to chose agile practices that are reasonably simple to implement.

Operations for a DW can be complicated. It often integrates data from a number of different sources. The sources are often legacy systems and their reliability and data quality can vary. Timeliness for DW/BI systems is everything as the data often is used as the base for various important decisions. To succeed in making a DW project agile the operations must be involved and part of the project from the very start.

Before getting into the details about how to apply agile methods on DW projects I'll present a part of a DW that could be used as base for sales analysis for a retailer. It will be used to make the agile methods more concrete in the sequel.

When analyzing sales you want to know things like how much was sold; in total, during a particular month, of a certain product, to a specific customer segment etc. A convenient way to build a DB model that supports a wide range of such queries is to use dimensional modeling (aka a star schema).

For this example the finest granularity that we care about are receipt lines. Each receipt line contains a couple of values; amount, VAT, number of items etc. The sales transaction described by the receipt line took place at a certain date, in a particular store, a certain product was bought by a particular customer. A dimensional model for that could look like:
The ReceiptLine table in the middle is called a fact table. There will one row for each receipt line. The rows contains the various (typically numerical) values of the receipt line, e.g. amount. Values in fact tables are preferably chosen so that they or summable
(or combinable in some other manner like using average or count). The various attributes that can be used for indexing or describing the fact are put into dimension tables. In this case Product, Customer, Date and Store. Some attributes in a dimension can form a hierarchy, for the Date dimension a year consists of months that consists of dates as an example. In a real DW there will be a couple of star schema that capture different parts of the business process, there sill be more dimensions and numerous attributes in the dimensions.

The dimension model can be seen as a hypercube that contains facts that are spanned by the dimensions. The receipt line star is a four dimensional hypercube. The typical analysis queries will sum the fact values of a sub cube constrained by subsets of the dimension. There are special purpose databases called OLAP cubes that gives high performance for such "sub cube" queries by pre calculating sums for sub cubes at different aggregation levels along the hierarchies of the dimensions.

The data needed for building a DW often resides in various operational systems like customer registries, product registries, transaction DBs for the recepies. There might also be several sources with partly overlapping contents that needs to be merged. The major part of the work in a DW project is typically to get the data from the various sources to the dimension models. This is done with some ETL (Extract, Transform and Load) framework.

While loading the data from the sources to the dimensional model might seem simple, it can be complicated enough in practice. The source systems can be old, having data in obscure non documented formats, been "improved" continuously the last 15 years, have varying data quality and so on. The sources might come from different parts of the organization and there can be all kinds of politics involved. Also there might be a need to include all sorts of "manual data" contained in Excel sheets and other formats. Add to this that the amount of data can be huge (we have more than one billion rows in the largest fact table) and that updates typically are daily and that the customer depends on having correct data fast.

A typical ETL architecture can look like:
Here we can see a number of input sources that the data is extracted from to a raw DB, from that in turn data is transformed to fit the dimension model. Some middle results of the transformation might be stored in a staging DB. As dimension models storing transactions with fine granularity can be huge, often OLAP cubes and/or data marts are needed for getting good performance. Reporting, analysis and more specialized operational applications are connected to the DW (through dimensional models, OLAP cubes and data marts) to make the data available to users.

In the next posts I'll get into how to use agile methods for building a DW.