Tuesday, May 27, 2008

Agile Data Warehouse: The product backlog

The largest hurdle for us to be able start using Scrum in our DW project was to create a reasonable product backlog. Our product owner created an excellent list of use cases together with some of the expert users and gave some rough order of priority.

Unfortunately there was a major problem (from product backlog point of view) with the use case list. Most of the use cases required that we already had loaded all of the dimension models with data. And the use cases with highest priority contained some tricky calculated measures that would furthermore require use to load and finish some complicated data marts. To make things even worse, the expert users made it clear that they needed full scale loading of real data to the DW to be able to give feedback.

To be able to deliver the most important use case we estimated that we would need to work a couple of months and build most of the DW and after that the project would be close to finished. User stories in a backlog should preferably have the size of a couple of days rather than a couple of months. So we where basically back into a waterfall project. Why was that?

The main work in many DW projects is to implement the ETL process. And this in turns means data flow programming, the data from the sources is refined, merged and transformed in a number of steps to the basic dimensional models. Then more refined measures and marts can in turn be calculated through a new set of transforms to more refined dimension models and marts. We can see that the very nature of DWs introduce dependencies between the different parts. No wonder that most DW project use waterfall or prototype + waterfall as methodology.

After our naive attempt to follow the use cases directly failed we focused on breaking down the (too) large user stories into small enough user stories. Our requirements on the user stories is that they still shall deliver some (small) user value and that they should be possible to demonstrate. There will be dependencies between some of the small user stories due to the nature of DWs. But for many of them the order and scope is possible to steer. So with the new list of small user stories the product owner have the possibility to decide about priorities again (with some constraints).

So how can one break down the large use cases to small enough user stories? First a DW consists typically of several star schema, some of which are constructed directly from basic source data and other are transformations and aggregations from other star schema. Here one can start by implementing the most important of the basic star schema. In our case that was the receipt line star.

A star might still be to much work, then one can chose the most important dimension (in our case the customer dimension) and start implementing that. While a single dimension might not be that exciting it is still possible to demonstrate and get customer feedback on. If even a single dimension is to much work to be accomplished on a couple of days, one can chose the most important attributes of a dimension and load them. That is still possible to demo.

If the fact table of the star is considered more important than the dimensions we have the problem that there are no dimensions to look up as they are not created yet. A prototype fact table can still be loaded in a staging table with the natural keys of the dimensions rather than the surrogate keys. On such a prototype fact table it is possible to demonstrate that sums of groupings are right.

Loading a single dimension can still be complicated if one use daily incremental loading with some history management. In such cases one can start with doing a initial load and add incremental loads and history management in later user stories. If the amount of data is large one can start with loading a small subset. If it is complicated to get data from the real source it might be possible to get at least part of the needed data in some simpler way (some existing reports, ad hoc SQL queries, synthetic data etc.)

In our team most members initially doubted that one could find fine grained user stories in DW projects, but it become soon quite natural.

Now given a product backlog with small enough user stories one can start to build the DW breadth first or depth first. With breadth first one choses one or few important stars and use every possible simplification (like the one discussed above) to fill it with some data. An advantage with this strategy is that we can get early user feedback on large parts of the design and also that we can start early to develop dependent systems like other dependent stars, data marts, OLAP cubes, reports and use the data as input to analysis tools or other applications. A drawback is that we don't deliver anything that is actually shippable.

With a depth first strategy one might start with a single dimension and refine it and its loading scheme through a sequence of user stories all the way to something that can be put into production. While putting a single dimension into production might seem rather pointless one might actually have some use for it. If we start with the customer dimension and construct a fact dimension from it in the OLAP cube, we now have the possibility to easily get answers on how many customer there are with different properties using a cube browser. And we also start to test our infrastructure, operations and logistics at an early stage, which means that we will get feedback on if that works early as well.

In our project we started breadth first with a "prototype sprint" and continued with depth first for some parts and breadth first for other parts. This depended on priority and what was actually possible considering various dependencies and constraints.

While it took some mental effort to leave the waterfall thinking with big deliverables, we quite soon learned to see all kinds of possibilities to cut the big deliveries down to small user stories that could be implemented in a few days. And as a result the project have been much more transparent both for the customer and the developer team. We have been able to involve stakeholders earlier. The project have been more steerable. And we are quite certain on that we have delivered much faster.

5 comments:

A. ELOMARI said...

interesting ! really a good job. Thank you for sharing your knowledge with us.

I was looking for theses details for a long time, i was pretty sure that applying scrum or any agile approach to BI/DW project is a challenging labour... your article gave me a lot off idea to complete the "reflexion" about how to apply agililty to the dimensional lifecycle.

Hope to read your next articles about budgeting and staffing a BI/DW team using the agile approach.

thanks again.

Fábio de Salles said...

Very good! I have read the Agile DW book by Ralph Hughes and he gives quite a very similar advice, but he fails to go as far as you have, reporting on the process of learning how to split epic stories down to user stories and categorize them.

The depth/breadth question is very usefull and a genius thing - congratulations!

I have came to some of your thinking, but I was wondering of how to conduct the demos. In my first attempting to agile develop a DW I have tought of a sprint giving a full star - bad. I would have had either poor quality dimensions and fact or limited ones. You've shown me how to cut the sprint shorter so I can deliver quality results albeit not complete ones before the next couple or so sprints.

Sad thing this is the last post!

Fábio de Salles said...

In this context, how would be "done" defined? Scrum requires that each sprint deliver a potentially shippable product increment. So, it should be a thoroughly tested process, individually actionable, and the operation of process is documeneted either in help files or user documentation.

Fábio de Salles said...

This former be for the DW (and its stars). There also should be something for data visualization...

Albina N muro said...

The largest hurdle for us to be able start using Scrum in our DW project was to create a reasonable product backlog. Our product owner created. survival-warehouse.com reviews