Toward an analysis of datawarehouse and business intelligence challenges - part 1

(This post is a bit of a blast from the past. It was original published in 2010 and was lost during a migration in 2011.)

Since I work in and around datawarehousing and business intelligence, I've developed notes and thoughts over the years on the key challenges in these areas. New technologies and architectural approaches are drastically changing the landscape of the field and can help to address some of these challenges, but enterprise software vendors and customers are often not aware of new approaches or their applicability to the classic problems of the field, which continue to persist.

I'm starting to compile this list publicly, in what I hope will be a more-or-less living document, because I will start using it to evaluate the applicability of newly maturing technologies (in-memory, non-relational or NO-SQL databases, etc.) and architectures (map-reduce, streaming datawarehousing, etc.) to these old problems. This list is a survey, not an in-depth analysis of the problems. I may provide more in-depth analyses if it seems relevant, but I will more likely look for and point to references where they are available.

This is about half of my initial list and is in no particular order. I'll post the second half of the list shortly.

Data volume

This is a classic datawarehousing problem, often addressed through data modeling, aggregation, or compression. Even though it is one of the oldest problems of the field, it is by no means solved, or even optimally addressed. Enterprises continue to struggle with the cost and technical feasibility of scaling up their datawarehouses, often due to limitations of the underlying database technology.

Data quality

We may seem to be able to procure and store all of the data necessary, but that is no guarantee that the data is correct. This challenge has more to do with data being wrong than with data being misunderstood or semantically misaligned, though this is a related problem. Data quality issues can arise for many reasons including incorrect data at the point of entry, incomplete data, duplicate data, or data that becomes incorrect because of an invalid transformation.

Data consistency

Even when data is correctly stored in a datawarehouse, it may become temporarily inconsistent under certain operations. For example, when deleting or loading data, there may be a period of time when queries can access part of the data being loaded or deleted, but not all of it. This can be thought of as an inconsistent state, and while most datawarehousing tools ensure consistency in some manner, this is an area that may sometimes be traded for better handling of another challenging area. The classic tradeoff is between consistency, loading performance, and query performance.

Semantic integration

An oft-overlooked, but extremely important concept, semantic integration challenges comes in two flavors: Homonyms, meaning data that has the same name but different meanings (your "Revenue" may not be the same as my "Revenue"). Synonyms, meaning data that has the same meaning but different names.

Historical data

Dealing with historical data is a challenge that could be subsumed under other challenges. Usually the problems here are mostly issues of handling volume, master data management (changing dimensions), and semantic integration. However, historical data brings some unique angles to these challenges, including possible relaxation of requirements around performance and detail, as well as new legal, audit, and governance requirements around retaining or discarding specific data-sets.

Unstructured data

Datawarehouses have always focused on structured data, primarily because of a complete lack of tools for handling unstructured data rather than because of a philosophical view that unstructured data does not belong in a datawarehouse. This is not to say that the philosophical view does not exist, but rather that the philosophical view derives from an inability to execute rather than any underlying principle, and so should be ignored in light of new tools.

Unstructured data brings with it design constraints and requirements that do not normally appear in datawarehousing discussions. These include a lack of design-time information about dimensionality, the existence of non-numeric "key figures" (text- or image-based data, for example), document-oriented data, and the need for full-text search. Additionally, the challenge of derived dimensions and measures is strongly related to unstructured data, as these are key tools for allowing us to derive structured reporting and analysis from unstructured data-sets.