esjewett's blog

Different strokes for different folks: Optimal data storage in the datawarehouse

I've been doing a lot of thinking lately about optimal architectures for datawarehouse and analytics platforms. I'm particularly interested in the platform question because it is the chosen approach of most vendors and users in this space. Unlike Google, Facebook, and Yahoo!, most large companies are not building their own platforms, but are instead buying a platform and implementing their datawarehouse or analytics on top of it.

One thing that has become quite clear to me over the last year or so, is that the abstractions in this technology space leak like a seive. Actually, I'll amend that. All abstractions are leaky; under certain conditions they fail to abstract away details reliably and those details rear up their ugly heads and bite you unexpectedly. That's what interesting abstractions do: they incorporate the best way of doing things in a particular problem space and provide a simple interface so that people working the next level up the stack don't have to think about the issues one level down.

Leaky abstractions

In the case of datawarehousing, business intelligence, and analytics, the industry picked up abstractions that had been developed under other use cases and put them to work on something new and different. This happened, unfortunately, without too much thought about how well-suited these abstractions were to the job at hand. And lo, ugly heads were reared. People were bitten.

The most problematic abstraction is the data storage abstraction. That is, the Relational Database Management System (RDBMS), or in some more recent cases a Multi-dimensional Online Analytical Process (MOLAP) storage engine. Unfortunately there are a lot of cases where these storage abstractions leak through into the application layer. Actually, they don't leak so much as gush. Whole minor and not-so-minor technical disciplines have grown up around these abstraction breaches:

  • Dimensional Modeling - if you don't get it just right, your storage engine will eat your for lunch
  • Extraction Transformation and Loading - the contortions we go through to make reporting fast also make writing excruciating
  • Datawarehouse Reorganization - this is what happens when you get your dimensional modeling wrong (often because it turns out we can't predict the future)

There are a lot of reasons that we got this abstraction wrong. They range from laziness to frameworks (abstractions) like Zachmann that put the focus on the wrong set of conceptual abstraction layers for this sort of question.

But I'm not really here to talk about why it is wrong. There should be a way to get it pretty much right pretty easily: Use the best tool for the job.

The right tool for the job

In the datawarehousing community a lot of work has been done talking about the different types of data that are part of a datawarehouse. The goal of these conversations has generally been to assist with dimensional modeling and datawarehouse architecture. I'm beginning to think that this was completely the wrong focus. Instead, the results of these discussions should have been applied to fixing the storage layer and restoring a working abstraction, rather than trying to patch up the hopelessly broken abstraction we are currently dealing with.

How would this be done? Well, there are several basic types of data in a datawarehouse system:

  • OLAP data - fact tables and dimensions, traditionally
  • Detail data - usually a direct copy of data from transactional systems
  • Master data - reference information consisting of the set of members used in each dimension as well as properties of these members
  • Meta data - information about the datawarehousing system itself, including data model definitions, query definitions, etc.
  • Reporting Cache - data in a fast-access format optimized solely for reporting

Each type of data has characteristics that make it more amenable to a different type of storage engine. Even within the types (especially OLAP data) there are distinctions that make certain sorts of data amenable to different storage engines or differently tuned storage engines. Some rough thoughts about optimal storage engines for each type of data:

  • OLAP data - column-based, compressed, append-only
  • Detail data - column-based or row-based depending on the insert and reporting behavior desired
  • Master data - column-based, or row-based with heavy indexing on the property fields
  • Meta data - relational, row-based
  • Reporting Cache - key-value

We start to understand why Vertica is relatively very fast in the datawarehousing and analytics workload space compared to similarly architected but relational/row-based high-performance databases (3). [Yes, I did sort of tear this paper (3) apart in a previous blog, but here I am referring to it's valid comparison of two relatively mature and similarly targeted commercial distributed DBMSes.] However, a columnar store is not necessary well suited to some of the more mundane workloads of a datawarehousing system, like datawarehouse management (meta-data) and caching.

My conclusion is that datawarehouse, business intelligence, and analytics platforms would do well to facilitate an approach based on different types of storage engines for different types of workloads. There are certainly difficulties here, due to the lack of standard interfaces for column-based systems (usually SQL is not a fully supported interface and JDBC/ODBC interfaces are also few and far between), but at the least a platform should strive to bring one excellent OLAP storage engine to the table and provide pluggable interfaces for meta-data and caching storage engines.

References

  1. Ralph Kimball, Margy Ross. The DataWarehouse Lifecycle Toolkit, Wiley, 2nd ed., 2002
  2. The Zachman Framework on Wikipedia
  3. A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD'09, June 29-July 2, 2009, Providence, Rhode Island, USA. Copyright 2009 ACM 978-1-60558-551-2/09/06
Syndicate content