Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only four years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker made many of the same points from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all of the enterprise's data in its original format. The rationale is that transformation, cleansing, and normalization lose information that may turn out to be useful at a later date; to avoid that loss, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce the false signals latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result is duplicated work and duplicated errors. Long live the datawarehouse.
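To make that concrete, here is a minimal sketch (using pandas, with hypothetical field names of my own invention) of the kind of cleansing pass a warehouse pipeline performs once - and that every project reading raw files out of a lake would have to rediscover and repeat:

    import pandas as pd

    def clean_orders(raw: pd.DataFrame) -> pd.DataFrame:
        """One team's cleansing pass: the work every lake consumer repeats."""
        df = raw.copy()
        # Normalize timestamps that arrive in mixed formats from source systems.
        df["ordered_at"] = pd.to_datetime(df["ordered_at"], utc=True, errors="coerce")
        # Drop test traffic and rows with no usable customer reference.
        df = df[df["customer_id"].notna() & ~df["is_test"].fillna(False).astype(bool)]
        # Standardize currency codes so amounts are comparable across regions.
        df["currency"] = df["currency"].str.upper().str.strip()
        return df

Different teams will make these choices differently; the point is that the choices get made somewhere - ideally once, not once per project.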

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes as well as the source of the frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation - that transforming necessarily discards data, and therefore information, from the original data set. It is a response to a common frustration with datawarehousing: the data in the warehouse doesn't quite answer the question we are trying to ask, and if only we could get at the source data we could reconstruct a dataset that does. Sometimes that's true.

Usually, however, there are real issues with this approach. The transformation from raw data to datawarehouse (or other intermediate representation) may remove some information, but it also adds significant information to the source data. Specifically, it encodes knowledge about which data can go together and how it can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external information about its quality, or adds interpolated values, and so on. Usually this additional information is not stored along with the source data set. It's rare for it to be stored in an accessible way even alongside the resulting data in a datawarehouse. And I've seen almost no mention of maintaining this type of information in the context of a data lake. It is simply lost, or at best difficult to access.
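As an illustration - the field names, sensor ids, and weights below are invented - consider how much a small transformation step adds that typically survives in neither the lake nor the warehouse:

    import pandas as pd

    def transform_readings(raw: pd.DataFrame) -> pd.DataFrame:
        """A transform that adds information rather than just removing it."""
        df = raw.copy()
        # Record which values were actually measured vs. filled in afterwards.
        df["interpolated"] = df["value"].isna()
        df["value"] = df["value"].interpolate()
        # Weight readings using external knowledge of each sensor's quality,
        # calibration information that lives outside the raw data set entirely.
        sensor_weights = {"sensor_a": 1.0, "sensor_b": 0.6}
        df["weight"] = df["sensor_id"].map(sensor_weights).fillna(0.5)
        return df

The interpolated flags and quality weights are exactly the kind of added information I mean: keep only the raw files and they vanish; keep only the output table and the rationale behind them vanishes.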

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that the data needed to answer our question was collected and then discarded. Rather, the problem is that the data was never collected: the measurements were never taken, or were taken in a way that makes them unsuitable for the question now at hand. Like the traveller in Frost's poem, we have limited resources; collecting one measure is a road that precludes collecting other measures. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.

It matters what you measure

It comes up a lot in the data analysis space, but it's worth remembering: it really matters what you measure.

Rarely have I seen as clear an example of that as today's claim from the app developer Snappli that the percentage of people using Apple Maps on iOS6 has dropped from 35% to 4%. The Guardian nails the story, noting that the claim may well be without merit.

What happened?

Well, it turns out that Snappli was measuring not the percentage of people using the Maps application each day, but the percentage of people using maps data each day. In iOS5, these percentages were effectively the same, because the Maps application almost always had to download new map tiles every time it was opened.

So Snappli happily thought they had a good proxy for maps usage, and over time they may have forgotten that their measurement was a proxy. They may have begun to assume that they were measuring actual maps usage. This happens all the time, and it's really easy to do. But it's pretty embarrassing when something changes, invalidating our assumption without our realizing it. That's when we start making false claims and looking pretty foolish.

That is what happened to Snappli. Because (surprise!) the Maps app in iOS6 doesn't download data very often. I just now threw my phone into airplane mode and zoomed into the last three cities I'd looked at over the last week and a half: Madison, WI; Minneapolis, MN; and Chicago, IL. In each of these cities I could see streets down to the lowest level of detail, including shops and points of interest. No data use.
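Here is a toy simulation of that failure mode; the numbers are invented, but they mirror the story. When tiles are fetched on every open, "used map data today" is a fine proxy for "opened Maps today"; add aggressive caching and the proxy collapses even though actual usage is unchanged:

    import random

    random.seed(0)
    users = 10_000
    # Assume 35% of users actually open the Maps app on a given day.
    opened_maps = [random.random() < 0.35 for _ in range(users)]

    def pct_seen_using_data(cache_hit_rate: float) -> float:
        """Share of users a network-level proxy counts as 'using Maps'."""
        seen = [o and random.random() > cache_hit_rate for o in opened_maps]
        return 100 * sum(seen) / users

    print(f"iOS5-like (tiles always fetched): {pct_seen_using_data(0.0):.0f}%")
    print(f"iOS6-like (tiles mostly cached):  {pct_seen_using_data(0.9):.0f}%")

Actual usage never moves in this simulation; only the proxy does.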

So it seems that what Snappli scored as a strike against the new Maps app should actually be counted as a point in its favor. And this whole saga can serve as a good reminder to those of us in the data business that we need to keep the assumptions behind our metrics and measurements in plain view as much as possible.

P.S. I realize there are some major problems with the data behind the Maps app in iOS6. However, I also think it's important to focus on actual problems and not made-up problems.