Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only 4 years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker illustrated many of the same concepts from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all data in the enterprise in its original format. The idea is that the processes of data transformation, cleansing, and normalization result in loss of information that may be useful at a later date. To avoid this loss of information, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce false signals that are latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result is duplicated work and duplicated errors. Long live the datawarehouse.

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes, as well as the source of the frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation - that transforming necessarily discards data, and therefore information, from the original data set. It is a response to a common frustration with datawarehousing: the data in the warehouse doesn't quite answer the question we are trying to ask, and if only we could get at the source data, we could reconstruct the dataset so as to be able to answer it. Sometimes that's true.

Usually, however, there are real issues with this approach. The transformation from raw data to datawarehouse (or other intermediate representation) may remove information, but it also adds significant information to the source data. Specifically, it adds or acts on information about which data can go together and how it can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external information about its quality, or adds interpolated data. Usually this additional information is not stored along with the source data set. It's rare that it's stored in an accessible way along with the resulting data in a datawarehouse. And I've seen almost no mention of maintaining this type of information in the context of a data lake. It is simply lost, or at best difficult to access.
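To make the point concrete, here is a minimal sketch (the record layout, field names, and interpolation rule are all hypothetical, not drawn from any particular tool) of a cleansing step that adds information - an interpolated value plus a quality flag recording how it was produced - which a raw-only data lake would never hold:

```python
from datetime import date

# Hypothetical raw sensor readings: (sensor_id, day, value); None = missing.
raw = [
    ("s1", date(2015, 3, 1), 10.0),
    ("s1", date(2015, 3, 2), None),
    ("s1", date(2015, 3, 3), 14.0),
]

def cleanse(records):
    """Fill gaps and record what was done - the 'added' information."""
    cleaned = []
    for i, (sensor, day, value) in enumerate(records):
        if value is None:
            # Interpolate from neighbors (assumes a gap has measured
            # neighbors) and say so. The judgement that these values may
            # be combined this way is information the raw data lacks.
            prev_v = records[i - 1][2]
            next_v = records[i + 1][2]
            cleaned.append({"sensor": sensor, "day": day,
                            "value": (prev_v + next_v) / 2,
                            "quality": "interpolated"})
        else:
            cleaned.append({"sensor": sensor, "day": day,
                            "value": value, "quality": "measured"})
    return cleaned
```

Discarding the `quality` flag later would turn the interpolated value back into an apparent measurement, which is exactly the kind of silent information loss described above.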

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that the data needed to answer our question was collected but then discarded. Rather, the problem is that the data was never collected: the measurements were never taken, or were taken in a way that makes them unusable for the question now at hand. Like the traveller in Frost's poem, we have limited resources. Collecting one measure is a road that precludes collecting other measures due to these limits. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.

The BW/BObj Rift - We're all just services

Eric Vallo has written a quite wonderful and balanced piece starting to dig into the rift between the BW and BusinessObjects BI communities over on the EVTechnologies blog. As Eric mentions, his post was the result of quite a bit of back-and-forth between the two of us, the other principals involved in the DS Layer site, and the wider SAP and BusinessObjects communities. Eric was kind enough to kick off the conversation, and I'll continue it here. Hopefully this will evolve into a longer public conversation.

So, let's get started!

Eric did a good job of laying out the different scenarios we tend to see out in the SAP & BusinessObjects customer ecosystem right now. While each of these scenarios is clear, I think there is still a huge amount of confusion within both the BW and BusinessObjects communities about what each set of tools does and what it should be used for. In other words, when do you go for each scenario or tool? In my opinion, this confusion stems from the reality that both toolsets do a lot of things, and there is a lot of overlap. You can build a full datawarehouse and BI reporting solution using only BusinessObjects tools. And you can do the same using only BW tools. Or you can mix-and-match and pull in 3rd-party tools as well. But each toolset certainly has its strengths and weaknesses.

My take on BW's strengths and weaknesses is, perhaps, a bit out of the mainstream. But I think it is justified and, importantly, in line with SAP's vision for its data platform. Specifically, I see BW as a set of data warehousing, data management, and systems management services. I admit readily that I've shamelessly stolen this way of describing the concept from others, but it fits so well that I have to use it. So, BW provides a set of services like the following very incomplete list:

  • An abstract, platform-independent data warehouse modeling environment
  • A service for imposing a semantic model on top of the data warehouse modeling environment
  • An analytic engine for executing queries on the data residing under the semantic model
  • Security
  • Data management services like near-line storage (the ability to pretty transparently persist data in a single semantic structure across two data-stores with remarkably different latency, structure, performance, and cost characteristics while still being able to query that data "live") or archiving.
  • Visualization and reporting services

The implementation of this service concept is far from perfect. In fact, today, most of these "services" aren't really recognizable as such because BW is such an integrated, intertwined monster. With BW on HANA, we are now starting to see some of these services exposed individually, and the result is rather impressive.

One advantage of seeing BW as a collection of services for the purposes of this discussion is that it becomes much less important whether BW continues to exist. When viewed this way, the whole discussion about whether or not SAP HANA is going to "kill" BW becomes moot. The important question is whether the services continue to exist, either as part of BW, part of the HANA platform, or in another way. It is the services that provide the value, not BW per se.

The same actually holds true for BusinessObjects. Here we have services like the following (again, incomplete):

  • A service for imposing a semantic model on existing data models
  • An analytical engine for executing queries on the data residing under the semantic model
  • Security
  • Services for moving data from one place to another and transforming that data
  • Services for tracking and analyzing data lineage and quality across multiple platforms
  • Visualization and reporting services

BusinessObjects took a much more service-oriented development approach from the beginning, so many of these services are much more obvious as individual BusinessObjects products or product components. You might recognize direct descriptions of Universes, Data Services, Information Steward, or the host of BusinessObjects reporting and visualization tools, whereas in the case of the BW services there is no individual tool we can install to provide the service.

You probably also notice some overlap with the list of services in BW. Hence, the confusion.

The really interesting part of this overlap is that some of these services are provided in pretty similar ways between the two platforms (the semantic layer) while others take a very different approach (visualization and reporting services) or simply don't exist on one platform or another (abstract data warehouse modeling environment, or tracking and analyzing data lineage). It's these different approaches or unique services that differentiate the platforms.

Until SAP matures its data platform strategy and makes some decisions about which overlapping services live, die, or merge on each platform, it's going to be up to customers to do the hard thinking necessary to make the best implementation decisions for them. It is, of course, important to know what services each platform offers, what technique the platform uses to offer those services, and what kind of performance and functional characteristics similar services have on the different platforms. I hope that the podcast series Josh Fletcher and I are working on will help with that discussion across customers. But in addition to these feature considerations, there are also cost, requirements, tooling, and expertise considerations that are going to be unique to each customer.

Personally, I think BW and BusinessObjects tools both have a lot to offer and we shouldn't discount one or the other when we're talking about addressing data warehousing or BI problems. As consultants, we'll have our personal preferences, but we should try not to let that influence our customer advisory roles when addressing early design and tool selection questions.

In the data warehousing space, my strong feeling is that BW's approach (if not BW itself) is a powerful way of making datawarehousing more accessible. It doesn't solve the basic problems of data modeling and data governance, and it can introduce its own set of problems. But BW does help abstract some really knotty technical and organizational problems, like data consistency, technical governance, platform-specific optimization, and technical change management, that must be addressed when building a data warehouse.

Meanwhile, the BusinessObjects approach of a really open, modern, and platform-agnostic BI and EIM toolset is a great one. Especially in the ETL and BI space, many of the services that BusinessObjects offers are years, even decades, ahead of where BW is with its more integrated and older transformation and reporting engines.

I'm hoping that the two communities can learn a lot from each other by interacting more both technically and socially. In other words, like Eric, I am really looking forward to feedback and teaching through other blogs or Twitter.

And now, back to you, Eric.

Toward an analysis of datawarehouse and business intelligence challenges - part 2

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

This is the second half of an analysis; together with my first post on the topic, it constitutes my first swipe at listing the current fundamental challenges of the datawarehousing and business intelligence fields. The list is in no particular order and will surely change in the future. It is conceived as the beginning of a framework from which to evaluate new or maturing technologies and architectures from the perspective of applicability to the field.

Aggregating silo-ed data sources

Silos silos silos. Anyone trying to do data analysis has run into this problem: the data exists, but we can't get at it. The technical aspects of this challenge are many (bandwidth, interfaces, and ETL), but it's worth noting that they are usually dwarfed by the cultural and organizational obstacles (default against sharing, departmental rivalries), many of which are in place for good reason (security and permissions concerns, privacy laws).

Representing data in a meaningful way

Historically this feels like one of the least-addressed challenges, but we are finally seeing some serious attention paid to this problem. Challenges in representation of data range from visualization (and the related topic of responsible visualization - as visualization is too often untruthful), to analytical views and tools, through search and guided data exploration.

As things stand, the data in datawarehouses and business intelligence datamarts is too often opaque and misunderstood by most users. Even the most impressive and advanced visualization and analysis tools (Gapminder, BusinessObjects Explorer, and QlikView, for example) are still highly guided constructs that are often only applicable to predetermined datasets. We have come a long way (finally) over the last decade, but we have a long way yet to go.

Representing reporting structures

Reporting structures are now fairly well understood, but representing them efficiently in our datawarehouses or BI tools remains a challenge. Some examples of such structures: reporting hierarchies, time-dependency, calculated measures, and derived or dependent characteristics. Challenges revolve around rollup and calculation performance, reorganization due to reporting structure changes, and accessibility to potential users.
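As a toy illustration of the rollup part of this challenge (the hierarchy and all names are hypothetical), aggregating a measure over a reporting hierarchy means climbing the structure for every leaf - which is also exactly the work that must be redone whenever the hierarchy is reorganized:

```python
# Hypothetical cost-center hierarchy: child -> parent, plus leaf-level amounts.
parent = {"team_a": "dept_1", "team_b": "dept_1", "dept_1": "company"}
amounts = {"team_a": 100.0, "team_b": 50.0}

def rollup(parent, amounts):
    """Aggregate leaf amounts up through every level of the hierarchy."""
    totals = dict(amounts)
    for leaf, value in amounts.items():
        node = parent.get(leaf)
        while node is not None:            # climb toward the root
            totals[node] = totals.get(node, 0.0) + value
            node = parent.get(node)
    return totals
```

Moving `team_b` under a different department changes the `parent` map and invalidates every total above it, which is why reorganization cost is listed alongside rollup performance.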

Performance

Traditionally this is the "big one" and it is still very much an unsolved problem. Bound by the CAP tradeoff, we are more or less forced to give up either consistency, availability, or partition-tolerance in order to improve performance under constant resources. Two approaches prevail: architectures that give up one or more of the three in exchange for performance, and architectures that attempt to better optimize for the problem-space in order to improve performance while maintaining all three CAP axes. Both are perfectly legitimate approaches, but it will be important to recognize which architectural approach is being pursued in any given product or technology. As a wise person once said, "there is no such thing as a free lunch".

Further complicating matters, there are multiple performance aspects of datawarehouse and business intelligence applications, and we need to be clear which ones we attempt to optimize for. These aspects include query performance (keeping in mind the random access vs. batch/bulk/scan access difference), data loading (ETL) and reorganization, and (in some systems) writeback or changing of data.

Security

Security models pose more of a management problem than a technical problem for datawarehouse and BI applications. Nonetheless, I think they're worth mentioning as a core challenge to keep in mind, just in case someone comes up with a way to make reasoning about security in analytical-processing-oriented datasets less painful.

Data loading

Last but certainly not least, data loading is a perennial headache in datawarehouse and BI systems. The three basic types of data loading (batch, real-time/streaming, and write-back/input) all to some extent conflict with each other. Add to that the complexity of managing a profusion of delta mechanisms (many of which exist for good reason, others of which exist because of careless design) and different interface formats and we've got ourselves a real party. Standardization of interfaces and design practices are the key touchstones of conquering this challenge, but as with many of these challenges, this is more of a human problem than a technical problem.

Conclusion - technical vs. design challenges

If we take one thing away from this enumeration of the challenges of the datawarehouse and business intelligence spaces, I hope it is the fact that most of these challenges are more human in nature than they are technical. They tend to derive from the difficulty in making tradeoff decisions, standardizing interfaces and architectures, identifying and focusing on the problem space, and understanding how people may actually use these systems to greatest effect. Because of this, these challenges are often at least as susceptible to design solutions as they are to pure technical solutions. There is a tendency in the industry to focus on technical answers to these challenges over design answers, perhaps because technical solutions are often more impressive and in some sense physical. I think that's unfortunate.

Toward an analysis of datawarehouse and business intelligence challenges - part 1

(This post is a bit of a blast from the past. It was originally published in 2010 and was lost during a migration in 2011.)

Since I work in and around datawarehousing and business intelligence, I've developed notes and thoughts over the years on the key challenges in these areas. New technologies and architectural approaches are drastically changing the landscape of the field and can help to address some of these challenges, but enterprise software vendors and customers are often not aware of new approaches or their applicability to the classic problems of the field, which continue to persist.

I'm starting to compile this list publicly, in what I hope will be a more-or-less living document, because I will start using it to evaluate the applicability of newly maturing technologies (in-memory, non-relational or NoSQL databases, etc.) and architectures (map-reduce, streaming datawarehousing, etc.) to these old problems. This list is a survey, not an in-depth analysis of the problems. I may provide more in-depth analyses if it seems relevant, but I will more likely look for and point to references where they are available.

This is about half of my initial list and is in no particular order. I'll post the second half of the list shortly.

Data volume

This is a classic datawarehousing problem, often addressed through data modeling, aggregation, or compression. Even though it is one of the oldest problems of the field, it is by no means solved, or even optimally addressed. Enterprises continue to struggle with the cost and technical feasibility of scaling up their datawarehouses, often due to limitations of the underlying database technology.

Data quality

We may seem to be able to procure and store all of the data necessary, but that is no guarantee that the data is correct. This challenge has more to do with data being wrong than with data being misunderstood or semantically misaligned, though the two are related. Data quality issues can arise for many reasons, including incorrect data at the point of entry, incomplete data, duplicate data, or data that becomes incorrect because of an invalid transformation.
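A small sketch of what a basic quality audit looks like (the records and rules here are hypothetical); the point is to classify defects - duplicates, incomplete records, values invalid at the point of entry - rather than silently drop them:

```python
# Hypothetical customer records illustrating common quality defects.
records = [
    {"id": 1, "email": "a@example.com", "age": 34},
    {"id": 1, "email": "a@example.com", "age": 34},   # duplicate
    {"id": 2, "email": "", "age": 31},                # incomplete
    {"id": 3, "email": "c@example.com", "age": -5},   # invalid at entry
]

def audit(records):
    """Return (id, issue) pairs; clean records pass through unflagged."""
    seen, issues = set(), []
    for r in records:
        if r["id"] in seen:
            issues.append((r["id"], "duplicate"))
        elif not r["email"]:
            issues.append((r["id"], "incomplete"))
        elif r["age"] < 0:
            issues.append((r["id"], "invalid"))
        else:
            seen.add(r["id"])
    return issues
```

Real data quality tooling is of course far richer than this, but even a sketch shows why each defect class needs its own detection rule.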

Data consistency

Even when data is correctly stored in a datawarehouse, it may become temporarily inconsistent under certain operations. For example, when deleting or loading data, there may be a period of time when queries can access part of the data being loaded or deleted, but not all of it. This can be thought of as an inconsistent state, and while most datawarehousing tools ensure consistency in some manner, this is an area that may sometimes be traded for better handling of another challenging area. The classic tradeoff is between consistency, loading performance, and query performance.
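One common technique for avoiding that window of inconsistency (a sketch of the general staging-and-swap idea, not any specific tool's mechanism) is to build the new version of the data off to the side and publish it atomically, so queries see either the old state or the new one, never a mix:

```python
import threading

class Table:
    """Queries read a published snapshot; loads build a new one on the side."""
    def __init__(self, rows):
        self._rows = rows
        self._lock = threading.Lock()

    def query(self):
        with self._lock:
            return self._rows           # always a complete snapshot

    def load(self, new_rows):
        staged = list(new_rows)         # build the full new version first
        with self._lock:
            self._rows = staged         # atomic swap: no half-loaded state
```

The tradeoff mentioned above shows up directly: the swap keeps queries consistent, but staging a full new version costs load time and space.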

Semantic integration

An oft-overlooked but extremely important concept, semantic integration challenges come in two flavors. Homonyms: data that has the same name but different meanings (your "Revenue" may not be the same as my "Revenue"). Synonyms: data that has the same meaning but different names.
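A tiny illustration (all system and field names are hypothetical): semantic integration amounts to mapping each source system's terms onto one canonical vocabulary, which resolves synonyms and disambiguates homonyms at the same time:

```python
# Two source systems name the same concept differently (synonyms),
# and reuse the same name for different concepts (homonyms).
mapping = {
    ("crm", "turnover"): "net_revenue",     # synonym: same meaning, new name
    ("erp", "net_revenue"): "net_revenue",
    ("crm", "revenue"): "gross_revenue",    # homonym: "revenue" means
    ("erp", "revenue"): "net_revenue",      # different things per system
}

def canonical(system, field):
    """Resolve a (system, field) pair to one enterprise-wide term."""
    return mapping[(system, field)]
```

The hard part is not the lookup but agreeing on the mapping itself, which is why this remains an organizational as much as a technical challenge.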

Historical data

Dealing with historical data is a challenge that could be subsumed under other challenges. Usually the problems here are mostly issues of handling volume, master data management (changing dimensions), and semantic integration. However, historical data brings some unique angles to these challenges, including possible relaxation of requirements around performance and detail, as well as new legal, audit, and governance requirements around retaining or discarding specific data-sets.

Unstructured data

Datawarehouses have always focused on structured data, primarily because of a complete lack of tools for handling unstructured data rather than because of a philosophical view that unstructured data does not belong in a datawarehouse. This is not to say that the philosophical view does not exist, but rather that the philosophical view derives from an inability to execute rather than any underlying principle, and so should be ignored in light of new tools.

Unstructured data brings with it design constraints and requirements that do not normally appear in datawarehousing discussions. These include a lack of design-time information about dimensionality, the existence of non-numeric "key figures" (text- or image-based data, for example), document-oriented data, and the need for full-text search. Additionally, the challenge of derived dimensions and measures is strongly related to unstructured data, as these are key tools for allowing us to derive structured reporting and analysis from unstructured data-sets.