Store everything: the lie behind data lakes

I hope the data lake idea has passed the peak of its current, unfortunately large, round of hype. Gartner came down on the concept pretty hard last year, only 4 years after the term was coined by Pentaho's James Dixon. More recently, Michael Stonebraker made many of the same points from the perspective of a data management professional (note the excellent comment by @datachick). The frustration of conscientious data professionals with this concept is palpable.

The initial worry - wasted effort

The idea of a data lake is that we establish a single huge repository to store all data in the enterprise in its original format. The rationale is that the processes of data transformation, cleansing, and normalization result in a loss of information that may be useful at a later date. To avoid that loss, we should store the original data rather than transforming it before storing it.

This is a relatively noble goal, but it overlooks the fact that, done correctly, these processes also introduce clarity and reduce false signals latent in the raw data. If every data analysis project needs to recreate the transformation and cleansing process before doing its analysis, the result is duplicated work and duplicated errors. Long live the data warehouse.

The deeper problem - adding information

In my reading, this sums up the usual justification for data lakes as well as the source of frustration usually expressed by data professionals. But I think there is another issue with data lakes that is often overlooked: the data lake approach implies that transformation is an information-negative operation, that transforming necessarily discards data, and therefore information, from the original data set. The data lake is a response to a common frustration with data warehousing: the data in the warehouse doesn't quite answer the question we are trying to ask, and if only we could get at the source data we could reconstruct the data set in a way that does. Sometimes that's true.

Usually, however, there are real problems with this approach. The transformation from raw data to data warehouse (or other intermediate representation) may remove some information, but it also adds significant information to the source data. Specifically, it adds or acts on information about which data can go together and how it can be combined, and about how the data collection method may have influenced the data; it sometimes weights data according to external knowledge about its quality, adds interpolated values, and so on. This additional information is usually not stored along with the source data set, and it's rare that it is stored in an accessible way alongside the resulting data in a data warehouse. I've seen almost no mention of maintaining this kind of information in the context of a data lake. It is simply lost, or at best difficult to access.
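
To make this concrete, here's a minimal sketch of the kind of information a cleansing step can attach to a record. TypeScript is just a convenient notation and every field name is invented; the point is the shape of the added information, not any particular implementation.

```typescript
// A raw reading as it might land in a data lake: no context attached.
interface RawReading {
  sensorId: string;
  timestamp: string;   // however the source system chose to format it
  value: number;
}

// The same reading after a hypothetical cleansing step. The added fields are
// exactly the kind of information that rarely survives alongside the raw data.
interface CleansedReading {
  sensorId: string;
  timestamp: string;        // normalized to UTC so readings can be combined
  value: number;
  qualityWeight: number;    // down-weighted when the source is known to be unreliable
  interpolated: boolean;    // true if the value was filled in rather than measured
  comparableWith: string[]; // which other series this one may legitimately be joined to
  collectionNote?: string;  // e.g. "sampling rate changed mid-year"
}

// A sketch of a transform that adds this information rather than discarding it.
function cleanse(raw: RawReading, unreliableSources: Set<string>): CleansedReading {
  return {
    ...raw,
    timestamp: new Date(raw.timestamp).toISOString(),
    qualityWeight: unreliableSources.has(raw.sensorId) ? 0.5 : 1.0,
    interpolated: false,
    comparableWith: [],
  };
}
```

None of those fields exist in the source data; they are products of the transformation, and they have to live somewhere.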

Existential questions - If a tree falls...

Which brings me to the big lie around data lakes: storing everything. If a measurement is never taken, is it data? How many measurements are not taken for each measurement that is? Can a data lake store these un-measured measures?

The issue in data analysis, I find, is not so much that the data needed to answer our question was collected and then discarded. Rather, the problem is that the data was never collected: the measurements were never taken, or were taken in a way that makes them unsuitable for the question now at hand. Like the traveller in Frost's poem, we have limited resources. Collecting one measure is a road that precludes collecting other measures due to these limits. Storing everything, or even most things, is not possible.

Two roads diverged in a wood, and I—
I took the one less traveled by,
And that has made all the difference.
— Robert Frost, "The Road Not Taken"

Later, we often wonder if taking the other road would have allowed us to answer our current questions, but we can't go back and make different decisions about measurement. Data lakes don't change that. Maybe data lake vendors should switch to selling time machines.

Palladio

I've been working on this for a while, so it's probably worth posting something about it.

Palladio is a platform for data visualization and exploration designed for use by humanities researchers. We're in early beta at the moment and will be doing a series of releases throughout 2014. You can read about it and try it out here: http://palladio.designhumanities.org/

I think the website does a pretty good job of explaining the capabilities of the platform, so I'll leave that for the moment. I encourage you to go check it out before reading on, because understanding how the platform works will give some context to the discussion below.

The map view in the Palladio interface

So, why am I super-excited about this? Mostly because of the great team, which has the vision, technical skills, theoretical and domain knowledge, and information design chops to pull off this type of project. I consider myself lucky to be able to work with this group.

It's also great to be working on a project like this for a field that is simultaneously very strong on information theory and a bit underserved in terms of some types of tools. This is in stark contrast to my usual enterprise data management and visualization work where the theory tends to be weak but a plethora of tools exist.

In addition to trying to build a tool that incorporates important and underserved aspects of humanistic inquiry, I am excited to work with a team that buys into introducing state-of-the-art concepts around data exploration tools in general. Many of the concepts we are working to implement in Palladio are directly applicable to the types of data exploration problems we find in the enterprise and are concepts rarely expressed in existing tools. Palladio is a great example (one of many great examples) of how the process of humanistic inquiry can motivate the development of methods that are both technically and conceptually applicable in wildly different disciplines.

Interaction

The thing that initially most impresses people about Palladio is the way that filtering and movement are integral to the visualization. Specifically, the visualizations update and move in real-time as you filter. This is not a new concept, but I don't think I've ever seen it fully implemented in a general-purpose tool. Getting the level of movement right is a design challenge that the team is tackling as a work in progress, but in my opinion this characteristic of real-time updates and movement is a key feature for a data exploration tool, and few if any tools implement it.

I'll try not to get too squishy here, but this behavior of the tool allows a person to interact with the data in a very direct way, giving a feel for the data that would not otherwise exist. When you can see the results of your interactions with the data in real time, it is a lot easier to conceptually link step-changes and interesting events with the actions that caused them. For example, dragging a filter along the timeline component allows you to play back history at your own speed, speeding up or slowing down as suits you. My theory-fu is weak, but when you see it, I think you'll understand. Try it out with the sample data.
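
To give a flavor of the mechanics, here's a minimal sketch of the idea using Crossfilter and d3, assuming both are loaded as globals; the data, element ids, and filter window are invented, and this is not Palladio's actual code. Every movement of the slider narrows the filter and redraws the list immediately.

```typescript
declare const crossfilter: any;
declare const d3: any;

interface Row { name: string; year: number; }

const rows: Row[] = [
  { name: "Letter to Voltaire", year: 1759 },
  { name: "Letter to Diderot", year: 1762 },
  // ...tens of thousands of rows are fine in the browser
];

const cf = crossfilter(rows);
const byYear = cf.dimension((d: Row) => d.year);

// Redraw whatever currently passes the filters. Crossfilter keeps indexes,
// so this is cheap enough to call on every input event.
function redraw(): void {
  const visible: Row[] = byYear.top(Infinity);
  const items = d3.select("#list").selectAll("li").data(visible, (d: Row) => d.name);
  items.enter().append("li");
  items.exit().remove();
  d3.select("#list").selectAll("li").text((d: Row) => `${d.year}: ${d.name}`);
}

// Dragging the slider "plays back history": each movement adjusts the filter
// and the view updates immediately, with no server round trip.
const slider = document.querySelector<HTMLInputElement>("#year")!;
slider.addEventListener("input", () => {
  const year = Number(slider.value);
  byYear.filterRange([year - 5, year + 5]); // a ten-year window around the handle
  redraw();
});
```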

Browser

Techy alert: Palladio is a purely client-side, browser-based application. The only involvement of a server is to deliver the HTML, Javascript, and CSS files that comprise Palladio. We arrived at this design through a few iterations, but the motivation was that we wanted to be cross-platform, super-easy to install, and still support pretty big data sets and fluid interactions. Ten years ago this would have been nearly impossible, but today we have web browsers that, amazingly, are capable of supporting these types of applications. Yes, browsers can now dynamically filter, aggregate, and display tens or hundreds of thousands of rows of data, rendering hundreds of data points simultaneously in SVG, or thousands if you use canvas instead.
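
If that sounds optimistic, it's easy to check for yourself. Here's a rough sketch, again assuming Crossfilter is loaded as a global and using synthetic data, that builds 100,000 rows, aggregates them, and re-filters them; on a reasonably current machine this kind of operation comes back quickly enough for interactive use.

```typescript
declare const crossfilter: any;

// 100,000 synthetic rows, standing in for a decent-sized research dataset.
const rows = Array.from({ length: 100000 }, (_, i) => ({
  year: 1700 + (i % 200),
  place: "place-" + (i % 50),
  count: 1 + (i % 7),
}));

const cf = crossfilter(rows);
const byYear = cf.dimension((d: { year: number }) => d.year);
const byPlace = cf.dimension((d: { place: string }) => d.place);

// Aggregate: total count per place, respecting filters on other dimensions.
const countPerPlace = byPlace.group().reduceSum((d: { count: number }) => d.count);

console.time("filter + aggregate");
byYear.filterRange([1750, 1800]);     // narrow to a fifty-year window
const top10 = countPerPlace.top(10);  // the ten most active places in that window
console.timeEnd("filter + aggregate");
console.log(top10);
```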

The time for client-side data visualization in the browser has come, and we are taking advantage of that in a big way. A great strength of browser-rendered visualizations is that they allow true interaction with the visualization. Just using SVG or Canvas as a nicer replacement for static images is fine, but it isn't fully exploiting the medium. Add to this that the type of interactivity we are providing with Palladio is technically impossible in a client-server setup: even if the server responds to queries instantaneously, the round-trip time introduced by client-server communication means that interactions won't feel as tightly linked as they do in Palladio, severely degrading the quality of the interactive experience.

Admittedly, we have work to do on performance and our cross-browser support could be better. Additionally, the problem of data that simply doesn't fit in the browser's memory remains unaddressed, though we have some ideas for mitigating it. But I think this is an application design approach that could be exploited for the vast majority of data sets out there, either because the data itself is relatively small, or through judicious use of pre-aggregation to avoid performance and size issues.
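
On the pre-aggregation point, here's a minimal sketch of what I mean, with invented types and names: collapse the raw rows to the granularity the visualization actually needs before they ever reach the browser.

```typescript
// Preprocessing-time sketch: collapse raw events to one row per (year, place)
// so the browser never has to hold the full detail.
interface RawEvent { year: number; place: string; }
interface Aggregate { year: number; place: string; count: number; }

function preAggregate(events: RawEvent[]): Aggregate[] {
  const buckets = new Map<string, Aggregate>();
  for (const e of events) {
    const key = `${e.year}|${e.place}`;
    const existing = buckets.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      buckets.set(key, { year: e.year, place: e.place, count: 1 });
    }
  }
  // Millions of raw events become at most (#years x #places) rows, which is
  // usually small enough to filter and chart entirely client-side.
  return Array.from(buckets.values());
}
```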

Design

Lastly, user experience and information design have been integral components of this project from the start. The design has been overhauled several times along the way, and I wouldn't be at all surprised if it happened again. To be clear, I'm a complete design newb, but we have a great designer working on the team. One thing that has become clear to me through this process is that designing a general purpose interactive visualization tool is hard. There are more corner-cases than I previously imagined possible, but we are trying to get the big pieces right and I think we're on the road to success.

Obviously the organizational dynamics on a small team like ours are very different from those in a big development organization, but it seems like information design on most of the enterprise data exploration tools from larger vendors either started out problematic and stayed that way, or started out pretty well and slipped as the tool took off. I'm not sure there is an answer to this, but it's clear that when building a tool in this space, having at least one information designer with a strong voice on the team is indispensable.

Let me sum up

So, that's the Palladio project, as well as a few takeaways that I feel can be applied back to pretty much any data exploration project. In closing, I'll just mention that none of this would be possible without some great open source projects that we use and, most importantly, without the great team working on this as well as the feedback and patience of our dedicated beta participants. The central Javascript libraries we used to pull this off are d3.js, Crossfilter, and Angular.js. The members of the core team are Nicole Coleman (our fearless leader), Giorgio Caviglia (design and visualization), Mark Braude (documentation, testing, working with our beta users, and project management), and myself (technical implementation). Dan Edelstein is the principal investigator.

It's been a great ride so far and we've got some exciting things planned for the rest of the year. This is definitely a work in progress, and feedback is very welcome. Follow the Humanities + Design lab on Twitter for updates.

ABAP development - the dream and the reality

Late last week I did a podcast with Jon Reed and Graham Robinson on SAP's developer engagement initiatives. You can watch that here.

This discussion, and the fact that I'm back doing some serious ABAP development after a lot of non-ABAP work, got me thinking about the value in SAP's application ecosystem that developers would love to add to, to SAP's gain. But it's very difficult to add value in this ecosystem as a developer because of (among other things) the weaknesses of the ABAP development environment. What, I asked myself, would a truly strong ABAP development environment look like? Here are a few thoughts:

  • ABAP system as a decentralized version control system (DVCS) node - An ABAP system should act as a true DVCS node. This is a good explanation of the nirvana you too could experience with a DVCS, with pictures of a cute dog. In short, an ABAP system should, at the package level or at the full system level, be able to commit changes, push those changes to a remote repository, pull changes from a remote repository, maintain branches, and merge (with user input if necessary) those branches. These descriptions use Git parlance, because that is what I know.
  • Branch and merge - Related to behaving as a DVCS, the ABAP system should be able to support full branch & merge semantics. I'm not sure how well I can explain this. I'll just tell you, it's a great feeling to have the ability to create a branch of a complex project, rip out an integral component, completely rewrite it, merge those changes back into the main development branch, and have none of the other 3 people working on closely related parts of the project notice any change except that things just work better. I know. I did it last week. Oh, and I simultaneously made fixes as they were requested on other branches of the project and pushed those out to my co-workers while I was working on that rip-and-replace. Good luck pulling that off on an ABAP development system.
  • Namespaces - SAP should make namespaces available to open source projects for free, with minimal bureaucracy. And while you're at it, just do the same for everyone. What do I mean by "minimal bureaucracy"? A web form and account system that saves the namespace to a database after checking that no one else has already registered the same namespace. Why are namespaces important? They address the problem of naming collisions between development projects. Right now, the open source ABAP community mostly develops in the customer namespace. This is a recipe for naming-collision hell. But what other options do people have? SAP should fix this, and fast. And while you are at it, make the namespaces versioned (see below).
  • Improve SAPlink & ZAKE, and include them in the standard distribution - SAPlink and ZAKE are the de facto standard ways to develop and distribute open source ABAP code. They are massive improvements over the status quo, but they aren't very widely installed, and they are not perfect. If SAP wants a healthy ABAP developer community, I'd recommend that SAP join these projects, contribute to them, and make them awesome. Then include (up-to-date) versions of them in the standard Netweaver ABAP distribution.
  • Speaking of up-to-date, implement ABAP package and dependency management - Maven is kind of a nightmare, but it works. Let's say I want a mock/fake library like MockA (highly recommended) to use in my ABAP Unit tests. I go out and find it on Github, install SAPlink (which doesn't work properly, so I have to go back and activate everything), download the .slnk file, install that using SAPlink (which doesn't properly install some interfaces, so I have to reinstall it), and finally I can use it in my project. But I find a bug! So I report that bug to the maintainer, who is awesome and fixes it in less than a day. My reward? I get to do the whole download, install, install all over again. We should just be able to list library versions as dependencies of our packages, classes, and test classes, and the system should handle downloading and installing them (see the sketch after this list). But this will result in organizational chaos, you say. No, it won't. Because everyone will be using namespaces, and those namespaces will be versioned, so we can have more than one version of a library installed and usable at a time.
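
To be concrete about those last two points, here's a purely hypothetical sketch of what a versioned dependency manifest could look like. It's written as a TypeScript-style data structure only because ABAP has no manifest format to borrow from; every name and field below is invented. The idea is that a package declares its dependencies by versioned namespace, and the tooling resolves, downloads, and installs them, allowing several versions to coexist.

```typescript
// Hypothetical manifest for an open source ABAP package. Nothing like this
// exists today; the shape and the names are invented for illustration.
interface AbapPackageManifest {
  namespace: string;                  // registered via the imagined web form, e.g. "/MYPROJ/"
  version: string;                    // version of this package
  dependencies: {
    namespace: string;                // the dependency's registered namespace
    version: string;                  // the version this package was built against
  }[];
}

const myProject: AbapPackageManifest = {
  namespace: "/MYPROJ/",
  version: "1.2.0",
  dependencies: [
    { namespace: "/MOCKA/", version: "1.0.3" }, // hypothetical namespace for the mocking library
    { namespace: "/UTILS/", version: "2.1.0" },
  ],
};

// Because namespaces are versioned, two packages can depend on different
// versions of /UTILS/ without colliding: an installer could materialize them
// as /UTILS_2_1/ and /UTILS_3_0/ instead of overwriting one another.
function installedName(dep: { namespace: string; version: string }): string {
  const [major, minor] = dep.version.split(".");
  return dep.namespace.replace(/\/$/, `_${major}_${minor}/`);
}

console.log(installedName({ namespace: "/UTILS/", version: "2.1.0" })); // "/UTILS_2_1/"
```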

Right now it's a bit frustrating to switch between ABAP and non-ABAP development. The tooling in the Javascript and Java ecosystems is so far superior and so much more concerned with developer productivity that it begins to be painful to return to ABAP. I can't imagine what it would be like to come to the ABAP ecosystem with no background.

This is a shame, because ABAP is a nice language in a lot of ways, and the business applications accessible through it are an absolute goldmine of information and business process execution logic that loads of developers would love to add value to, creating a great deal of value for SAP in the process, if only they could be productive on the platform. I'd love to see SAP take ownership of the developer experience and make the ABAP ecosystem flourish.

Gullible finance?

I own AAPL, and I have no idea what its actual valuation should be, but when I saw this article about an analyst who thinks AAPL should be valued at $240 coming in over the Daring Fireball feed, I couldn't help but look.

What could this guy possibly be thinking? What are the financial reporters that give him a platform thinking? I'll admit, I couldn't help it. I don't know much about this stuff, but I dug in a little bit to try to figure it out.

Of course, the fact that David Trainer's PDF calculation pegs Apple's Net Operating Profit After Tax (NOPAT) for 2012 at under $11 billion is a little fishy, since it was actually around $41 billion. Turns out, the analysis is based on a "what if" scenario assuming that Apple had a Return On Invested Capital (ROIC) for 2012 of 70% and for 2013 (to date) of 52%. These drive a calculated Economic Book Value per Share of around $240, which is apparently Trainer's target.

Of course, Apple didn't have these ROICs, by Trainer's own calculations, so it's a bit foolish to run a "what if" scenario on them. The actual ROIC for 2012, shown here, was 271%. 270% vs. 70%, but who's counting? God knows how Trainer is calculating the "Total Capital" denominator in the ROIC calculation; it looks pretty close to Working Capital. For 2013 I'd expect the ROIC to drop somewhat versus the 2012 level even if Net Income remains constant, because dividends have gone up and Total Capital should as well, according to Horace Dediu's analysis on Asymco. But for ROIC to drop to 52% by Trainer's calculations would take a truly massive drop in income or a much larger spike in capital expenditure than is indicated by Apple's 10-K.
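
For what it's worth, here's one way to make the numbers hang together. I'm assuming (my assumption, not anything Trainer states) that the what-if simply holds invested capital fixed at the level implied by his own figures and scales NOPAT down to the assumed ROIC:

```typescript
// All figures in billions of dollars, taken from the discussion above.
const actualNopat2012 = 41;    // Apple's actual 2012 NOPAT, roughly
const actualRoic2012 = 2.71;   // 271%, by Trainer's own calculation
const whatIfRoic2012 = 0.70;   // the 70% ROIC assumed in the what-if scenario

// Invested capital implied by Trainer's own numbers: NOPAT / ROIC.
const impliedCapital = actualNopat2012 / actualRoic2012;  // about 15.1

// Hold that capital fixed, impose a 70% ROIC, and NOPAT shrinks to:
const whatIfNopat = whatIfRoic2012 * impliedCapital;      // about 10.6

console.log(impliedCapital.toFixed(1), whatIfNopat.toFixed(1)); // prints: 15.1 10.6
```

Roughly $10.6 billion, which at least lines up with the sub-$11 billion NOPAT in the PDF.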

But really, I don't know anything about this stuff. Don't take my word for it. Take Trainer's. Given Apple's actual ROICs, what does David Trainer's analysis indicate the Economic Book Value per Share is today? Well, he conveniently cuts off this PDF one line before he reveals that value, but if we do a little math based on the Price to Economic Book Value per Share ratio, we get:

$443 / 0.56 = $791

(Don't worry, I've saved the PDFs in case Trainer decides he needs to cut off a little bit more of the unadjusted calculation.)

Now, of course, there is a decent chance that Apple's ROIC will decline to something a little closer to that of its rivals over the next several years, but it's exceedingly unlikely it will fall off a cliff in 2013 and hit 52%. And it's a bit more than disingenuous to retroactively assume that a company reported different financial performance than it actually did.