A Critique - mistakes comparing architectures for big data

Last fall, I was delighted to see a blog on SDN titled "Parallel DBs faster than Google's MapReduce". Finally someone was seriously looking at the issue! My delight quickly turned to dismay as I read the blog and the paper it referenced: "A Comparison of Approaches to Large-Scale Data Analysis" (1).

"Something must be horribly wrong", I thought, "because this paper is saying that a Google search should take minutes or hours, but I'm quite sure that when I go to google.com on a fast internet connection the search takes less than a second." Hey, what can I say? I've always liked proof by contradiction.

This led me to the immediate conclusion that the blog was bogus, and to an ongoing suspicion of SAP's marketing efforts around "in-memory", which has surfaced in the occasional Twitter outburst or blog. But it's time for a more detailed analysis of what actually went wrong with the paper's conclusions and the blog's inferences and, more generally, to review some mistakes that are often made in this area as I develop an approach for analyzing platforms for large-scale data warehousing, BI, and analysis.

Comparing apples and filesystems

The fundamental mistake of the paper (far more pronounced in the blog) is to compare two approaches to data analysis at completely different levels of the stack. It compared the MapReduce approach (specifically HDFS, the Hadoop Distributed File System) to DBMSes optimized for data-heavy analytical workloads. The fact that the "FS" in HDFS stands for "File System" should have tipped off the researchers that they should have been comparing HDFS to distributed filesystems, not to distributed DBMSes.

The valid parallel to a distributed DBMS in the MapReduce ecosystem is Google's Bigtable or HBase, the Hadoop database, which stores its data in HDFS just as DBMSes store their data in files on a local or distributed filesystem. The researchers can perhaps be forgiven for their mistake, as HBase was relatively immature at the time the paper was published, but I would have expected at least some mention of Bigtable, since the Bigtable paper (2) had been out for a while by that point.
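
To make the layering point concrete, here is a toy sketch in Python. Nothing in it is the real HDFS, HBase, or Bigtable API: ToyFileSystem and ToyTableStore are hypothetical names I made up for illustration, and the tab-separated log format is an assumption chosen for brevity. The only point is that a file layer deals in paths and uninterpreted bytes, while a database layer built on top of it adds rows, columns, and lookups.

```python
# Toy illustration only: these classes are hypothetical and are NOT the
# HDFS, HBase, or Bigtable APIs.

class ToyFileSystem:
    """File-system layer: knows only paths and uninterpreted bytes."""

    def __init__(self):
        self._files = {}

    def append(self, path, data):
        # Append raw bytes to the named file, creating it if needed.
        self._files[path] = self._files.get(path, b"") + data

    def read(self, path):
        # Return the whole file as bytes (empty if it does not exist).
        return self._files.get(path, b"")


class ToyTableStore:
    """Database layer: adds rows, columns, and lookups on top of files."""

    def __init__(self, fs, path):
        self._fs = fs
        self._path = path

    def put(self, row_key, column, value):
        # Persist each cell as one tab-separated log line in the file layer.
        # Assumes keys, columns, and values contain no tabs or newlines.
        line = f"{row_key}\t{column}\t{value}\n"
        self._fs.append(self._path, line.encode("utf-8"))

    def get(self, row_key, column):
        # Replay the log; the last write for a cell wins.
        result = None
        for line in self._fs.read(self._path).decode("utf-8").splitlines():
            key, col, val = line.split("\t")
            if key == row_key and col == column:
                result = val
        return result


fs = ToyFileSystem()
table = ToyTableStore(fs, "/tables/pages.log")
table.put("com.example/index", "title", "Example")
table.put("com.example/index", "title", "Example Domain")
```

Asking the file layer a database-style question ("what is the title of this row?") means scanning bytes, which is roughly what happens when a filesystem is benchmarked against a DBMS. The table layer is where indexing and other query optimizations would live, so that is the layer a fair comparison should target.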

Relative maturity

The second mistake was to take two solutions at highly disparate levels of maturity as representative of two approaches. HDFS is almost certainly less mature than the equivalent internal Google tool GFS (the Google File System).

Similarly, we can be quite certain that HBase is significantly less mature than Google's Bigtable. Performance results for HBase 0.19 lag the results reported in Google's Bigtable paper (2) by a large margin. (HBase 0.19 is based on the version of Hadoop/HDFS that the researchers tested against.) Meanwhile, we now have performance results from HBase 0.20 indicating a large performance improvement, though they still lag the originally reported Bigtable results.

I take this quick improvement in HBase performance, as well as the continuing lag behind the Bigtable results, to indicate that there is a lot of room for performance improvement in open source columnar database solutions based on MapReduce architectures. There are reasons for this belief beyond simple trend lines, such as the fact that HBase doesn't yet implement the well-understood performance-improving architectures for columnar databases described in the linked performance evaluation of HBase 0.20.

When the goal is to compare basic architectural approaches, solutions should be handicapped according to their relative maturity, so as to avoid discounting advantageous architectural patterns that lag because they are not yet highly optimized.

Eliding the workload issue

Lastly, the paper draws conclusions about "Large-Scale Data Analysis" based on workloads consisting primarily of text parsing and searching. Admittedly, a good chunk of large-scale data analysis does fall into the text realm, but I can vouch that there is a lot of large-scale data analysis going on out there that is very numerical in nature.

Other papers (3, 4, 5) have notably shown that differences in workloads can result in huge differences in optimal architectures. As such, I think that at best (disregarding the other two issues I've outlined), this paper can be promoted as a comparison of approaches to large-scale text data analysis.

(As an aside, I'm a bit perplexed that the primary paper I am referring to has as its last author the first author of the papers referenced as 3, 4, and 5 below, yet proceeds to make the very mistake that those earlier papers warned against.)

I think that about sums it up.

References

  1. A. Pavlo, E. Paulson, A. Rasin, D. Abadi, D. DeWitt, S. Madden, M. Stonebraker. "A Comparison of Approaches to Large-Scale Data Analysis", SIGMOD'09, June 29-July 2, 2009, Providence, Rhode Island, USA. Copyright 2009 ACM 978-1-60558-551-2/09/06
  2. F. Chang, J. Dean, S. Ghemawat, et al. "Bigtable: A Distributed Storage System for Structured Data", OSDI '06. http://labs.google.com/papers/bigtable.html
  3. M. Stonebraker, S. Madden, D. Abadi, S. Harizopoulos, N. Hachem, P. Helland. "The End of an Architectural Era (It's Time for a Complete Rewrite)", VLDB '07, September 23-28, 2007, Vienna, Austria. Copyright 2007 VLDB Endowment, ACM 978-1-59593-649-3/07/09.
  4. M. Stonebraker and U. Cetintemel. "One Size Fits All: An Idea whose Time has Come and Gone." In Proc. ICDE, 2005.
  5. M. Stonebraker, C. Bear, U. Cetintemel, M. Cherniack, T. Ge, N. Hachem, S. Harizopoulos, J. Lifter, J. Rogers, and S. Zdonik. "One Size Fits All? - Part 2: Benchmarking Results." In Proc. CIDR, 2007.