Why in-memory doesn't matter (and why it does)

Well, that title was going to be perfect flame-bait, but then I went all moderate and decided to write a blog that actually matters. So here's the low-down:

There's a lot of talk lately about in-memory and how it's the awesome. This is especially true in the SAP-o-sphere, primarily due to SAP's marketing might getting thrown behind Business Warehouse Accelerator (BWA) and the in-memory analytics baked into Business ByDesign.

I'm here today to throw some cold water on that pronouncement. Yes, in-memory is a great idea in a lot of situations, but it has its downsides, and it won't address a lot of the issues that people are saying it addresses. In the SAP space, I blame some of the marketing around BWA. In the rest of the internet, I'm not sure if this is even an issue.

Since I've actually done a fair amount of thinking about these issues (and as a result I troll people on Twitter about it), I thought maybe it'd be helpful if I wrote it down.

So let's get down to brass tacks:

How in-memory helps

In short: it speeds everything up.

How much? Well, let's do the math:

Your high-end server hard drive has a seek time of around 2 ms. That's 2*10^-3 seconds (thanks Google). Yes, I'm ignoring rotational latency to keep it simple.

Meanwhile, fast RAM has a latency measured in nanoseconds. Let's say 10ns to keep it simple. That's 10^-8 seconds.

So, if I remember my arithmetic (and I don't), RAM is about 2*10^5, or 200,000 times faster than hard disk access.
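For anyone who trusts my arithmetic as little as I do, here is a minimal sketch that redoes the division with the same round numbers. The object and value names are made up for illustration, and everything is kept in whole nanoseconds so no floating-point rounding sneaks in.

```scala
object LatencyMath {
  // Round numbers taken from the text above, expressed in nanoseconds
  // so the division stays in exact integer arithmetic.
  val diskSeekNanos: Long = 2000000L // 2 ms disk seek
  val ramLatencyNanos: Long = 10L    // 10 ns RAM latency

  // How many RAM accesses fit into the time of one disk seek.
  val speedup: Long = diskSeekNanos / ramLatencyNanos // 200000
}
```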

Keep in mind that RAM is actually faster because the CPU-memory interface usually supports faster transfer rates than the disk-CPU interface. But then, hard disks are actually faster because there are ways to drastically improve overall access performance and transfer rates (RAID, iSCSI? - not really my area). Point is, RAM helps your data access go a lot faster.

But ... er ... wait a second (or several thousand)

So here I am thinking, "Well, we're all fine and dandy then. I just put my job in RAM and it goes somewhere between 100,000 and 1,000,000 times as fast. Awesome!".

But then I remember that RAM isn't a viable backing store for some applications, like ERPs (no matter what Hasso Plattner seems to be saying) or any other application where you can't lose data, period. Yes it can act as a cache, but your writes (at least) are going to have to be transactional and will be constrained by the speed of your actual backing store, which will probably be a database on disk.
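To make the cache point concrete, here is a rough sketch of a write-through cache. All of the names (BackingStore, CountingStore, WriteThroughCache) are hypothetical and the "store" is just a map that counts how often it gets touched, so treat this as an illustration of the shape of the problem rather than how any real ERP or database does it.

```scala
import scala.collection.mutable

// Hypothetical stand-in for a durable store such as a database on disk.
trait BackingStore {
  def write(key: String, value: String): Unit
  def read(key: String): Option[String]
}

// Toy store that counts how often it is touched.
class CountingStore extends BackingStore {
  private val data = mutable.Map.empty[String, String]
  var writes = 0
  var reads = 0
  def write(key: String, value: String): Unit = { writes += 1; data(key) = value }
  def read(key: String): Option[String] = { reads += 1; data.get(key) }
}

class WriteThroughCache(store: BackingStore) {
  private val cache = mutable.Map.empty[String, String]

  // Every write pays the cost of the durable store before it is cached.
  def put(key: String, value: String): Unit = {
    store.write(key, value)
    cache(key) = value
  }

  // Reads are served from memory when possible, and only fall back
  // to the store on a cache miss.
  def get(key: String): Option[String] =
    cache.get(key).orElse {
      val loaded = store.read(key)
      loaded.foreach(v => cache(key) = v)
      loaded
    }
}
```

Reads of cached keys never touch the store, but every single put does, which is why the write path stays tied to the speed of the disk.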

And then I see actual benchmarks meant to reflect the real world like this. For those who won't click the link, the numbers are a bit hard to read, but I'm seeing RAM doing about 10,000 database operations in the amount of time it takes a hard disk store to do about 100. That's only a 100x speedup.

Ok, now I'm back down to earth and I'm thinking, "I just put my job in RAM and I'll get maybe a 50-100x speedup but at the cost of significant volatility". (I'm also thinking that SAP's claimed performance improvements of 10x - 100x sound just about like what we'd expect.)

This is still really really good. It makes some things possible that were not possible before and it makes some things easy that used to be hard.

And finally, why in-memory doesn't matter

But really, what is the proportion of well-optimized workloads in the world? How often are people going to use in-memory as an excuse to be lazy about solving the actual underlying problems? In my experience, a lot. Already we are hearing things along the lines of, "The massive BW query on a DSO is slow? Throw the DSO into the BWA index." [Editor's note: A DSO is essentially a flat table. Also, the current version of BWA doesn't support direct indexing of DSOs, but it probably will soon, along with directly indexing ERP tables.]

Now's the part where we who know what we're doing tear these people to shreds and tell them to implement a real Information Lifecycle Management system and build an Inmon-approved data warehouse using their BW system (BW makes it relatively easy). Then that complex query on a flat table that used to take two days of runtime will run in 30 seconds.

Well, that would be one approach, but frankly most people and companies don't have the time or the organizational maturity in their IT function to pull this off. And in this world, where people have neither the time nor the business process for this sort of thing, it starts to make sense to spend money on the problem instead, and something like BWA is a great thing in this context.

But it's not great because it's in-memory. It's great because it takes your data - that data you haven't had the time to properly build into a data warehouse with a layered and scalable architecture, highly optimized ROLAP stores, painstakingly configured caching, and carefully crafted delta processes - and it compresses it, partitions it, and denormalizes it (where appropriate). Then, as the icing on the cake, it caches the heck out of it in memory.

Let's be clear: BW already has in-memory capabilities. Livecache is used with APO, and the OLAP cache resides in memory. The reason BWA matters is not that it is in-memory. It matters because it does the hard work for you behind the scenes, and partially because of this it is able to use architectural paradigms like column-based stores, compression, and partitioning that deliver performance improvements for certain types of queries regardless of the backing store.
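Here is a toy illustration of why a compressed column can help no matter where the data lives. It run-length encodes a single column and then answers a count query by walking the runs instead of the rows. This is only a sketch with made-up names (ColumnSketch, runLengthEncode, countWhere); I am not claiming it resembles what BWA actually does internally.

```scala
object ColumnSketch {
  // Run-length encode a column: consecutive equal values collapse to (value, count).
  def runLengthEncode[A](column: Seq[A]): List[(A, Int)] =
    column.foldLeft(List.empty[(A, Int)]) {
      case ((value, count) :: rest, next) if value == next => (value, count + 1) :: rest
      case (acc, next) => (next, 1) :: acc
    }.reverse

  // Count rows matching a value by scanning runs instead of individual rows.
  def countWhere[A](runs: List[(A, Int)], wanted: A): Int =
    runs.collect { case (value, count) if value == wanted => count }.sum
}
```

A column of country codes like DE, DE, DE, US, US, DE shrinks to three runs, and counting the DE rows touches three entries rather than six. The win comes from the layout, not from where the bytes are stored.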

In-memory is great, and fast, and should be used. But in most ways that are really important, it doesn't matter all that much.

What's the deal with JAX-RS and Lift?

There has been some talk lately (from the Java performance maven @kohlerm and others) about JAX-RS in the context of the API of a Lift application - specifically ESME. JAX-RS is a Java annotation framework for programming RESTful web services (APIs for the non-enterprisey out there).

The real goal of the talk about JAX-RS with regard to ESME (as I see it, and I'm not the only point of view on this by a long shot) is to create an API that is as RESTful as reasonably possible, and which (for most resources) is indistinguishable from a JAX-RS implementation from the perspective of a client consuming the API.

It appears at first glance that the easiest way to achieve the goal of indistinguishability from a JAX-RS implementation is to do a JAX-RS implementation. I'm not convinced it is so straightforward.

Here are the requirements for a platform for implementing an HTTP API for ESME, in my view:

  1. Access to ESME instance objects as instantiated in the Lift application (preferably we are not using the database as the way that the API and the application communicate, especially since ESME doesn't necessarily even have a DB in my understanding)
  2. Respect the ESME security architecture (meaning that the API must act as a specific user)
  3. Support delta/streaming collections over HTTP in addition to REST-only resources and collections

Meanwhile, I've found a few instances of people attempting to use JAX-RS in the context of a Lift application. Unfortunately, I don't think they actually meet these requirements.

Here (http://blogs.sun.com/sandoz/entry/using_scala_s_closures_with) we have a discussion of how we can implement JAX-RS provider classes in Scala instead of in Java. This is, admittedly, very attractive, but I'm not seeing a way to import the entire Lift application context into these provider classes. Or rather, enough of the context to satisfy my requirements 1 & 2 above.

I had a fleeting moment of hope when I saw lift-jersey and this thread where we have James Strachan writing a Lift module that appears to allow us to use the Lift templating language within a JAX-RS implementation in Scala. However, it looks to me like this only allows using the Lift templating language, and by and large replaces the Lift request-handling stack with JAX-RS. This is sort of like what we want to do, but not really, and I think we're going to have the same struggles with integrating the Lift application itself that we would have had with the first example.

Where we'll really start running into trouble, even if we can satisfy requirements 1 & 2, is in requirement 3. In ESME, we have some resource collections that need to be provided in a "delta" or "streaming" format to our clients, but which also need to provide more orthodox RESTful HTTP interfaces. JAX-RS doesn't appear to be very friendly to the delta/streaming problem-space, so we will actually be forced to implement the streaming parts of these resources in Lift/Scala directly. This means that we can't just cordon off a portion of the URI-space of the ESME application to be served by Jersey or another JAX-RS container, outside of the context of the Lift application. The two containers would need to be very closely intertwined in the URL-space of the ESME application. In other words, we'd need to pattern-match before the request even hits Jersey and JAX-RS, which kind of defeats the purpose.

So, that little exploration was relatively fruitless, but maybe this blog will attract some comment setting me straight on how to do this.

Meanwhile, I had the thought "What's so great about annotations anyway?" My first impression is that there is nothing so great about annotations. First impressions are usually wrong, but it got me thinking along the lines that what JAX-RS does with annotations is suspiciously similar to what the Lift request dispatcher does with pattern matching. Here, let's loosely borrow an example from the lift-jersey Github page.

import javax.ws.rs.{GET, Path}

// JAX-RS resource: the annotations bind GET requests on this path to view().
@Path("/resourceReturningTemplateView")
class ResourceReturningTemplateView {
  @GET
  def view() = <xml_container>Some Text</xml_container>
}

So, all this says is that when we receive a GET request to the path "/resourceReturningTemplateView" in our application, we return the result of the view() method, which in this case is a literal XML element.

What does this look like in native Lift? Well, ignoring the other Lift incantations that need to be done (which are really just one line in Boot.scala, a class definition, and an implicit function that converts from a NodeSeq or Elem to a LiftResponse), all that's required is something like

def dispatch: LiftRules.DispatchPF = {
  case Req("resourceReturningTemplateView" :: Nil, _, GetRequest) =>
    <xml_container>Some Text</xml_container>
}

Lift request matching appears to be pretty powerful, so I started looking at whether it can cover the use cases of JAX-RS. The upshot is that I think it does pretty much everything we need, so I'd prefer to stick with the Lift pattern matching over JAX-RS annotations for the time being. Some things may be a bit more complicated in Lift, like the @Consumes annotation and the Allow header value, or direct handling of form fields as specified in the @FormParam annotation. However, I think these are all doable and just require good patterns be developed.
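To give a feel for what such patterns might look like, here is a sketch in plain Scala with no Lift dependency at all. The Request and Response case classes, the "notes" resource, and the handle function are all invented for this example; they are not Lift's Req or LiftResponse. The idea is just to show that the jobs done by @Path, @GET, @Consumes, and the Allow header can be expressed as ordinary cases in a partial function.

```scala
object DispatchSketch {
  // Hypothetical request/response types for illustration; these are NOT Lift's classes.
  case class Request(path: List[String], method: String, contentType: Option[String] = None)
  case class Response(status: Int, body: String = "", headers: Map[String, String] = Map.empty)

  val dispatch: PartialFunction[Request, Response] = {
    // Rough equivalent of @GET on @Path("/notes/{id}")
    case Request("notes" :: id :: Nil, "GET", _) =>
      Response(200, s"<note id='$id'/>")

    // Rough equivalent of @POST with @Consumes("application/xml")
    case Request("notes" :: Nil, "POST", Some("application/xml")) =>
      Response(201)

    // POST with any other content type: 415 Unsupported Media Type
    case Request("notes" :: Nil, "POST", _) =>
      Response(415)

    // Any other method on the collection: 405 with an Allow header
    case Request("notes" :: Nil, _, _) =>
      Response(405, headers = Map("Allow" -> "POST"))
  }

  // Requests that match no case fall through to 404.
  def handle(req: Request): Response =
    dispatch.applyOrElse(req, (_: Request) => Response(404))
}
```

The order of the cases matters here: the specific content-type match has to come before the catch-all POST case, which is exactly the kind of pattern that would need to be written down once and reused.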

Final thought: I'm pretty sure that JAX-RS isn't really an API for RESTful web services. It's an API for pattern-matching HTTP requests in a language (Java) that doesn't have native pattern-matching.

Final disclaimer: No actual code was harmed, or tested for that matter, in the writing of this blog. In other words, that code up there probably doesn't work, and that's nobody's fault but my own.

Statistical misunderstandings and Google Book Search

Well, there I was, chatting away with (hopefully still) a friend about configuring a Twitter search widget for a blog and up comes the Google Books meta-data topic, in the form of a link to Google Books: A Metadata Train Wreck. This is the kind of article that really gets at me. Which is to say that it is an article holding a position that I mostly agree with in principle, but which does the position the disservice of making pretty questionable arguments.

So with that, I'll lay out two (or 5) things that really get my goat when people start talking about large data sets and meta-data.

The misunderstanding about the whole point

There seems to be a misunderstanding between the scholarly community and Google about what Google Books actually is. I think it is this misunderstanding that leads to claims like the one about "miscategorization" of translators as authors. It seems clear to me that the Google Books team made a conscious decision to put translators and authors into the same search field. As such, it's not an error so much as it is a semantic disagreement. I'm willing to bet that Google maintains author and translator metadata separately on the backend and just concatenates the fields for the purposes of searching.

In brainstorming, I came up with a few things Google Book Search and the whole library digitization project might be, from Google's perspective:

  1. A vehicle for advertising and referral revenue for Google.
  2. A project to create a dataset that Google can use to train its translation and semantic knowledge engines.
  3. A dumb idea that will never help Google and will eventually be abandoned.

It's possible that Google really conceives the project as at least partially having the goal of enabling scholarship, but I think #2 is probably the real goal, having watched how Google works for about a decade now. If I'm right about this, then Google Book Search is just a cover; a way of making the data gathered from libraries publicly available. If #1 is right, then having the author and translator in the same field is exactly the right thing to do, since lots of people will mistakenly search for the translator instead of the author.

Would it be good for Google to provide a scholarly interface to Google Books metadata that splits out translator and author? Yes. Does Google have a responsibility to provide this interface? I don't see how.

The misleading or useless appeals to statistics and data

The second point is perhaps the more important one: There is often a misunderstanding or ignorance of statistical methods and the type of weird stuff that happens within large data-sets. I see this occasionally when the scholarly/library community blogs about Google Books, but I see it in a lot of other places too. This (I assume) ignorance results (again, I assume) in a belief that appeal to common sense or personal experience is a valid argumentative technique when large data sets are in play. From experience with large data sets, I have a tendency to believe that common sense and personal experience are worse than useless when large datasets are involved. In any case, this belief results in two major problems and we see examples of both in the above-linked blog.

Statistical problem #1: Claims about the number of errors in a particular set of meta-data without any comparison to existing meta-data sets that might give us a baseline of the number of errors we may expect. For example, if there are 2 errors of type X per 1000 records in Google Books, that doesn't tell me anything unless I also know that there are 0.3 errors of type X per 1000 records in the UC library system. That tells me Google Books has some catching up to do. On the other hand, maybe there are 5 errors per 1000 in the UC library system, in which case Google Books is doing pretty good. The blog (and most of these sorts of blogs) fails to give any useful metric for me to compare against.

Statistical problem #2: The sample size is usually left out. If there are 300 errors of type X in Google Books and only 100 errors of type X in my local public library ... well that probably means that Google Books is 100x better than my local public library because it's got a lot more books in it. But I can't tell because no one bothers to say how many books are in Google Books and how many are in my local public library.

In summary: A metric like "572 errors of type X" is useless because I don't know the sample size or have anything to compare to. "2 errors of type X per 1000 records in Google Books as opposed to 0.4 errors of type X per 1000 records in the Harvard University library system" on the other hand is incredibly useful as it provides a basis for comparison and understanding.
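The normalization I'm asking for is a one-liner. Here is a minimal sketch; the function name is made up, and the collection sizes in the usage note are purely hypothetical, since the whole complaint is that nobody reports the real ones.

```scala
object ErrorRates {
  // Errors per 1000 records: a raw error count only becomes comparable
  // once it is divided by the size of the collection it came from.
  def per1000(errors: Long, records: Long): Double = {
    require(records > 0, "records must be positive")
    errors.toDouble * 1000 / records
  }
}
```

With invented sizes of 1,000,000 records for the big collection and 50,000 for the small one, ErrorRates.per1000(300, 1000000) comes out lower than ErrorRates.per1000(100, 50000), even though 300 raw errors sounds worse than 100.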

To be completely clear

Again, I'm sympathetic to the desire for a great digitally accessible and searchable book repository. I'm sympathetic to the underlying concern about the devaluation of knowledge and education that drives a lot of these arguments (for example, this blog, which said friend later forwarded). But arguments made in an undisciplined and sometimes misleading manner actually do more to hinder the cause than to further it, at least in my mind.

The infrastructure coop and the web

(In which I take a moment to meander on a topic about which I know very little.)

In Doc Searls' recent writing on "The Ultimate Alignment" he writes on the disconnect between the interests of the companies that are building the web and the people who are using it. To summarize, companies have (or are supposed to have) their investors' best interests in mind rather than the best interests of their customers and users.

Theoretically, our efficient market system is supposed to align these two groups of interests. Or at least mediate between them. In theory, theory and practice are the same. In practice they tend to differ and one group of interests has a tendency to be subverted. When it comes to situations like the recent Facebook acquisition of Friendfeed or cable and telecom companies colluding with the recording industry to restrict peer-to-peer connections, the disconnect between the interests of users and investors comes into stark relief, and we see that users can be left out in the cold when interests conflict.

Customers have generally been getting the short end of the stick, or the pointy end if they're especially unlucky. This can, and probably should, be attributed to a lack of customer organization and a resultant lack of bargaining power; something that is starting to be addressed with varying success by the social web, and which Searls sees culminating in a real capability for direct customer negotiation, which he calls Vendor Relationship Management.

A somewhat alternative approach proposed by Dave Winer, which Searls writes about in the above linked article and to which he appears sympathetic, is the tale in which the customers seize the reins of power and re/self-organize companies into a sane and impeccably moral engine of economic progress, for the good of all. [Ed. Sometimes he has a hard time keeping the lid on the sarcasm, especially in entries written on airplanes. Must be the altitude.]

Despite the sarcasm, I'm incredibly sympathetic to this idea, so after all of that throat clearing, I'd like to think about how it would work. What Searls and Winer are suggesting is a customer-owned coop.

Coops are cooperatives

My understanding of a customer-owned coop is that it is a business which is wholly owned and capitalized by its customers and the purpose of which is to provide a specific service rather than to turn a profit. If additional capital beyond what is required to deliver the service is generated, that capital is returned to the customer/owners. If an existing customer would like to become an owner then they can invest the requisite amount in the coop that is required to receive an equal share. Usually there is some ownership incentive provided, such as the ability to make use of the coop's services, or a discount on those services.

There are lots of kinds of coops, from housing and utility coops with services available only to owners to grocery stores and book stores which operate as commercial entities and often provide only slim discounts to owners. Public companies and governments are arguably coops, though they are not usually acknowledged as such and are usually not strictly customer/user-owned. (Though this is usually fairly close to the truth in the case of democratically elected governments with an electoral process that is relatively uninfluenced by monetary concerns ... and we know there are plenty of those around. [Ed. This sarcasm is getting a tad out of control, isn't it?])

Coops are often formed to establish some vital piece of infrastructure that a community feels is not adequately provided by existing commercial or governmental entities. A community that is lacking a quality grocery store will sometimes establish one as a coop. A university community that feels its access to academic books is restricted or provided at far too high a price by a commercial book store might establish a coop bookstore, effectively buying itself a distribution channel that did not exist before. Same goes for utilities.

In this way of thinking, coops are usually focused on infrastructure, with a special focus on distribution infrastructure during the 20th century. Distribution of food, books, natural gas, or electricity.

Open source is a bit like a coop ... and how

Open source software development is like a coop. The buy-in is contribution, of code, testing, bug-wrangling, documentation, publicity, or community process. The ownership privileges are varied. Contributions of code usually buy the contributor partial ownership of the code-base and the accompanying control, depending on the license - a privilege that cannot be purchased in any other way. Same goes for contributions of documentation vis a vis the body of documentation. All other contributions buy a less well defined voice in the community process.

Open source projects often resemble coops in another manner in that they are usually established to fill a need that a community feels is not adequately met (or is not met at an appropriate price) by existing commercial offerings. Some of the most successful open source projects have been and continue to be focused on infrastructure.

Open source as barter coop and the problem therein

This ground has been pretty heavily trodden before, but the above description is of a coop established via barter, rather than exchange of more abstract monetary instruments and the signing of binding legal contracts. Importantly, open source projects are only open to ownership by individuals who are prepared to deal in a particular instrument of barter. Open source is a bazaar, but it is a bazaar that is only accessible to those who can pay the entrance fee in a particular currency. It is a bazaar frequented only by merchants and large distributors, not by individual shoppers. It has the best prices and the best goods, but good luck getting to them.

And therein, of course, is a problem with the open source cooperative model. The strength of many cooperatives is that any interested user can become an owner (and often does), and this is the model that Winer and Searls would like to see, it seems. But in the open source world this is not the case. In meat-space cooperatives, and bazaars for that matter, the answer is money. Anyone with money can buy a share in a coop grocery store. You don't necessarily need to be able to contribute broccoli or organic lamb to participate. 

Coops as cooperatives (and the trouble with staying that way)

Often coops seem to work best when they are providing services to a local community. This may be partially the case because it reduces the interest in joining a coop for reasons other than taking advantage of the primary service that a coop provides. This local focus can be used as a hedge against the possibility of a group of disinterested investors taking over the coop and turning it into a profit-making operation, divorced from a direct interest in providing a service to its customers and instead focused on providing a service to its owners (now two different groups of people). Voilà! Modern commercial entity.

For this reason I very much doubt that founding a new company and immediately IPOing, as Winer suggests, would accomplish the goal of creating a company that is truly aligned with its users and customers to provide a service they desire or require. Some guard or incentive must exist, as is the case with coops, to keep the coop owned by its users and customers, yet still open to membership from new and interested users and customers.

Open source projects don't have the local angle going for them, but truthfully there is precious little to gain from participating in an open source project if one's goal is not to improve the project. Ego and a bullet on the resume perhaps. Through narrow focus and severely restricted ownership, these projects make sure that all owners are user/customers, even if they cannot guarantee (and in fact do not aspire to a reality) that all users are owners.

And so we come to the point

We have open source projects, which are fairly successful at making sure that they are guided primarily by input from user/customers but which can do a fairly bad job of making sure that all users/customers are well-served. We have commercial entities which do a good job of serving their owners but a bad job of making sure that their owners are also their customers, resulting in divergent interests. We have local coops, which tend to do a pretty good job of both but don't tend to scale well past a local community.

And then we're talking about essentially a distributed infrastructure coop that shares some of the best features of local coops and commercial entities. What structural features could a web-infrastructure cooperative boast that would meet the following necessary requirements?

  • Any user/customer can (and often does) become an owner
  • Only user/customers become owners (with the possible exception of a small minority ownership by other interested parties)

I don't know the answer to this question. In fact, a part of me doubts that there is an answer and suspects that we will be better served attempting to scale community processes. But it seems to be the crux of the problem of the dis-alignment in web infrastructure.

Make a webhook out of anything

(or ... "How I learned to stop worrying about doing it right, and just make the damn thing work")

I've had this problem for a while:

I use a great service called Instapaper (try it, seriously) for keeping track of my reading list. Which is great. But I want stuff to happen to certain items once I'm done reading something in Instapaper. They should be posted to Twitter, stored in Evernote, or squirreled away in Diigo, del.icio.us, or Pinboard.

This just isn't very achievable. While Instapaper is totally awesome, it does not provide automatic posting to all (or any) of these sites. It doesn't even provide Webhooks, which might provide the ... hook ... to allow for this sort of posting via some service like Yahoo! Query Language (YQL) or Tarpipe. What Instapaper provides is an RSS feed of shared items.

I've previously reviewed how to use YQL to post to Tarpipe, which solves part of the problem. It is possible to use this technique to consume a feed in Yahoo! Pipes and then post each item to Tarpipe. But this isn't really what I want to do, because it will post each item to Tarpipe each time the feed is read. Which is going to result in a lot of duplicate Tweets, Evernote notes, or whatever else I'm having Tarpipe do.

What we need is a Yahoo! Pipe that will only call the special Tarpipe (or other) YQL when there is a new item in the feed. Pipes isn't very good at this, but in steps Google Reader. Pretty much all that Google Reader does is query a feed occasionally and keep track of when a new item appears.

My strategy (and it works, bless Google and Yahoo!'s hearts) is to use Google Reader to check a Pipe and cache the items it has seen in a publicly accessible label. This is a little circular, but the very Pipe that Google Reader is checking pulls the feed I want to webhookify and compares the contents of the feed to the contents of the Google Reader label. If the Pipe sees any items in the feed that aren't yet stored in the Google Reader label, it does its magic on only those items.
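Stripped of the Pipes and Reader plumbing, the core of the trick is a set difference. Here is a rough sketch of that logic; the Item class and the unseen function are invented for illustration, and using the link as the identity of an item is my assumption, not something the Pipe necessarily does in exactly this way.

```scala
object NewItems {
  // Hypothetical minimal feed item; a real feed entry has more fields.
  case class Item(link: String, title: String)

  // Items present in the feed but not yet recorded in the "seen" label.
  // Links are used as the identity of an item (an assumption).
  def unseen(feed: Seq[Item], seen: Seq[Item]): Seq[Item] = {
    val seenLinks = seen.map(_.link).toSet
    feed.filterNot(item => seenLinks.contains(item.link))
  }
}
```

In this picture, feed is whatever the source feed currently contains, seen is whatever Google Reader has already stored under the label, and only the result of unseen gets passed along to the webhook action.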

I've made the Pipe that does this public at http://pipes.yahoo.com/esjewett/feed_to_webhook_using_google_reader but getting it working takes a little doing, which is described below with screenshots. I'll demonstrate using a public feed (Daring Fireball's main article feed, because just about every single article is worth bookmarking), but you could use it on any feed that Yahoo! Pipes can access. I use it on my Instapaper starred items feed, among other things.

Step 1

Determine your Google Reader User ID by selecting your Shared Items feed in Google Reader:

Once you've selected this feed, take note of the URL in the address bar of your browser. It includes a string that is your Google Reader User ID. The ID contains only numbers. That "F" in front of it is not part of the ID and the "%" after it is not part of the ID. Copy this ID down somewhere as you'll need it later.
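If you would rather not pick the ID out by eye, the rule above (digits only, an "F" in front, a "%" behind) fits in a small regular expression. This is just a sketch: the object name is made up, and the URL in the usage note is an invented placeholder, not a real Google Reader address.

```scala
object ReaderUserId {
  // Matches a run of digits preceded by "F" and followed by "%",
  // as described in the step above.
  private val IdPattern = """F(\d+)%""".r

  // Returns the first such run of digits, if the URL contains one.
  def extract(url: String): Option[String] =
    IdPattern.findFirstMatchIn(url).map(_.group(1))
}
```

For a made-up address such as "http://example.com/view/user%2F01234567890%2Fstate", ReaderUserId.extract picks out the digits between the "F" and the following "%".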

Step 2

Create your Pipe by cloning http://pipes.yahoo.com/esjewett/feed_to_webhook_using_google_reader

Step 3

Populate the user input fields of the Yahoo! Pipe with the feed you want to use ("http://daringfireball.net/feeds/articles" in our case), the Google Reader User ID from step 1, and the label you are going to make publicly accessible in Google Reader in a later step. The label you choose is important. It needs to be a label that is used for only this purpose and only this feed. Make it unique and call it something that will remind you of its purpose. I'll call mine "Daring Fireball Articles".

Once you've done all this, click the "Run Pipe" button.

Step 4

You should at this point get a list of the latest Daring Fireball articles, or whatever is in the feed that you've chosen to use. Click the button to add the Pipe results to Google Reader.

Click the "Add to Google Reader" button.

Step 5

Now you should be in Google Reader staring at your newly subscribed feed. Click the "Feed settings" button.

Then choose "New Folder..." from the bottom of the list of options, and name the folder whatever you put in as the "Google Reader Label" above. In our case, it is "Daring Fireball Articles".

You should see the feed on the left sidebar, under the folder you just created.

Step 6

Now we have all the infrastructure in place to actually do something with this feed. But we have not yet defined the action that our webhook Pipe will execute. So we need to tweak this pipe slightly. Go back to the pipe. You'll find your cloned version (And you did clone it didn't you?) at http://pipes.yahoo.com/pipes/person.info, at the top.

Edit the pipe.

At the lower-right corner of the edit screen is a loop operator with no module or pipe defined.

Drag any pipe or valid module into this loop. This action will be called exactly once for every new item in the feed you have just defined. If you want to post to Tarpipe, I recommend taking a look at the Pipe http://pipes.yahoo.com/esjewett/post_to_tarpipe_1_0_api (you'll have to clone it as well), which will post each item to tarpipe, using fields you specify as the title and body of the post. But you could call an arbitrary pipe that makes a call to a web service or even YQL.

I am using this Pipe to call a Tarpipe that posts every Daring Fireball article into Evernote automatically (the Tarpipe workflow key is fake, so don't get any ideas :-)

That's it.

Using this method of setup the pipe will not process existing entries in the feed, but it will process any new entries through the pipe you have assigned to the loop.

One limitation of this particular pipe is that it will not work reliably for feeds that are updated often. This is simply because Google Reader doesn't poll often enough. I have observed that Google Reader polls this feed every 4-8 hours. If more than 8 items are added between polls, older items will not be picked up by Google Reader and will not be processed by the pipe.
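The arithmetic of that limitation is simple enough to sketch. The names here are made up, and treating the limit as a fixed window of the newest items is my simplification of what I observed rather than anything documented.

```scala
object PollWindow {
  // Number of items that slip past unprocessed when `added` new items appear
  // between two polls but only the newest `windowSize` of them are picked up.
  def missed(added: Int, windowSize: Int): Int =
    math.max(0, added - windowSize)
}
```

With a window of 8, a feed that gains 12 items between polls loses the 4 oldest, while a feed that gains 5 loses nothing.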

Ok, that's it. For real this time.
