google

Statistical misunderstandings and Google Book Search

Well, there I was, chatting away with (hopefully still) a friend about configuring a Twitter search widget for a blog and up comes the Google Books meta-data topic, in the form of a link to Google Books: A Metadata Train Wreck. This is the kind of article that really gets at me. Which is to say that it is an article holding a position that I mostly agree with in principle, but which does the position the disservice of making pretty questionable arguments.

So with that, I'll lay out two (or 5) things that really get my goat when people start talking about large data sets and meta-data.

The misunderstanding about the whole point

There seems to be a misunderstanding between the scholarly community and Google about what Google Books actually is. I think it is this misunderstanding that leads to claims like the one about "miscategorization" of translators as authors. It seems clear to me that the Google Books team made a conscious decision to put translators and authors into the same search field. As such, it's not an error so much as it is a semantic disagreement. I'm willing to bet that Google maintains author and translator metadata separately on the backend and just concatenates the fields for the purposes of searching.
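To make that bet concrete, here is a minimal sketch of what "separate on the backend, concatenated for search" could look like. This is purely my guess at the shape of such a design, not anything Google has documented; `BookRecord`, `contributor_search_field`, and `search_by_contributor` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class BookRecord:
    """Hypothetical record that stores authors and translators separately."""
    title: str
    authors: list
    translators: list = field(default_factory=list)

    def contributor_search_field(self) -> str:
        # Concatenate both roles into one searchable string.
        # The stored metadata itself is left untouched.
        return " ".join(self.authors + self.translators)


def search_by_contributor(records, query):
    """Return records whose combined contributor field contains the query."""
    needle = query.lower()
    return [r for r in records
            if needle in r.contributor_search_field().lower()]


records = [
    BookRecord("The Odyssey", authors=["Homer"], translators=["Robert Fagles"]),
]

# A search on the translator's name finds the book ...
hits = search_by_contributor(records, "fagles")
# ... while the author field still contains only the author.
print(hits[0].title, hits[0].authors)
```

Under this design, a scholarly interface that splits the roles back out would be a presentation change rather than a data cleanup job.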

In brainstorming, I came up with a few things Google Book Search and the whole library digitization project might be, from Google's perspective:

  1. A vehicle for advertising and referral revenue for Google.
  2. A project to create a dataset that Google can use to train its translation and semantic knowledge engines.
  3. A dumb idea that will never help Google and will eventually be abandoned.

It's possible that Google really conceives the project as at least partially having the goal of enabling scholarship, but I think #2 is probably the real goal, having watched how Google works for about a decade now. If I'm right about this, then Google Book Search is just a cover; a way of making the data gathered from libraries publicly available. If #1 is right, then having the author and translator in the same field is exactly the right thing to do, since lots of people will mistakenly search for the translator instead of the author.

Would it be good for Google to provide a scholarly interface to Google Books metadata that splits out translator and author? Yes. Does Google have a responsibility to provide this interface? I don't see how.

The misleading or useless appeals to statistics and data

The second point is perhaps the more important one: There is often a misunderstanding or ignorance of statistical methods and the type of weird stuff that happens within large data sets. I see this occasionally when the scholarly/library community blogs about Google Books, but I see it in a lot of other places too. This (I assume) ignorance results (again, I assume) in a belief that appeal to common sense or personal experience is a valid argumentative technique when large data sets are in play. From experience with large data sets, I have a tendency to believe that common sense and personal experience are worse than useless when large data sets are involved. In any case, this belief results in two major problems and we see examples of both in the above-linked blog.

Statistical problem #1: Claims about the number of errors in a particular set of meta-data without any comparison to existing meta-data sets that might give us a baseline of the number of errors we may expect. For example, if there are 2 errors of type X per 1000 records in Google Books, that doesn't tell me anything unless I also know that there are 0.3 errors of type X per 1000 records in the UC library system. That tells me Google Books has some catching up to do. On the other hand, maybe there are 5 errors per 1000 in the UC library system, in which case Google Books is doing pretty well. The blog (and most of these sorts of blogs) fails to give any useful metric for me to compare against.

Statistical problem #2: The sample size is usually left out. If there are 300 errors of type X in Google Books and only 100 errors of type X in my local public library ... well that probably means that Google Books is 100x better than my local public library because it's got a lot more books in it. But I can't tell because no one bothers to say how many books are in Google Books and how many are in my local public library.

In summary: A metric like "572 errors of type X" is useless because I don't know the sample size or have anything to compare to. "2 errors of type X per 1000 records in Google Books as opposed to 0.4 errors of type X per 1000 records in the Harvard University library system" on the other hand is incredibly useful as it provides a basis for comparison and understanding.
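Here is a small sketch of the normalization I'm asking for. The collection names and every number in it are made up for illustration; I don't know the real error counts or record counts for any of these collections, which is exactly the problem.

```python
def errors_per_thousand(error_count: int, record_count: int) -> float:
    """Return the number of errors per 1000 records."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return 1000 * error_count / record_count


# Hypothetical figures, invented for illustration only:
# (errors of type X, total records in the collection)
collections = {
    "big digitized collection": (300, 10_000_000),
    "local public library": (100, 50_000),
}

rates = {
    name: errors_per_thousand(errors, records)
    for name, (errors, records) in collections.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} errors per 1000 records")
```

With these invented numbers, the collection with three times as many raw errors comes out with a far lower error rate, which is why the raw count alone tells me nothing.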

To be completely clear

Again, I'm sympathetic to the desire for a great digitally accessible and searchable book repository. I'm sympathetic to the underlying concern about the devaluation of knowledge and education that drives a lot of these arguments (for example, this blog, which said friend later forwarded). But arguments made in an undisciplined and sometimes misleading manner actually do more to hinder the cause than to further it, at least in my mind.

Make a webhook out of anything

(or ... "How I learned to stop worrying about doing it right, and just make the damn thing work")

I've had this problem for a while:

I use a great service called Instapaper (try it, seriously) for keeping track of my reading list. Which is great. But I want stuff to happen to certain items once I'm done reading something in Instapaper. They should be posted to Twitter, stored in Evernote, or squirreled away in Diigo, del.icio.us, or Pinboard.

This just isn't very achievable. While Instapaper is totally awesome, it does not provide automatic posting to all (or any) of these sites. It doesn't even provide Webhooks, which might provide the ... hook ... to allow for this sort of posting via some service like Yahoo! Query Language (YQL) or Tarpipe. What Instapaper provides is an RSS feed of shared items.

I've previously reviewed how to use YQL to post to Tarpipe, which solves part of the problem. It is possible to use this technique to consume a feed in Yahoo! Pipes and then post each item to Tarpipe. But this isn't really what I want to do, because it will post each item to Tarpipe each time the feed is read. Which is going to result in a lot of duplicate Tweets, Evernote notes, or whatever else I'm having Tarpipe do.

What we need is a Yahoo! Pipe that will only call the special Tarpipe (or other) YQL when there is a new item in the feed. Pipes isn't very good at this, but in steps Google Reader. Pretty much all that Google Reader does is query a feed occasionally and keep track of when a new item appears.

My strategy (and it works, bless Google and Yahoo!'s hearts) is to use Google Reader to check a Pipe and cache the items it has seen in a publicly accessible label. This is a little circular, but the very Pipe that Google Reader is checking pulls the feed I want to webhookify and compares the contents of the feed to the contents of the Google Reader label. If the Pipe sees any items in the feed that aren't yet stored in the Google Reader label, it does its magic on only those items.
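If it helps, the logic boils down to "diff the feed against what has already been seen, and act only on the difference". Below is a minimal sketch of that idea in Python rather than in Pipes. The set `seen_ids` stands in for the Google Reader label, `action` stands in for whatever the loop calls, and the `"id"` key is an assumed item identifier; none of these names come from the actual Pipe.

```python
from typing import Callable, Iterable


def process_new_items(feed_items: Iterable[dict],
                      seen_ids: set,
                      action: Callable[[dict], None]) -> list:
    """Run `action` once for each feed item whose id is not in `seen_ids`.

    Processed ids are added to `seen_ids`, modelling the way the Google
    Reader label accumulates items once they have been picked up.
    """
    processed = []
    for item in feed_items:
        item_id = item["id"]
        if item_id in seen_ids:
            continue  # already handled on an earlier poll
        action(item)
        seen_ids.add(item_id)
        processed.append(item_id)
    return processed


# Stand-in data: one item is already in the "label", two are new.
feed = [{"id": "a"}, {"id": "b"}, {"id": "c"}]
seen = {"a"}
calls = []

first_run = process_new_items(feed, seen, calls.append)
second_run = process_new_items(feed, seen, calls.append)

print(first_run)   # ids handled on the first pass
print(second_run)  # nothing left to handle on the second pass
```

Reading the same feed twice triggers the action only once per item, which is the whole point of the exercise.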

I've made the Pipe that does this public at http://pipes.yahoo.com/esjewett/feed_to_webhook_using_google_reader but getting it working takes a little doing, which is described below with screenshots. I'll demonstrate using a public feed (Daring Fireball's main article feed, because just about every single article is worth bookmarking), but you could use it on any feed that Yahoo! Pipes can access. I use it on my Instapaper starred items feed, among other things.

Step 1

Determine your Google Reader User ID by selecting your Shared Items feed in Google Reader:

Once you've selected this feed, take note of the URL in the address bar of your browser. It includes a string that is your Google Reader User ID. The ID contains only numbers. That "F" in front of it is not part of the ID and the "%" after it is not part of the ID. Copy this ID down somewhere as you'll need it later.
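If you'd rather not pick the ID out by eye, the rule above (digits only, an "F" in front, a "%" after) translates into a short regular expression. This is a sketch under assumptions: the URL below is a made-up example with a fake ID, its exact shape is my guess at what the address bar shows, and `extract_reader_user_id` is a hypothetical helper name.

```python
import re
from typing import Optional


def extract_reader_user_id(url: str) -> Optional[str]:
    """Return the first run of digits that sits between an "F" and a "%".

    Returns None when the URL contains no such run.
    """
    match = re.search(r"F(\d+)%", url)
    return match.group(1) if match else None


# Made-up example URL with a fake user ID (an assumption, not a real address).
example_url = (
    "http://www.google.com/reader/view/"
    "user%2F01234567890123456789%2Fstate%2Fcom.google%2Fbroadcast"
)

user_id = extract_reader_user_id(example_url)
print(user_id)
```

The ID is kept as a string rather than converted to a number, so any leading zeros survive.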

Step 2

Create your Pipe by cloning http://pipes.yahoo.com/esjewett/feed_to_webhook_using_google_reader

Step 3

Populate the user input fields of the Yahoo! Pipe with the feed you want to use ("http://daringfireball.net/feeds/articles" in our case), the Google Reader User ID from step 1, and the label you are going to make publicly accessible in Google Reader in a later step. The label you choose is important. It needs to be a label that is used for only this purpose and only this feed. Make it unique and call it something that will remind you of its purpose. I'll call mine "Daring Fireball Articles".

Once you've done all this, click the "Run Pipe" button.

Step 4

You should at this point get a list of the latest Daring Fireball articles, or whatever is in the feed that you've chosen to use. Click the "Add to Google Reader" button to add the Pipe results to Google Reader.

Step 5

Now you should be in Google Reader staring at your newly subscribed feed. Click the "Feed settings" button.

Then choose "New Folder..." from the bottom of the list of options, and name the folder whatever you put in as the "Google Reader Label" above. In our case, it is "Daring Fireball Articles".

You should see the feed on the left sidebar, under the folder you just created.

Step 6

Now we have all the infrastructure in place to actually do something with this feed. But we have not yet defined the action that our webhook Pipe will execute. So we need to tweak this pipe slightly. Go back to the pipe. You'll find your cloned version (and you did clone it, didn't you?) at http://pipes.yahoo.com/pipes/person.info, at the top.

Edit the pipe.

At the lower-right corner of the edit screen is a loop operator with no module or pipe defined.

Drag any pipe or valid module into this loop. This action will be called exactly once for every new item in the feed you have just defined. If you want to post to Tarpipe, I recommend taking a look at the Pipe http://pipes.yahoo.com/esjewett/post_to_tarpipe_1_0_api (you'll have to clone it as well), which will post each item to tarpipe, using fields you specify as the title and body of the post. But you could call an arbitrary pipe that makes a call to a web service or even YQL.

I am using this Pipe to call a Tarpipe that posts every Daring Fireball article into Evernote automatically (the Tarpipe workflow key is fake, so don't get any ideas :-)

That's it.

Using this method of setup, the pipe will not process existing entries in the feed, but it will process any new entries through the pipe you have assigned to the loop.

One limitation of this particular pipe is that it will not work reliably for feeds that are updated often. This is simply because Google Reader doesn't poll often enough. I have observed that Google Reader polls this feed every 4-8 hours. If more than 8 items are added between polls, older items will not be picked up by Google Reader and will not be processed by the pipe.
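To see why items get lost, here is a toy simulation of the limitation. It assumes that each poll only sees the most recent `window_size` entries of the feed; I've set that to 8 to match the number above, but the exact mechanism inside Google Reader is an assumption on my part, and `poll` is a hypothetical function.

```python
def poll(feed: list, seen: set, window_size: int) -> list:
    """Return the unseen items among the `window_size` most recent entries."""
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    visible = feed[-window_size:]  # older entries have scrolled out of view
    fresh = [item for item in visible if item not in seen]
    seen.update(fresh)
    return fresh


feed = []
seen = set()

# Five items arrive before the first poll: all of them fit in the window.
feed.extend(f"item-{n}" for n in range(1, 6))
first = poll(feed, seen, window_size=8)

# Eleven more arrive before the second poll: only the newest eight are visible.
feed.extend(f"item-{n}" for n in range(6, 17))
second = poll(feed, seen, window_size=8)

# Whatever was never visible during a poll is never processed.
missed = [item for item in feed if item not in seen]
print(missed)
```

In this run the three oldest items from the busy stretch fall out of the window before the poll happens, so they are never processed.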

Ok, that's it. For real this time.

Telemetry in the enterprise

Stephen O'Grady has a nice post clarifying his position on the telemetry opportunity in the enterprise, for open source vendors especially. "Telemetry" being here the practice of contracting with customers to send data back to the vendor, which the vendor then uses to provide some value-add to the customer. This value might take the form of better software, optimization tips tailored to the customer, or metrics comparing the customer to the full pool of customers.

Oddly enough, this might be one of those places where the enterprise is a bit ahead of the open source world. Personally, I'm aware of SAP providing several telemetry-like services focused on performance, maintenance, and bug/problem reporting.

It's also worth pointing out that SAP's traditional business model of partnering with customers to understand processes and develop business applications could be considered telemetry by some definitions. True, it's human-based telemetry rather than a technical system of reporting back to the vendor, but the point is the same: the customer transmits information about how they use or will use the software to the vendor while the vendor uses that information to provide a better experience to all customers.

In the open source space my impression is that we see telemetry (such as it is) focused primarily in the software quality space. Crash reporting is probably the primary place we see this, as in the ubiquitous Firefox beta crash reporter.

As O'Grady points out, software-as-a-service companies have recognized this area as a great opportunity to develop competitive advantage. Google uses telemetry data extensively in its search product (suggestions, especially), as well as in other products like Analytics (benchmarking) to name an example in the enterprisey-er space.

In the consumer space, we're starting to see many tentative steps. Wesabe and Mint are both financial apps that allow the user to compare their activities against the aggregated activity of relevant slices of the apps' user bases. The idea is powerful, though I'm not particularly impressed by the utility of the implementations we're seeing outside of Google. It seems like apps are going to need to give up on trying to get the user to dig through this data goldmine and instead just deliver up specific nuggets, as Google does with its search suggestions.
