Statistical misunderstandings and Google Book Search

Well, there I was, chatting away with (hopefully still) a friend about configuring a Twitter search widget for a blog, and up comes the Google Books meta-data topic, in the form of a link to "Google Books: A Metadata Train Wreck." This is the kind of article that really gets at me. Which is to say, it is an article holding a position that I mostly agree with in principle, but which does that position the disservice of making pretty questionable arguments.

So with that, I'll lay out two (or five) things that really get my goat when people start talking about large data sets and meta-data.

The misunderstanding about the whole point

There seems to be a misunderstanding between the scholarly community and Google about what Google Books actually is. I think it is this misunderstanding that leads to claims like the one about "miscategorization" of translators as authors. It seems clear to me that the Google Books team made a conscious decision to put translators and authors into the same search field. As such, it's not an error so much as it is a semantic disagreement. I'm willing to bet that Google maintains author and translator metadata separately on the backend and just concatenates the fields for the purposes of searching.
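To make that bet concrete, here is a minimal sketch of what "separate on the backend, concatenated for search" could look like. It is purely illustrative: `BookRecord`, `contributor_search_text`, and `matches` are names I made up, and nothing here reflects how Google actually stores or indexes its records.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BookRecord:
    """Hypothetical record that keeps authors and translators apart."""
    title: str
    authors: List[str]
    translators: List[str] = field(default_factory=list)

    def contributor_search_text(self) -> str:
        # Derived at search time: one lowercase string built from both fields.
        return " ".join(self.authors + self.translators).lower()


def matches(record: BookRecord, query: str) -> bool:
    # Case-insensitive substring match against the combined field.
    return query.lower() in record.contributor_search_text()


# Made-up example data.
record = BookRecord(
    title="Example Title",
    authors=["Ann Author"],
    translators=["Tom Translator"],
)

print(matches(record, "translator"))  # the translator is findable...
print(record.authors)                 # ...but is still not stored as an author
```

In a design like this, a search hit on the translator's name says nothing about whether the underlying record confuses the two roles.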

In brainstorming, I came up with a few things Google Book Search and the whole library digitization project might be, from Google's perspective:

  1. A vehicle for advertising and referral revenue for Google.
  2. A project to create a dataset that Google can use to train its translation and semantic knowledge engines.
  3. A dumb idea that will never help Google and will eventually be abandoned.

It's possible that Google really conceives of the project as at least partially having the goal of enabling scholarship, but having watched how Google works for about a decade now, I think #2 is probably the real goal. If I'm right about this, then Google Book Search is just a cover: a way of making the data gathered from libraries publicly available. If #1 is right, then having the author and translator in the same field is exactly the right thing to do, since lots of people will mistakenly search for the translator instead of the author.

Would it be good for Google to provide a scholarly interface to Google Books metadata that splits out translator and author? Yes. Does Google have a responsibility to provide this interface? I don't see how.

The misleading or useless appeals to statistics and data

The second point is perhaps the more important one: there is often a misunderstanding or ignorance of statistical methods and of the type of weird stuff that happens within large data sets. I see this occasionally when the scholarly/library community blogs about Google Books, but I see it in a lot of other places too. This (I assume) ignorance results (again, I assume) in a belief that an appeal to common sense or personal experience is a valid argumentative technique when large data sets are in play. From experience with large data sets, I tend to believe that common sense and personal experience are worse than useless when large data sets are involved. In any case, this belief results in two major problems, and we see examples of both in the above-linked blog.

Statistical problem #1: Claims about the number of errors in a particular set of meta-data without any comparison to existing meta-data sets that might give us a baseline for the number of errors we should expect. For example, if there are 2 errors of type X per 1000 records in Google Books, that doesn't tell me anything unless I also know that there are 0.3 errors of type X per 1000 records in the UC library system. That tells me Google Books has some catching up to do. On the other hand, maybe there are 5 errors per 1000 in the UC library system, in which case Google Books is doing pretty well. The blog (and most of these sorts of blogs) fails to give any useful metric for me to compare against.

Statistical problem #2: The sample size is usually left out. If there are 300 errors of type X in Google Books and only 100 errors of type X in my local public library ... well, that probably means that Google Books is far better than my local public library, record for record, because it's got vastly more books in it. But I can't tell, because no one bothers to say how many books are in Google Books and how many are in my local public library.

In summary: A metric like "572 errors of type X" is useless because I don't know the sample size or have anything to compare to. "2 errors of type X per 1000 records in Google Books as opposed to 0.4 errors of type X per 1000 records in the Harvard University library system" on the other hand is incredibly useful as it provides a basis for comparison and understanding.
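For the arithmetically inclined, here is a minimal sketch of the normalization I'm asking for. The collection names and sizes are invented for illustration (I don't know the real figures, which is rather the point), and `errors_per_thousand` is just a helper name I picked.

```python
def errors_per_thousand(error_count: int, record_count: int) -> float:
    """Normalize a raw error count to errors per 1000 records."""
    if record_count <= 0:
        raise ValueError("record_count must be positive")
    return 1000 * error_count / record_count


# Hypothetical (error count, record count) pairs. These are NOT real figures.
collections = {
    "big_collection": (300, 10_000_000),
    "small_library": (100, 50_000),
}

rates = {
    name: errors_per_thousand(errors, records)
    for name, (errors, records) in collections.items()
}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} errors per 1000 records")
```

With these made-up sizes, the collection with three times as many raw errors comes out with the far lower error rate, which is exactly why a bare count tells me nothing.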

To be completely clear

Again, I'm sympathetic to the desire for a great digitally accessible and searchable book repository. I'm sympathetic to the underlying concern about the devaluation of knowledge and education that drives a lot of these arguments (for example, this blog, which said friend later forwarded). But arguments made in an undisciplined and sometimes misleading manner actually do more to hinder the cause than to further it, at least in my mind.
