Twitter is a big ol' international hodgepodge of communities, conversations, and observations to the tune of 700,000+ active users. Being international, it also tends to be multilingual. Herein lies a problem: I can't understand, say, Arabic. Or French or Japanese, for that matter. In our brave new world of computerized, automated everything, there's no technical reason these conversations can't be translated on the fly, if only we had a little metadata.
Metadata, in this case, would be data about the language someone is tweeting in. You see, from a consumer perspective the real problem of automating translation isn't the translation part. Google and Yahoo! have that pretty well nailed at the level of 140-character tweets, which often appear to have been run through a translation program anyway. The problem is figuring out, automatically, the language of the post and the language it should be translated into.
Enter Twitter nanoformats. Nanoformats are closely related to microformats. So closely related that they even live on the Microformats wiki. Anyway, nanoformats are little pieces of text you can insert into tweets and other things to provide some sort of metadata, like location, tags, or (surprise) language. The language nanoformat consists of the string "lang:" followed by the ISO 639-1 code for whatever language the tweet is written in.
In order to enable auto-translation, a Twitter user would have to do two things. First, our user would insert a lang nanoformat in their own bio to indicate the language they normally tweet in. This bio nanoformat will also be used to figure out the language our user would like to receive other tweets in, so choose wisely. Second, our user would insert a lang nanoformat in tweets they make that are not in the language their bio indicates.
So, for example, I normally twitter in English, so my profile indicates that with the "lang:en" nanoformat. But if I tweet something in Spanish, I would add the "lang:es" nanoformat to that tweet. (Don't worry, I won't subject you to any Spanish tweets, notwithstanding the example below.)
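Here is a rough sketch of how a client might pick the lang nanoformat out of a tweet or a bio. The function name extract_lang and the exact matching rules (a two-letter code, set off by whitespace) are my own assumptions for illustration, not something the nanoformat spec is known to require.

```python
import re

# Assumption: the nanoformat is the literal "lang:" followed by a two-letter
# ISO 639-1 code, and it stands alone as a whitespace-delimited token.
LANG_RE = re.compile(r"(?<!\S)lang:([a-z]{2})(?!\S)", re.IGNORECASE)


def extract_lang(text):
    """Return (code_or_None, text_with_the_nanoformat_removed)."""
    match = LANG_RE.search(text)
    if match is None:
        return None, text
    code = match.group(1).lower()
    # Remove only the first nanoformat, then tidy up leftover whitespace.
    stripped = LANG_RE.sub("", text, count=1)
    stripped = " ".join(stripped.split())
    return code, stripped
```

Calling extract_lang("Hablo español lang:es") gives back the code "es" along with the tweet text minus the nanoformat, which is the part a translator would actually need.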
Now that we've done all that groundwork, the magic can begin.
Warning: the following paragraphs contain forward-looking statements. No client actually does this stuff yet, though it's conceivable that I am under-informed on that point.
When I fire up my Twitter client and log in, the client checks out my bio and sees that I want my posts in "en", which means "English" in ISO-ese. We'll call this imaginary client Litter, but we could just as well call it Twhirl, Snitter, Tweetr, or Twitterific.
Now, whenever Litter sees a tweet, it checks the tweet for the lang nanoformat. If it finds one, and it isn't "lang:en", it makes a quick call to http://translate.google.com to translate the tweet, sans nanoformat. If the client doesn't find a lang nanoformat in the tweet, it should check the bio of the person the tweet belongs to for a lang nanoformat and perform the same call.
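The decision Litter makes for each tweet might look something like the sketch below: the tweet's own nanoformat wins, the author's bio is the fallback, and nothing happens if the languages match or can't be determined. The names find_lang and translation_pair are hypothetical, and the bios are passed in as plain strings rather than fetched from Twitter.

```python
import re

# Assumption: "lang:" plus a two-letter code, as a whitespace-delimited token.
LANG_RE = re.compile(r"(?<!\S)lang:([a-z]{2})(?!\S)", re.IGNORECASE)


def find_lang(text):
    """Return the two-letter code from a lang nanoformat, or None."""
    match = LANG_RE.search(text or "")
    return match.group(1).lower() if match else None


def translation_pair(tweet_text, author_bio, reader_bio):
    """Return (source, target) if the tweet should be translated, else None."""
    target = find_lang(reader_bio)
    # The tweet's own nanoformat takes priority over the author's bio.
    source = find_lang(tweet_text) or find_lang(author_bio)
    if source is None or target is None or source == target:
        return None
    return source, target
```

With my "lang:en" bio, translation_pair("Hablo español lang:es", "some bio lang:en", "my bio lang:en") would come back as ("es", "en"), while an untagged tweet from an untagged author is simply left alone.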
So the tweet "Hablo español lang:es" would result in the following call in my Litter client:
http://translate.google.com/translate_t?text=Hablo español&langpair=es|en
Click that and see what comes out.
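A real client would need to URL-encode the tweet text and the language pair before making that call. Here's a minimal sketch using Python's standard library; the endpoint and the text and langpair parameter names are taken straight from the URL above, and the function name build_translate_url is made up for the example. Whether the endpoint responds as hoped is a separate question this doesn't answer.

```python
from urllib.parse import urlencode

# Endpoint as written in the example URL above; not verified here.
TRANSLATE_ENDPOINT = "http://translate.google.com/translate_t"


def build_translate_url(text, source, target):
    """Build the translation URL with properly encoded query parameters."""
    query = urlencode({
        "text": text,                      # tweet text, nanoformat removed
        "langpair": f"{source}|{target}",  # e.g. "es|en"
    })
    return f"{TRANSLATE_ENDPOINT}?{query}"
```

The encoding matters because the space, the "ñ", and the "|" in the example are not safe to send raw in a query string.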
What happens next involves some screen scraping of dubious legality, but maybe some agreement can be reached. Anyway, Litter then pulls the result of this query out of the HTML mishmash it receives back. It provides that translated result as the content of the tweet.
There are a lot of reasons this won't work.
First of all, there is a 70-request-per-hour limit on the Twitter API. Litter would exhaust that limit in about 5 minutes if it were trying to translate every tweet in the stream I receive. As a result, I suggest that the translation function be implemented with a manual trigger, at least initially. The imaginary Litter allows a user to right-click on a tweet and choose "Translate" from the context menu.
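Even with a manual trigger, a client would probably want to keep count of how much of its hourly allowance is left. One way to do that, sketched very loosely, is a sliding-window counter. The class name HourlyBudget is invented, the default of 70 is just the figure quoted above, and the caller supplies the current time in seconds so the logic stays easy to test.

```python
from collections import deque


class HourlyBudget:
    """Sliding-window request counter (default: 70 requests per 3600 s)."""

    def __init__(self, limit=70, window_seconds=3600):
        self.limit = limit
        self.window = window_seconds
        self.calls = deque()  # timestamps of requests still inside the window

    def try_spend(self, now):
        """Record a request at time `now` if the budget allows; return bool."""
        # Forget requests that have aged out of the window.
        while self.calls and now - self.calls[0] >= self.window:
            self.calls.popleft()
        if len(self.calls) >= self.limit:
            return False
        self.calls.append(now)
        return True
```

Litter's "Translate" menu item could check try_spend before doing any lookup that costs an API request, and grey itself out when the answer is no.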
It also appears Google's terms of use don't allow this sort of use of the Google Translate service. Similarly, Yahoo! doesn't appear to allow this use of Babelfish. However, there are a lot of dubious uses of these sorts of tools, so maybe we can come to some sort of agreement. In the meantime, a working proof of concept would be pretty neat.
Lastly, there should always be some way to easily get at the original tweet within the client, just in case something goes horribly wrong in the translation. This is technically an opt-in system, since it won't work on a person's tweets if they don't use the nanoformat, but we should still be considerate.
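Keeping the original around is cheap if the client stores the translation next to the tweet instead of overwriting it. A small sketch of that idea follows; DisplayedTweet and its field names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DisplayedTweet:
    """A tweet as shown in the client, with the original always preserved."""

    original: str
    translated: Optional[str] = None  # None until a translation is requested
    show_original: bool = False       # toggled by the user

    def visible_text(self):
        """Return the translation when there is one, unless the user opted out."""
        if self.translated is None or self.show_original:
            return self.original
        return self.translated
```

Flipping show_original back to True gets the reader to the untouched tweet in one step, no matter how badly the translation went.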
Despite these hurdles, I'm certainly looking forward to this type of functionality becoming common, and not just on Twitter. I know others are as well. What do you think? How can we make this happen?