Google 411 Harvesting Speech Data

I just stumbled upon (not with StumbleUpon, though) this scary blog post about Google data mining your voice:

Assuming each call has four utterances – and costs 3 cents each on average – then it would cost Google about $8M for 1 billion utterances (or $80M to match TellMe’s 10 billion utterances). A bargain compared to buying Tellme for $800M. I agree with Tim O’Reilly’s conclusion that Google’s launch of a directory assistance product was accelerated (and is driven) by their desire to compete with Microsoft.
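The numbers in that quote are easy to sanity-check. Here is a rough sketch of the arithmetic, assuming the 3 cents is the cost per call rather than per utterance (that is the reading that lands near the quoted $8M). The function and parameter names are my own, made up for illustration.

```python
def collection_cost_usd(utterances, utterances_per_call=4, cost_per_call_usd=0.03):
    """Estimate the cost of collecting a given number of utterances.

    Assumes (per the quoted post) four utterances per call and
    3 cents per call on average.
    """
    calls = utterances / utterances_per_call
    return calls * cost_per_call_usd


# 1 billion utterances, and the 10 billion attributed to TellMe
for target in (1_000_000_000, 10_000_000_000):
    cost_millions = collection_cost_usd(target) / 1e6
    print(f"{target:,} utterances: about ${cost_millions:.1f}M")
```

Under those assumptions the totals come to roughly $7.5M and $75M, which the quoted post rounds up to $8M and $80M.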

The post, which might be a little over the top, references Tim O’Reilly, who speculates about the motivations behind Google’s free 411 directory assistance:


This is reminiscent of a comment that Peter Norvig, Director of Research at Google, made to me last year about automated translation, and why it’s getting better. “We don’t have better algorithms. We just have more data.”
In short, I’m speculating that the 1-800-GOOG-411 service is designed to harvest voice data to build Google’s own speech database, rather than licensing from Nuance or another player.

This is kind of creepy from a privacy standpoint, but as long as Google refrains from being evil, I guess we’re safe, right?

Interestingly, O’Reilly says “many of the future battles between industry giants will be around who owns data.” That is definitely scary when you are talking about your own consumer data, but what about scientific data, like the traffic patterns of pelagic animals?