Sunday, January 27, 2008

The Times Reports & Is SciFi Really Wrong?

The New York Times today published an interesting, if brief, article about speech recognition in the mobile/telco space - cited as a "$1.6 billion market in 2007". The article surveys a range of voice-driven applications and mashups, such as vlingo.com and SimulScribe, as well as some directory assistance services (though it omits others, such as SpinVox and GOOG-411).
The article opens:

"Innovation usually needs time to steep. Time to turn the idea into something tangible, time to get it to market, time for people to decide they accept it. Speech recognition technology has steeped for a long time"
And concludes:
"Even a digital expert [...] cautions that some people may never be satisfied with the quality of speech recognition technology — thanks to a steady diet of fictional books, movies and television shows featuring machines that understand everything a person says, no matter how sharp the diction or how loud the ambient noise."

But isn't this a bit hackneyed? Perhaps by today's standards a twenty-year steeping period seems long, but measured against almost any other point in history it is not. And after re-watching 1982's Blade Runner recently, I actually felt rather optimistic that we are, today, close to what the movie expected speech recognition and speaker verification to achieve by 2019. Elsewhere, a similar picture emerges.
The Star Trek ship computer's speech recognition engine (the year is 2151), while accurate, still requires the push of a button to kick in, rather than listening for the hot word "computer" - a capability that is available, if not quite ripe for deployment, today.
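For the technically minded, here is a minimal sketch of what "listening for a hot word" amounts to: the device passively scans every frame of audio and only starts treating speech as a command once the trigger word appears. The recognizer below is a fake stand-in (it pretends each frame is already transcribed text), not any real speech API.

    # Minimal sketch of hotword-triggered listening. The "recognizer"
    # is a fake stand-in that treats each audio frame as already-
    # transcribed text; a real engine would decode audio instead.
    HOTWORD = "computer"

    def transcribe(frame):
        # Placeholder for real speech recognition.
        return frame.lower()

    def hotword_loop(audio_frames):
        listening = False
        for frame in audio_frames:
            text = transcribe(frame)
            if not listening:
                # Passively scan every frame for the hot word.
                if HOTWORD in text.split():
                    listening = True   # wake up - no button press needed
            else:
                yield text             # hand the utterance to the command parser
                listening = False      # back to passive scanning

    # Nothing is treated as a command until "computer" is spoken.
    for command in hotword_loop(["play music", "Computer", "locate Mr Spock"]):
        print("command:", command)

The hard part in practice is, of course, the recognizer itself, which must run continuously and cheaply without false alarms - which is presumably why the button still wins in 2151.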
Of course, there are the HALs (2001), Marvins (no date), and C-3POs (a long, long time ago...), whose capacities far exceed anything we dare dream our mobile phones will one day understand. But here the problem seems less about the quality of speech technology - the quality of HAL's speech synthesis is available today, and Marvin's characteristic monotone baritone should be easy to do - than about the old hard-soft divide in Artificial Intelligence. As long as we use a hard-AI capability, which speech recognition arguably is, to solve soft-AI problems ("find the closest pizza service"), we are bound to be disappointed.

Thursday, January 3, 2008

GOOG: We need more data

The old maxim "I need more data" should be familiar to anyone who has ever wrestled with language technology, attempted to tune a speech application, or delved into any statistical approach to an AI-related problem. Google moved into the speech world last year with GOOG-411, a speech-recognition-driven directory assistance application (you say what you are looking for and where; it returns suitable businesses and either connects you to the one you want or sends you the details in an SMS).
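To make that call flow concrete, here is a schematic of such a service. Everything in it - the listings, the "recognizer", the function names - is a stand-in of my own invention to show the shape of the interaction, not how Google actually built it.

    # Schematic of a GOOG-411-style interaction. The listings and the
    # "recognizer" are invented placeholders, not Google's implementation.
    LISTINGS = {
        ("pizza", "palo alto"): [("Joe's Pizza", "+1-650-555-0123")],
    }

    def recognize(utterance):
        # Stand-in for real speech recognition: assume a clean
        # "<what> in <where>" transcript.
        what, _, where = utterance.lower().partition(" in ")
        return what, where

    def goog411(utterance, want_sms=False):
        what, where = recognize(utterance)
        matches = LISTINGS.get((what, where), [])
        if not matches:
            return "No listings found."
        name, phone = matches[0]
        if want_sms:
            return "SMS: %s %s" % (name, phone)   # details by text message
        return "Connecting you to %s..." % name   # bridge the call

    print(goog411("pizza in Palo Alto"))
    print(goog411("pizza in Palo Alto", want_sms=True))

Every utterance that passes through such a service, of course, is also a potential training sample - which is where the business model comes in.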
Like all (well, most) other Google services, GOOG-411 is free for the end-user. As such, the basic business model (collect data, turn data into cash) applies. This was recently confirmed in an interview by Marissa Mayer, Google's VP of Search Products and User Experience:

Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model ... that we can use for all kinds of different things, including video search.
Google thus couples statistical AI and its general data-driven approach to everything in a novel way. In doing so, Google may find itself in a catch-up race with the likes of Nuance, Loquendo, IBM, or Telisma, whose grip on speech recognition technology comes, in part, from having aggregated speech and language databases during professional services projects.
What is new in Google's approach, however, is the convergence of the dual role that data plays in AI and in the overall service-driven business model. Google will presumably not be content to bootstrap a pattern matching engine and sell licenses like the technology companies above. More interesting to follow will be the range of services Google can spin from this technology (context-sensitive video advertising, audio indexing, IVR hosting), which better befit its overall company strategy.
Unsurprisingly, Mayer goes on to claim that Google isn't working on ways out of the world of brute-force data-driven algorithms:
People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. ... A lot of people will turn to things like the semantic Web as a possible answer to that. But what we're seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they're done through brute force.
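Her point about brute force is easy to illustrate. The following toy example (mine, emphatically not Google's code) does nothing but count word pairs; given enough data, even that starts to look as though it "understands" what should come next.

    # Toy illustration: plain bigram counting, no semantics anywhere.
    # With a large enough corpus, the most frequent successor of a word
    # starts to look like an "intelligent" prediction.
    from collections import Counter, defaultdict

    def train_bigrams(corpus_lines):
        nxt = defaultdict(Counter)
        for line in corpus_lines:
            words = line.lower().split()
            for a, b in zip(words, words[1:]):
                nxt[a][b] += 1
        return nxt

    def predict(nxt, word):
        # Brute force: return the most frequently observed successor.
        followers = nxt.get(word.lower())
        return followers.most_common(1)[0][0] if followers else None

    model = train_bigrams([
        "find the closest pizza service",
        "find the nearest gas station",
        "call the closest pizza place",
    ])
    print(predict(model, "closest"))   # -> "pizza", learned by counting alone

Scale the corpus up from three lines to a few million GOOG-411 calls, and the counting starts to pass for comprehension.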
User privacy advocates may also have a thought or two about this new dimension of data collection, as Google is beginning to lose the "conventionally trustworthy" image it has held amongst many over the past years. Fortunately, the ways in which speech data is commonly used to train pattern matching models involve very little in the way of privacy infringement.
Happy data collecting!