Friday, November 28, 2008

IBM Predicts Talking Web

IBM's annual crystal ball list of Innovations That Will Change Our Lives in the Next Five Years includes a forecast of a voice-enabled talking web. "You will be able to sort through the Web verbally to find what you are looking for and have the information read back to you," the article predicts.
IBM itself has launched several voice-enabled products and initiatives over the years, most notably the WebSphere Voice family of servers, which adds voice functionality to its flagship WebSphere platform in areas such as unified messaging and call-center automation.
A vision like the one the article advocates faces several problems. Speech recognition accuracy and noise filtering have come a long way and may pose only a minor impediment.
Whether users want to speak rather than type or click is another question. Issuing voice commands in the presence of others may not always be desirable and can be disruptive, for instance at work or on public transport. Lastly, there are usability concerns, beyond the quality of speech technology, in converting a visual two- or even three-dimensional representation of information into a one-dimensional audio stream. The cognitive load increases significantly for tasks more complex than, say, obtaining timetable information or finding the nearest Italian restaurant.
The effort behind the vision, to put voice technology to uses beyond call-center automation, is laudable. Mobile internet access and computing on the road may indeed do their part to make this vision come true. And clearly there are use cases, such as improved accessibility for users with impairments, that on their own merit making the web voice-accessible. Widespread usage of a voice-enabled web, however, may be more than five years off.

Tuesday, November 18, 2008

Google Mobile iPhone App with Speech Recognition

Google released a new feature for its Google Mobile iPhone application yesterday: voice search. Users speak a query and the application returns search results formatted for the iPhone. This is similar to the GOOG411 directory assistance application, which allows users to call a phone number, speak a query and receive information about local listings in voice or SMS form. However, the new application apparently performs recognition locally on the iPhone, meaning it comes bundled with an embedded speech recognition engine.

Aside from GOOG411, Google released Gaudi, a voice-indexing technology for video, during the US presidential election campaign. That makes the iPhone app the third official Google service to make use of speech recognition, and it leaves one guessing when Google's speech technology will become available as an API, like the Google AJAX Language API for translation and transliteration, rather than bundled into software services. An Android version is presumably also in the works.

All applications are available in US English for now.

Thursday, October 2, 2008

Nuance buys Philips Speech Recognition Systems

Nuance announced this week its acquisition of Philips Speech Recognition Systems, another step in a series of acquisitions by the speech technology giant aimed at market and portfolio expansion. In 2002, ScanSoft Inc., which through further mergers and acquisitions became today's Nuance, had already acquired Philips' network speech processing group, though not its dictation unit. With this week's acquisition, the dictation unit will be incorporated into Nuance's already strong dictation portfolio, expanding especially into European healthcare markets, the company announced. Highlights of the purchase include a larger customer base, broader language and solutions portfolios, additional distribution channels, and a leap forward in international expansion.

Friday, September 19, 2008

Google Showcases Audio Indexing with Gaudi

Google Labs opened GAudi this week to showcase its new audio indexing technology.

Google GAudi allows searching for keywords and phrases in the audio stream of selected YouTube videos. Matches are represented as yellow slots on the playback slider. Top results appear as snippets of text from the audio surrounding the search term, along with information about how many minutes into the video the term occurs.
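Conceptually, this kind of audio index maps each recognized word in a time-aligned transcript to the offsets at which it was spoken, so a keyword query can jump straight to the right spots in the video. A minimal sketch of the idea (all names hypothetical; this is not Google's actual implementation):

```python
# Sketch: invert a time-aligned transcript into a word -> [timestamps]
# map, then answer keyword queries against it.
from collections import defaultdict

def build_index(aligned_transcript):
    """aligned_transcript: list of (word, seconds_into_video) pairs."""
    index = defaultdict(list)
    for word, t in aligned_transcript:
        index[word.lower()].append(t)
    return index

def search(index, query):
    """Return the offsets (in seconds) at which the query word occurs."""
    return index.get(query.lower(), [])

transcript = [("senator", 12.4), ("economy", 13.1), ("senator", 95.0)]
idx = build_index(transcript)
print(search(idx, "Senator"))  # [12.4, 95.0]
```

The hard part, of course, is producing the time-aligned transcript in the first place; that is where the speech recognition engine comes in.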

The video material chosen to showcase GAudi concerns this year's US presidential elections, as "part of a broader effort around politics", but also because the technology performs well on such material and because of its relevance to testers and users.

Indexing does not appear to be complete: randomly chosen text fragments from showcased videos did not always produce a match. Google does say GAudi uses its own speech recognition engine, perhaps the same one employed by GOOG411, though most FAQ entries about technical details, and about how one could use GAudi for one's own video, are deferred to email inquiries.

While GAudi is showcasing campaign material, it seems only a matter of time before audio indexing will be available for serving ad content on video.

Monday, September 8, 2008

Microsoft Windows Live Messenger Translation Bot

In the wake of Google's release of its Chrome web browser, speculation about plans for Chrome on other platforms, including Android, has drifted ashore. Naturally this has washed aside much recent IE8 news, which, though not a game-changer, is said to introduce many of the much-needed improvements everyone has been looking for from Microsoft.

With a browser war raging, a little add-on for Microsoft's Live Messenger may not stir many waters, even if it promises real-time chat translation between English and 14 other languages. It is still refreshing, however, to read about technology geared toward opening channels of communication rather than capturing market share.

What are Google's plans for Chrome and Android vis-à-vis Microsoft's IE on Windows Mobile? Will Microsoft leverage its non-browser language services, such as translation and speech recognition, the way Google has?

Monday, May 19, 2008

OnMobile buys Telisma

OnMobile Global Ltd today acquired France-based Telisma, a producer of speech recognition software for network/telephony environments.
The acquisition comes shortly after OnMobile partnered with Nuance, a Telisma competitor in speech recognition markets, to deploy voice search applications in its home market, India. India's multilingual market has made it a tough one for speech technology companies to crack, though a lucrative one: India has recently surpassed the U.S. as the second-largest mobile market in the world, according to Om Malik at GigaOm.
I suspect issues specific to speech technology and India's multilingualism have something to do with this deal. As I recently pointed out, internationalization of speech and language technologies comes at a steep entry cost, due to the high demands on expertise and data required for building language-specific models. In addition, speech recognition companies like Nuance have long kept their language models under wraps. In other words, if your language isn't catered to, reaching that language's customer base becomes a very pricey affair.
While open-source initiatives to build freely available language models for speech recognition exist, Telisma has opted for middle ground by allowing partners and customers to build their own models while selling the tools to do so. In a market like India, the ability to cater to a multilingual customer base without purchasing expensive proprietary software (or paying someone else to develop proprietary software for you to purchase) may have made a big difference in this deal.

On a different note, this acquisition is the latest in a series of acquisitions consolidating the speech technology market. While five years ago telephony speech technology was a highly redundant market of small companies building similar products, today they have largely been acquired by or merged with bigger players. In the meantime, companies like Microsoft, IBM, Siemens and Google are making their own moves to enter the market.

Telisma's acoustic modelling toolkit is in fact not for sale but free, as one reader has pointed out. Thanks!

Monday, May 5, 2008

Internationalization and Speech Technologies

The not-so-subtle truth is, of course, that we all speak English. Yet localization and internationalization are at once prerequisite and stumbling stone for many web-based endeavors.

In my own backyard, two examples illustrate the effect of and the need for internationalization, respectively. The German professional social network XING has internationally outperformed competitors like LinkedIn through early and aggressive internationalization. StudiVZ, the "German Facebook", had captured much of the student social-network market before Facebook decided to release a German version of its web app, making this a tough market to crack.

Ironically, as these two examples underline, the need for localization remains even in cases where the demands on usability are low (join a group, contact a person, send a message) and the target audience can largely be expected to speak sufficient English (read this for an interesting take on the same issues and solutions in online gaming). Moreover, localization is an effort far greater than providing an interface in the local language.

As one might expect, localization, internationalization and speech technology are inextricably linked; in a sense, developing speech technology is internationalization. And using such technology in professional service projects is akin to building an internationalized web application. Here are some of the oddities I've observed while working with speech technologies in an international environment:

Translation is not enough. When you write software that speaks or wants to be spoken to, there is more at stake than providing interface text. Can you expect all your users to spell their input when your system doesn't understand raw speech? Can you be sure that all your translated content will generate well-formed speech-synthesis output? Language and culture are sensitive matters, so a well-localized speech application must do more than provide a translated user interface. Employing local staff is usually a minimum requirement when building a speech application for a new market.

The cost shifts. Re-usability of resources from previous speech projects is usually low. So unlike localizing a web application, porting a speech application requires grunt work that you thought you had done the first time around. Moreover, speech applications in new languages almost always come with additional licensing burdens and questions about the appropriate technology partner. Expect to pay for things you didn't expect.

There is no long tail. The buy-in costs for developing a new language in almost any speech or language technology (recognition, synthesis, translation) remain constant. This makes every newly developed language a strategic decision and translates into a two-tier localization effort: one tier developing basic technologies, the other employing such technology in professional service projects.
As an example, consider the world's most successful dictation software packages: Dragon NaturallySpeaking ships in five flavors of English and six European languages; Philips' SpeechMagic ships in 23 dialects of 11 languages. Both are a far cry from world coverage.
The enormous cost of development has a decided effect on the availability of speech technology for lesser-spoken languages. And it poses a significant hurdle for open-source initiatives that aim to provide such resources for free.
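The "no long tail" point can be made with back-of-the-envelope arithmetic: if the buy-in cost of a new language is roughly constant, the cost per potential user explodes as the speaker base shrinks. The figures below are purely illustrative, not real development costs or market sizes:

```python
# Illustrative only: a roughly fixed per-language development cost,
# divided by speaker population, shows why small languages are a
# strategic decision rather than an incremental one.
DEV_COST = 2_000_000  # hypothetical fixed cost to develop one language

speakers = {"English": 500_000_000, "German": 90_000_000, "Welsh": 600_000}

for lang, pop in speakers.items():
    print(f"{lang}: ${DEV_COST / pop:.2f} per potential user")
```

With these made-up numbers, the per-user cost for Welsh is nearly a thousand times that for English, which is the economics behind the two-tier localization effort described above.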

Sunday, January 27, 2008

The Times Reports & Is SciFi Really Wrong?

The New York Times today published an interesting, if brief, article about speech recognition in the mobile/telco space, cited as a "$1.6 billion market in 2007". The article provides a brief overview of a range of voice-driven applications and mashups, such as SimulScribe, as well as some directory assistance services (though it omits others such as SpinVox and GOOG411).
The article opens:

"Innovation usually needs time to steep. Time to turn the idea into something tangible, time to get it to market, time for people to decide they accept it. Speech recognition technology has steeped for a long time."
And concludes:
"Even a digital expert [...] cautions that some people may never be satisfied with the quality of speech recognition technology — thanks to a steady diet of fictional books, movies and television shows featuring machines that understand everything a person says, no matter how sharp the diction or how loud the ambient noise."

But isn't this a bit hackneyed? Perhaps by today's standards a twenty-year steeping period seems long, but that is hardly the case anywhere else in history. And after re-watching 1982's Blade Runner recently, I actually felt rather optimistic that we are today close to what the movie expected of speech recognition and speaker verification by 2019. Elsewhere, a similar picture emerges.
The Star Trek ship computer's speech recognition engine (the year is 2151), while accurate, still requires the push of a button to kick in, rather than listening for the hot word "computer", a capability available today, if not quite ripe for deployment.
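Hot-word activation of this kind can be thought of as a gate on the stream of recognized words: everything is discarded until the wake word appears, and what follows is treated as a command. A toy sketch on text tokens, standing in for the output of a real recognizer:

```python
# Toy hot-word gate: ignore recognized words until the wake word
# ("computer") is heard, then collect the following words as a command.
def extract_commands(word_stream, hotword="computer"):
    commands, current = [], None
    for word in word_stream:
        if word.lower() == hotword:
            if current:            # close any command in progress
                commands.append(" ".join(current))
            current = []           # start listening for a new command
        elif current is not None:
            current.append(word)
    if current:
        commands.append(" ".join(current))
    return commands

stream = "hello there computer lights on computer play music".split()
print(extract_commands(stream))  # ['lights on', 'play music']
```

The hard part in practice is not this gating logic but spotting the wake word reliably in continuous, noisy audio without draining the device's battery.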
Of course, there are the HALs (2001), Marvins (no date) and C-3POs (a long, long time ago...), whose capacities far exceed anything we dare dream our mobile phones will one day understand. But here the problem seems less the quality of speech technology - the quality of HAL's speech synthesis is available today, and Marvin's characteristic monotone baritone should be easy to do - than the old hard-soft divide in Artificial Intelligence. As long as we use a hard-AI problem, which speech arguably is, to solve soft-AI problems ("find the closest pizza service"), we cannot fail to be disappointed.

Thursday, January 3, 2008

GOOG: We need more data

The old maxim "I need more data" should be familiar to anyone who has ever tried to wrestle with language technology issues, attempted speech application tuning or delved into any statistical approach to an AI-related problem. Google moved into the speech world last year with GOOG-411, a speech recognition driven directory assistance application (you say what you are looking for and where, it returns suitable businesses and connects you to the one you want or sends you details in an SMS).
Like all (well, most) other Google services, GOOG-411 is free for the end user. As such, the basic business model (collect data, turn data into cash) applies. This was recently confirmed in an interview with Marissa Mayer, Google's VP of Search Products and User Experience:

Whether or not free-411 is a profitable business unto itself is yet to be seen. I myself am somewhat skeptical. The reason we really did it is because we need to build a great speech-to-text model ... that we can use for all kinds of different things, including video search.
Google thus couples statistical AI and its general data-driven approach to everything in a novel way. In doing so, Google may find itself in a catch-up race with the likes of Nuance, Loquendo, IBM or Telisma, whose hold on speech recognition technology comes, in part, from speech and language databases aggregated during professional services projects.
What's new in Google's approach, however, is the convergence of the dual role that data plays in AI and in the overall service-driven business model. Google will presumably not be content to bootstrap a pattern-matching engine and sell licenses like the technology companies above. More interesting to follow will be the range of services Google can spin from this technology (context-sensitive video advertising, audio indexing, IVR hosting), which are more befitting of its overall company strategy.
Unsurprisingly, Mayer goes on to claim that Google isn't working on ways out of the world of brute-force data-driven algorithms:
People should be able to ask questions, and we should understand their meaning, or they should be able to talk about things at a conceptual level. ... A lot of people will turn to things like the semantic Web as a possible answer to that. But what we're seeing actually is that with a lot of data, you ultimately see things that seem intelligent even though they're done through brute force.
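Mayer's "brute force" point is easy to illustrate: even a bigram model with no linguistic knowledge at all starts to look intelligent once it has seen enough text. A toy sketch (real systems use vastly larger models and corpora):

```python
# Toy bigram model: counts of adjacent word pairs are enough to
# "predict" a plausible next word -- no grammar, no semantics, just data.
from collections import Counter, defaultdict

def train(corpus):
    model = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for a, b in zip(words, words[1:]):
            model[a][b] += 1
    return model

def predict(model, word):
    """Most frequently observed follower of `word`, or None."""
    followers = model.get(word.lower())
    return followers.most_common(1)[0][0] if followers else None

corpus = ["find the nearest pizza place",
          "call the nearest pizza service",
          "find the nearest taxi"]
model = train(corpus)
print(predict(model, "nearest"))  # 'pizza'
```

Scale the corpus from three sentences to billions and the predictions begin to "seem intelligent even though they're done through brute force", exactly as Mayer describes.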
User privacy advocates may also have a thought or two on this new dimension of data collection, as Google is beginning to lose the "conventionally trustworthy" image it has held among many over the past years. Fortunately, the ways in which speech data is commonly used to train pattern-matching models involve very little in the way of privacy infringement.
Happy data collecting!