Automatic speech recognition (ASR) technology and its counterpart, interactive voice response (IVR), are both important achievements in the development of human-machine interaction.
Even if you’ve never heard of these technologies by name, you would almost certainly recognize their practical applications. They’re used in systems such as automated telephone banking, computer voice-command systems (often built for users with disabilities), dictation software such as Dragon NaturallySpeaking and, most notably of all, the voice-powered interfaces of some modern smartphones. Siri is a well-known example of this last application.
Now that you have a basic picture of exactly what we’re talking about, you probably remember having used ASR technology at least once in your life. From here, we’re going to cover just how these systems actually work and how they’re made to meaningfully interpret human speech.
The Basics of ASR/IVR Technology: A Primer
The principal mechanism by which ASR technology works out what you’re trying to tell it, in a way that allows it to respond, follows these steps:
- You talk to your ASR-enabled device
- The ASR software turns your speech into a raw digital wave form
- The software then cleans this wave form up by reducing background noise and normalizing the volume
- The resulting clean wave form is then broken down into its component phonemes. Phonemes are the basic building-block sounds of a language, such as the “k”, “a” and “t” sounds that make up the word “cat”. English is usually counted as having around 44 of them, while Italian has roughly 30.
- Each phoneme is like an individual link in a long chain, and by analyzing the phonemes individually in sequence, ASR software can work out complete words and then complete sentences (see the toy sketch after this list)
- By working out words and sentences via phoneme analysis, your ASR software then “understands” you and can thus respond to your requests
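To make the chain-of-phonemes idea concrete, here is a toy Python sketch. The phoneme symbols, the pronunciation dictionary and the greedy longest-match strategy are all simplifying assumptions made for illustration; real ASR decoders score many candidate matches with probabilistic acoustic and language models instead of doing exact lookups.

```python
# Toy phoneme-sequence decoder: maps a chain of phonemes back to words.
# The dictionary and phoneme symbols below are illustrative assumptions,
# not a real ASR lexicon.

PRONUNCIATIONS = {
    ("w", "eh", "dh", "er"): "weather",
    ("f", "ao", "r", "k", "ae", "s", "t"): "forecast",
    ("k", "ae", "t"): "cat",
}

def decode(phonemes):
    """Greedily match the longest known pronunciation at each position."""
    words, i = [], 0
    while i < len(phonemes):
        for length in range(len(phonemes) - i, 0, -1):  # longest match first
            chunk = tuple(phonemes[i:i + length])
            if chunk in PRONUNCIATIONS:
                words.append(PRONUNCIATIONS[chunk])
                i += length
                break
        else:
            raise ValueError(f"no word matches phonemes at position {i}")
    return words

print(decode(["w", "eh", "dh", "er", "f", "ao", "r", "k", "ae", "s", "t"]))
# -> ['weather', 'forecast']
```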
Some Examples of ASR Variations
While there are a number of variations of ASR software in use, the two main versions of the technology are geared towards “directed dialogue conversations” and “natural language conversations”.
Directed dialogue conversation: ASR systems that work on this operating principle are generally much simpler than their natural language counterparts. This type of ASR/IVR technology is basically what you’d find in any automated telephone banking interface: the system verbally asks you to select specific words from a limited menu of choices and responds only to the words on that menu. As the toy sketch below shows, it’s quite limited and simple.
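Here is a minimal sketch of that idea in Python. The menu options and canned replies are invented for illustration; a real telephone banking system would sit behind speech recognition and telephony layers rather than plain text input.

```python
# Minimal directed-dialogue sketch: the system only reacts to the exact
# menu words it prompted for; everything else is rejected. The menu
# options and replies here are invented for illustration.

MENU = {
    "balance": "Your balance is being retrieved.",
    "transfer": "Starting a transfer.",
    "operator": "Connecting you to an operator.",
}

def handle(utterance):
    word = utterance.strip().lower()
    if word in MENU:
        return MENU[word]
    return 'Sorry, please say "balance", "transfer", or "operator".'

print(handle("balance"))    # recognized menu word
print(handle("mortgage"))   # anything off-menu falls through
```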
Natural language conversation: Natural language ASR systems are the much more sophisticated version of the technology. They represent the latest stage in its development and are designed to let a person have a more open-ended conversation with the system. The more sophisticated the ASR being used, the more natural and “close to human level” that conversation can be. The iPhone’s Siri is a fairly advanced example of natural language ASR.
So How Does Natural Language ASR Work?
As the cutting edge of ASR technology, natural language systems are much harder to develop than their directed dialogue variants, largely because making a mindless machine respond in conversation as if it were a conscious being is a considerable challenge.
For example, a typical 60,000-word ASR vocabulary allows over 215 trillion possible three-word combinations (60,000 × 60,000 × 60,000 comes to about 216 trillion)! Humans get past this obstacle by intuiting what to say without scanning every word combination we could possibly come up with, and modern ASR systems are, in a certain sense, programmed to do the same.
Instead of making the software scan through 215 trillion word combinations, or even all 60,000 words in its vocabulary, programmers have created a “tagged keyword” list: words that provide context for a list of common human requests.
Thus, for example, if you say the word “forecast” to your ASR-enabled device, it will guess that the other word you said is “weather” and not “whether”, based on the context created by your saying “forecast”.
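A crude way to picture the tagged keyword idea in Python: each candidate spelling carries a set of context tags, and the candidate that shares the most tags with the rest of the utterance wins. The tag sets here are invented for illustration; real systems learn these associations statistically from data.

```python
# Toy homophone disambiguation via tagged keywords: pick the spelling
# whose context tags overlap most with the other words the user said.
# The tag sets are invented for illustration.

TAGGED_KEYWORDS = {
    "weather": {"forecast", "rain", "sunny", "temperature"},
    "whether": {"decide", "choose", "or", "not"},
}

def disambiguate(candidates, context_words):
    """Return the candidate sharing the most tags with the context."""
    context = set(context_words)
    return max(candidates, key=lambda c: len(TAGGED_KEYWORDS[c] & context))

heard = ["forecast", "for", "tomorrow"]
print(disambiguate(["weather", "whether"], heard))  # -> 'weather'
```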
The Tuning Test: How ASR Is “Trained” to Understand You Better
There are two main ways of training ASR systems to be better at conversation: human “tuning” and a machine-performed process of “active learning”.
Human Tuning: This is how simpler ASR systems are taught new vocabulary. It basically consists of human programmers going through the software’s conversation logs, reviewing them for new words and adding those the system has heard often enough to the ASR vocabulary.
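As a rough sketch of what that log review looks like, assuming a plain-text log and a frequency threshold (both invented for this example), the snippet below surfaces out-of-vocabulary words heard often enough to be worth a human’s attention:

```python
# Human-tuning helper sketch: scan conversation logs for words that are
# not yet in the vocabulary but occur often enough to review. The log
# lines, vocabulary, and threshold are invented for illustration.

from collections import Counter

VOCABULARY = {"check", "my", "balance", "please", "transfer"}
THRESHOLD = 2  # minimum times a new word must be heard

logs = [
    "check my balance please",
    "check my overdraft please",
    "what is my overdraft",
]

counts = Counter(
    word for line in logs for word in line.split() if word not in VOCABULARY
)
candidates = [word for word, n in counts.items() if n >= THRESHOLD]
print(candidates)  # -> ['overdraft'], flagged for a human to review and add
```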
Active Learning: Active learning is the much more sophisticated version of ASR learning. In basic terms, the software itself is designed to autonomously learn and adopt new words and word uses on the fly if it hears them often enough. The ASR can then give much more contextually correct and user-specific responses to the human speaking to it.
An example of this would be a human user repeatedly cancelling the auto-correct on a certain word until the software recognizes the new use of the word as permissible.
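In the same spirit, here is a minimal sketch of that feedback loop, assuming a simple override counter and adoption threshold (neither drawn from any real ASR product): once the user has cancelled the correction often enough, the word joins the vocabulary with no human programmer involved.

```python
# Active-learning sketch: the system adopts a word on its own once the
# user has overridden its correction often enough. The threshold and
# class structure are assumptions for illustration, not a real ASR API.

from collections import defaultdict

class ActiveLearner:
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.vocabulary = {"weather", "forecast"}
        self.overrides = defaultdict(int)

    def user_rejected_correction(self, word):
        """Called whenever the user cancels an auto-correct of `word`."""
        self.overrides[word] += 1
        if self.overrides[word] >= self.threshold:
            self.vocabulary.add(word)  # adopted autonomously

learner = ActiveLearner()
for _ in range(3):
    learner.user_rejected_correction("amazeballs")
print("amazeballs" in learner.vocabulary)  # -> True
```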
So this is our basic introduction to the powerful and fascinating developments that make human/machine voice interaction possible. If you want to explore further in a more visual way, check out the awesome infographic that complements this post, from the people at West Interactive.