Monday, May 21, 2012

CART Problem Solving: Speech Recognition Part II

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Test Nerves
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV

CART PROBLEM: People don't understand why accurate automated speech recognition is incredibly hard for a computer to do.

I hear it all the time: "Hey, where can I buy that software you're using? It's so cool! I want it for my cocktail parties!" or "Wow, is your computer the one writing down what I'm saying? How much did that cost?" or "Oh, so you're the caption operator? Hey, what's that weird-looking machine for? Does it hook into the speech recognition software somehow?"

So many people think that completely automated speaker-independent speech recognition is already here. They think it's been here for years. Why? Well, people have had personal computers for several decades now, and even before computers were affordable, they could see people on television -- on Star Trek, most prominently, but in nearly every other science fiction show as well -- telling their computers to do things, asking them questions, and getting cogent, grammatical answers back. Why was this such a common trope in popular culture? Because typing is boring. It's bad television. Much better to turn exposition into a conversation than to force the viewers to read it off a screen. So in fiction, people have made themselves understood to computers by talking for a long, long time. In real life, they never have, and I think it's pretty plausible that they never will.

Don't get me wrong. I'm not denying the utility of voice recognition software for the purposes of dictation. It's very useful stuff, and it's improved the world -- especially the worlds of people with fine motor disabilities -- immeasurably. But the following statement, while true, turns out to be incredibly counter-intuitive to most people:

There is a huge qualitative difference between the voice of someone speaking to a computer and the voice of someone speaking to other humans.

People who have used voice recognition software with any success know that they need to make sure of several things if they want a clean transcript:

1. They need to speak directly into the microphone.
2. They need to articulate each word clearly.
3. They need to be on their guard for errors, so they can stop and correct them as they occur.
4. They need to eliminate any background interference.
5. The software needs to be trained specifically to their voice.

Even so, not everyone can produce speech that voice recognition software will recognize, no matter how much training they do (see Speech Recognition Part I for more potential stumbling blocks). And those who can will find that if they record themselves speaking normally in typical conversation with other people and then feed that recording through the speech engine, their otherwise tolerable accuracy drops alarmingly. People don't speak to computers the way they speak to other people. If they did, the computers would never have a chance.

Why is this? The answer is so obvious that many people have never thought about it before: Ordinary human speech is an incredibly lossy format, and we only understand each other as well as we do by making use of semantic, contextual, and gestural clues. But because so much of this takes place subconsciously, we never notice that we're filling in any of those gaps. Like the eye's blind spot, our brain smooths over any drops or jagged edges in our hearing by interpolating auxiliary information from our pattern-matching and language centers, and doesn't even tell us that it's done it.

What does it mean to say that human speech is lossy? People don't tend to talk in reverberant sound stages with crisp, clear diction and an agreed-upon common vocabulary. They mumble. They stutter. They trail off at the ends of sentences. They use unfamiliar words or make up neologisms. They turn their heads from one side to the other, greatly altering the pattern of sound that reaches your ears. A fire truck will zoom by beneath the window, drowning out half a sentence. But most of the time, unless it's really bad, you don't even notice. You're paying attention to the content of the speech, not just the sounds.

An interesting demonstration of this is to listen to a few minutes of people speaking in a language you don't know and try to pick out a particular word from the otherwise indiscernible flow. If it's several syllables long, or if it's pronounced in an accent similar to your own, you'll probably be able to do it. But if it's just one or two syllables, you'll have a very difficult time. It's much harder than latching on to a familiar word in a conversation in your own language, even if that conversation's audio quality is much worse, with tons of static interference and distortion.

Humans can ignore an awful lot of random fluff and noise if they're able to utilize meaning in speech to compensate for errors in the sound of it. Without meaning, they're in the same state as computers: Reduced to approximations and probabilistic guessing.

No computers can use these semantic clues to steer by, and they won't be able to until they've achieved real, actual artificial intelligence: independent consciousness. It's an open question whether that will ever happen (though I'm putting my money on nope), but it's certainly true that in 50 years of trying to achieve it, computer scientists have made little to no progress. What they have been able to do is to mimic certain abilities that humans have in a way that makes them look as if they're understanding meaning. If you say a word or phrase to a speech recognition engine, it'll sort through vast networks of data, in which sound patterns or words are linked according to how common or prominent they are compared to the rest of the network. For example, if you said something that sounded like "wenyu dottanai", it would compare thousands of short speech snippets until it found several matches for sounds that were very close (though never completely identical) to what "wenyu" sounded like in your own individual voice, in your own individual accent. It would probably come up with "when you". "Dottanai" would go through the same treatment, and the vast majority of similar matches would come up "dot an i"; it's a very common phrase. In most circumstances, it would probably be a pretty good bet.

If you were using this engine to transcribe the optometry interview I transcribed this evening, though, the answer it came up with would be completely wrong. Because this optometrist wasn't talking about dotting an i or crossing a t. He was talking about measuring the optical centers of a patient's vision, which he does by marking dots on a lens over each pupil. It wouldn't be the computer's fault for not getting that; it wouldn't have been paying attention to the conversation. Computers just figure out probabilities for each new chunk of sound. On Google, "dot an eye" gets 11,500 results, compared to 340,000 for "dot an i". Mathematically, it was a pretty safe bet. Semantically, it was completely nonsensical.
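To put rough numbers on that bet, here's a minimal Python sketch. It is not how any real speech engine works; it only assumes that the engine has to choose between two transcriptions that sound identical, and that the Google result counts quoted above are a fair stand-in for how common each phrase is. The names `PHRASE_COUNTS` and `pick_by_frequency` are made up for this illustration.

```python
# Result counts quoted in the post. Treating search hits as a proxy for
# phrase frequency is an assumption made purely for illustration.
PHRASE_COUNTS = {
    "dot an i": 340_000,
    "dot an eye": 11_500,
}


def pick_by_frequency(candidates, counts):
    """Return the most common candidate and its share of the total count.

    Nothing here looks at the surrounding conversation: the choice is
    driven entirely by how often each phrase shows up in the counts.
    """
    total = sum(counts[c] for c in candidates)
    best = max(candidates, key=lambda c: counts[c])
    return best, counts[best] / total


best, share = pick_by_frequency(["dot an i", "dot an eye"], PHRASE_COUNTS)
print(f"{best!r} wins with {share:.1%} of the combined results")
```

With these counts, "dot an i" takes about 96.7% of the combined results, roughly 30 to 1 in its favor. A chooser like this will pick it every time, including in the one room where the speaker is an optometrist marking dots over a patient's pupils.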

It can be hard to convince people of this principle, because a lot of times they still want to believe in the Star Trek model of speech recognition rather than the actual real-life one. So I've come up with a few brief anecdotes and illustrations to help get the message across. It's awfully late, though, so I think I'll have to leave those for Speech Recognition Part III. Here's a teaser, if you're interested:

* Belize, the passport, and the pixelated photo.
* Why hasn't microphone technology improved more in 100 years?
* Why do OCR errors still exist in paper-to-digital text conversion?
* Your physics professor decided to do her lecture in Hungarian today, but don't worry; you'll get a printed translation in two days.
* Trusting big screen open captioning to an automated system is a mistake event organizers only make once.
* Determinism versus the black box.
* The Beatboxer model of voice writing.


  1. "People don't tend to talk in reverberant sound stages with crisp, clear diction and an agreed-upon common vocabulary." - I agree with that. Sometimes written words are more understandable, although they might be misinterpreted because of the lack of gestures or emotion (or even improper punctuation). That's why in court it's always necessary to have a court reporter who understands our lossy speech format and types it up, together with video documentation to capture the gestures.

    - Kate Cazaly