CART Problem Solving Series
Superscript and Subscript
Communicating Sans Steno
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV
Apologies for the lack of captioning in the first few seconds of the video, but I had to post it. It's a fantastic illustration of not just how often automatic speech recognition gets things wrong, but how wrong it tends to get them. There's a whole series of these Caption Fail videos, but this is the most work safe (and, in my opinion, funniest) of the lot.
See, because computers aren't able to make up for lossy audio by filling in the gaps using semantic and contextual clues, they make mistakes that a human transcriber would never make in a million years. Sometimes the error is understandable: you get "displaces" instead of "this place is". That's reasonable enough, and out of context a human might have made that mistake. But when a man tries to imitate the sound of the ocean with his mouth ("fwooosssssh"), a human hears a sound effect, while the computer continues to try to read it as speech and translates it as "question of." Not only was it unable to differentiate between words and sound effects, but "fwooosssssh" doesn't sound anything like "question of." The algorithms that computers use to match similar sound patterns to each other are so alien to our way of thinking that, unlike mistakes made by humans, we can't even hope to read through them to figure out what the correct version should have been.
I promised you some illustrations to use when trying to explain why accurate speaker-independent automated speech recognition is not "just around the corner", despite the popular conception that it is. I think it's useful to consider your audience when trying to explain these. If you're talking to computer people, bringing in the parallels with OCR might be more effective than if you're talking to people who haven't used that sort of technology. If someone has never heard a beatboxer, my voicewriting-to-steno analogy comparing beatboxing to drumming won't mean much. Try to get an idea of the person's frame of reference first, and then construct your argument.
Belize, the Passport, and the Pixelated Photo
You know those procedural crime shows? Where the first 20 minutes is taken up with chasing seemingly disconnected clues, and then at last one of the detectives has a sudden flash of insight, puts all the pieces together, and knows where to find the suspect? Sometimes those shows do a very misleading thing. They'll get a blurry photo, something captured by a security camera or a helicopter, and the detective will say, "There! Zoom in right there!" Then the screen will shift and you'll get the same photo, except that the formerly blurry house will be nice and clear, and in the window... Is that a man with a gun? The detective will shout, "Zoom in on that flash of light there!" Again, the pixels will smear and redraw themselves, and look! Reflected in the man's glasses is another man in a Panama hat, wearing a mechanic's shirt with "Jerry" embroidered on the pocket and wielding a tire iron!
It's all very exciting, but it's also very wrong. If you take a blurry photograph and blow it up, you don't get a clearer view of its details; you just get a blurrier photograph with larger pixels. This is the essence of lossiness, but in an image rather than a sound. That visual information was lost when the photo was taken, and no amount of enhancement or sharpening tools will ever get it back. Unlike computers, humans are extremely good at inferring ways of filling in the gaps of lossy material, by using lateral clues. If the hard-won blurry photo of the criminal's coffee table just before he set his house on fire depicts a guide book, a passport, and a bottle of SPF 75 suntan lotion, a computer will either throw up its hands and say, "Not enough information found" or it will produce complete gibberish when trying to decipher its details. A human, on the other hand, will see that the letters on the guidebook, while extremely indistinct, seem to have approximately five spaces between them. The first letter is either an R, a P, or a B, and the last one is quite possibly an E. The passport shows that the criminal will be leaving the country, and the suntan lotion indicates that the location gets lots of sun. The savvy human detective paces around a bit and says -- I've got it! Belize! It's the only country that fits the pattern! Then they hop on the next flight and catch the criminal before the final credits.
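For the computer people in your audience, the point about blown-up photos can be sketched in a few lines of Python. This is only a toy illustration under made-up assumptions: the two "scenes" are invented rows of brightness values, and `downsample` and `upsample` are hypothetical helper names standing in for a low-resolution camera and the "zoom in" button, not any real image tool.

```python
def downsample(row, factor):
    """Average each run of `factor` samples into one value (the lossy step)."""
    return [sum(row[i:i + factor]) / factor for i in range(0, len(row), factor)]


def upsample(row, factor):
    """Blow the row back up by repeating each value `factor` times."""
    return [value for value in row for _ in range(factor)]


# Two different invented "scenes": one row of pixel brightness values each.
scene_a = [0, 8, 0, 8, 4, 4, 4, 4]
scene_b = [4, 4, 4, 4, 8, 0, 8, 0]

# The security camera only keeps one averaged value per four pixels.
blurry_a = downsample(scene_a, 4)
blurry_b = downsample(scene_b, 4)

# Both scenes collapse to the same blurry photo, so the photo alone
# cannot say which scene it came from.
print(blurry_a == blurry_b)

# "Zooming in" gives bigger pixels, not the original detail.
enlarged = upsample(blurry_a, 4)
print(enlarged == scene_a)
```

Because two different scenes produce an identical blurry row, nothing in the blurry row itself can tell you which one you started with. Any "enhancement" has to bring that information in from somewhere else, which is exactly what the human detective does with the passport and the suntan lotion.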
The humans didn't need to fill in the gaps of the letters on the guide book in order to figure out the word it spelled, because they were able to use a host of non-textual clues to make up for the lossiness. Because computers programmed to recognize text aren't able to draw on all that subsidiary information, humans will always have an advantage in recognizing patterns, drawing inferences, and correcting errors caused by lossy input.
Why Hasn't Microphone Technology Improved More in 100 Years?
This leads me to the subject of speech recognition, which is a much thornier problem than text recognition. The answer to the question in the heading is simple: It has. Listen to an old Edison wax cylinder recording and compare it to CD-quality sound produced by today's audio engineers, and you can hardly claim that recording technology has been stagnant. But the question behind my question is this: With all this fantastic audio recording technology, why is it still close to impossible to get quality audio in a room full of more than a handful of people? Make sure every speaker passes around a hand mic, or wears a lapel mic, or goes up to talk into the lectern mic, and you're fine. Put the best and most expensive multi-directional microphone on the market in the center of the room and half a dozen people sitting around a conference table, and you're sunk. Everyone sounds like they're underwater. The guy adjusting his tie near the microphone sounds like a freight train, while the guy speaking clearly and distinctly at the other end of the table sounds like he's gargling with marbles. Even $4,000 hearing aids have this problem. They're simply not as good as human ears (or, more accurately, the human brain) at filtering out meaningless room noises and selectively enhancing the audio of speakers at a distance. That's why onsite CART is often more accurate by several orders of magnitude than remote CART, no matter how much money is spent on microphones and AV equipment. When the bottleneck of sound input is a microphone, it's limited by its sensitivity, its distance from the speaker, and any interference between the two. That's the problem that still hasn't been solved, over a hundred years since the invention of recorded audio.
Having to transcribe terrible audio, guessing at omissions and listening to a five-second clip of fuzzy sound a dozen times before finally figuring out from context what it's actually about, has been a real lesson in empathy for me. The frustration I feel in my home office, clicking the foot pedal over and over and listening to imperfect audio many times over, is nothing compared to what my hard of hearing clients feel in the course of their lives every day. They don't have a foot pedal to rewind the last few seconds of conversation, and they're not even getting paid to do this endlessly unrewarding detective work. I suppose I should feel even more sorry for the poor computers, who are trying to deal with substandard audio but don't have the luxury of lateral thinking or contextual clues or the ability to differentiate between soundalike phrases semantically. I've often wanted to rent out an hour of an audiologist's time and hook up the most popular commercial speech recognition software to their test system. I'd be very interested to see how it did. Of course, it could recognize all the tones perfectly well. It might even be all right at the individual words. But unlike a human with hearing loss, who usually does better at guessing words in the context of sentences than hearing them on their own, I bet you that the software would do considerably less well on sentences, and would probably come out with an audiogram that put it in the range of moderate to severe hearing loss, especially if any of the tests were given with simulated noise interference mixed into the audio feed. I could be wrong, of course; I haven't yet gotten a chance to actually do this. But I'd be very interested to find out.
Well, this has run rather longer than I thought it would. I guess I'm going to have to do Speech Recognition Part IV next week. 'Til then!