StenoKnight CART Blog: May 2012

Monday, May 28, 2012

CART Problem Solving: Speech Recognition Part III

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV

Apologies for the lack of captioning in the first few seconds of the video, but I had to post it. It's a fantastic illustration of not just how often automatic speech recognition gets things wrong, but how wrong it tends to get them. There's a whole series of these Caption Fail videos, but this is the most work safe (and, in my opinion, funniest) of the lot.

See, because computers aren't able to make up for lossy audio by filling in the gaps using semantic and contextual clues, they make mistakes that a human transcriber would never make in a million years. On the one hand, you get "displaces" instead of "this place is". That's reasonable enough, and out of context a human might have made that mistake. But when a human hears "fwooosssssh" as a man tries to imitate the sound of the ocean with his mouth, the computer continues to try to read it as speech, and translates it "question of." Not only was it not able to differentiate between words and sound effects, but "fwoooosh" doesn't sound anything like "question of." The algorithms that computers use to match similar sound patterns to each other are so alien to our way of thinking that, unlike mistakes made by humans, we can't even hope to read through them to figure out what the correct version should have been.

I promised you some illustrations to use when trying to explain why accurate speaker independent automated speech recognition is not "just around the corner", despite the popular conception that it is. I think it's useful to consider your audience when trying to explain these. If you're talking to computer people, bringing in the parallels with OCR might be more effective than if you're talking to people who haven't used that sort of technology. If someone has never heard a beatboxer, my voicewriting to steno analogy comparing beat boxing to drumming won't mean much. Try to get an idea of the person's frame of reference first, and then construct your argument.

Belize, the Passport, and the Pixelated Photo

You know those procedural crime shows? Where the first 20 minutes is taken up with chasing seemingly disconnected clues, and then at last one of the detectives has a sudden flash of insight, puts all the pieces together, and knows where to find the suspect? Sometimes those shows do a very misleading thing. They'll get a blurry photo, something captured by a security camera or a helicopter, and the detective will say, "There! Zoom in right there!" Then the screen will shift and you'll get the same photo, except that the formerly blurry house will be nice and clear, and in the window... Is that a man with a gun? The detective will shout, "Zoom in on that flash of light there!" Again, the pixels will smear and redraw themselves, and look! Reflected in the man's glasses is another man in a Panama hat, wearing a mechanic's shirt with "Jerry" embroidered on the pocket and wielding a tire iron!

It's all very exciting, but it's also very wrong. If you take a blurry photograph and blow it up, you don't get a clearer view of its details; you just get a blurrier photograph with larger pixels. This is the essence of lossiness, but in an image rather than a sound. That visual information was lost when the photo was taken, and no amount of enhancement or sharpening tools will ever get it back. Unlike computers, humans are extremely good at inferring ways of filling in the gaps of lossy material, by using lateral clues. If the hard-won blurry photo of the criminal's coffee table just before he set his house on fire depicts a guide book, a passport, and a bottle of SPF 75 suntan lotion, a computer will either throw up its hands and say, "Not enough information found" or it will produce complete gibberish when trying to decipher its details. A human, on the other hand, will see that the letters on the guidebook, while extremely indistinct, seem to have approximately five spaces between them. The first letter is either an R, a P, or a B, and the last one is quite possibly an E. The passport shows that the criminal will be leaving the country, and the suntan lotion indicates that the location gets lots of sun. The savvy human detective paces around a bit and says -- I've got it! Belize! It's the only country that fits the pattern! Then they hop on the next flight and catch the criminal before the final credits.

The humans didn't need to fill in the gaps of the letters on the guide book in order to figure out the word it spelled, because they were able to use a host of non-textual clues to make up for the lossiness. Because computers programmed to recognize text aren't able to draw on all that subsidiary information, humans will always have an advantage in recognizing patterns, drawing inferences, and correcting errors caused by lossy input.

Why Hasn't Microphone Technology Improved More in 100 Years?

This leads me to the subject of speech recognition, which is a much thornier problem than text recognition. The answer to the question is simple: It has. Listen to an old Edison wax disc record and compare it to CD-quality sound produced by today's audio engineers, and you can hardly claim that recording technology has been stagnant. But the question behind my question is this: With all this fantastic audio recording technology, why is it still close to impossible to get quality audio in a room full of more than a handful of people? Make sure every speaker passes around a hand mic, or wears a lapel mic, or goes up to talk into the lectern mic, and you're fine. Put the best and most expensive multi-directional microphone on the market in the center of the room and half a dozen people sitting around a conference table, and you're sunk. Everyone sounds like they're underwater. The guy adjusting his tie near the microphone sounds like a freight train, while the guy speaking clearly and distinctly at the other end of the table sounds like he's gargling with marbles. Even $4,000 hearing aids have this problem. They're simply not as good as human ears (or, more accurately, the human brain) at filtering out meaningless room noises and selectively enhancing the audio of speakers at a distance. That's why onsite CART is often more accurate by several orders of magnitude than remote CART, no matter how much money is spent on microphones and AV equipment. When the bottleneck of sound input is a microphone, it's limited by its sensitivity, its distance from the speaker, and any interference between the two. That's the problem that still hasn't been solved, over a hundred years since the invention of recorded audio.

Having to transcribe terrible audio, guessing at omissions and listening to a five-second clip of fuzzy sound a dozen times before finally figuring out from context what it's actually about has been a real lesson in empathy for me. The frustration I feel in my home office, clicking the foot pedal over and over and listening to imperfect audio many times over, is nothing compared to what my hard of hearing clients feel in the course of their lives every day. They don't have a foot pedal to rewind the last few seconds of conversation, and they're not even getting paid to do this endlessly unrewarding detective work. I suppose I should feel even more sorry for the poor computers, who are trying to deal with substandard audio but don't have the luxury of lateral thinking or contextual clues or the ability to differentiate between soundalike phrases semantically. I've often wanted to rent out an hour of an audiologist's time and hook up the most popular commercial speech recognition software to their test system. I'd be very interested to see how it did. Of course, it could recognize all the tones perfectly well. It might even be all right at the individual words. But unlike a human with hearing loss, who usually does better at guessing words in the contexts of sentences than hearing them on their own, I bet you that the software would do considerably less well, and would probably come out with an audiogram that put them in the range of moderate to severe hearing loss, especially if any of the tests were given with stimulated noise interference mixed into the audio feed. I could be wrong, of course; I haven't yet gotten a chance to actually do this. But I'd be very interested to find out.

Well, this has run rather longer than I thought it would. I guess I'm going to have to do Speech Recognition Part IV next week. 'Til then!

Monday, May 21, 2012

CART Problem Solving: Speech Recognition Part II

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV

CART PROBLEM: People don't understand why accurate automated speech recognition is incredibly hard for a computer to do.

I hear it all the time: "Hey, where can I buy that software you're using? It's so cool! I want it for my cocktail parties!" or "Wow, is your computer the one writing down what I'm saying? How much did that cost?" or "Oh, so you're the caption operator? Hey, what's that weird-looking machine for? Does it hook into the speech recognition software somehow?"

So many people think that completely automated speaker-independent speech recognition is already here. They think it's been here for years. Why? Well, people have had personal computers for several decades now, and even before computers were affordable, they could see people on television -- on Star Trek, most prominently, but in nearly every other science fiction show as well -- telling their computers to do things, asking them questions, and getting cogent, grammatical answers back. Why was this such a common trope in popular culture? Because typing is boring. It's bad television. Much better to turn exposition into a conversation than to force the viewers to read it off a screen. So in fiction, people have made themselves understood to computers by talking for a long, long time. In real life, they never have, and I think it's pretty plausible that they never will.

Don't get me wrong. I'm not denying the utility of voice recognition software for the purposes of dictation. It's very useful stuff, and it's improved the world -- especially the worlds of people with fine motor disabilities -- immeasurably. But the following statement, while true, turns out to be incredibly counter-intuitive to most people:

There is a huge qualitative difference between the voice of someone speaking to a computer and the voice of someone speaking to other humans.

People who have used voice recognition software with any success know that they need to make sure of several things if they want a clean transcript:

1. They need to speak directly into the microphone. 2. They need to articulate each word clearly. 3. They need to be on their guard for errors, so they can stop and correct them as they occur. 4. They need to eliminate any background interference. 5. The software needs to be trained specifically to their voice.

Even so, not everyone can produce speech that can be recognized by voice recognition software no matter how much training they do (see Speech Recognition Part I for more potential stumbling blocks), and they'll also find that if they try to record themselves speaking normally in typical conversation with other people and then feed that recording through the speech engine, their otherwise tolerable accuracy will drop alarmingly. People don't speak to computers the way they speak to other people. If they did, the computers would never have a chance.

Why is this? The answer is so obvious that many people have never thought about it before: Ordinary human speech is an incredibly lossy format, and we only understand each other as well as we do by making use of semantic, contextual, and gestural clues. But because so much of this takes place subconsciously, we never notice that we're filling in any of those gaps. Like the eye's blind spot, our brain smooths over any drops or jagged edges in our hearing by interpolating auxiliary information from our pattern-matching and language centers, and doesn't even tell us that it's done it.

What does it mean to say that human speech is lossy? People don't tend to talk in reverberant sound stages with crisp, clear diction and an agreed-upon common vocabulary. They mumble. They stutter. They trail off at the ends of sentences. They use unfamiliar words or make up neologisms. They turn their heads from one side to the other, greatly altering the pattern of sound that reaches your ears. A fire truck will zoom by beneath the window, drowning out half a sentence. But most of the time, unless it's really bad, you don't even notice. You're paying attention to the content of the speech, not just the sounds. An interesting demonstration of this is to try to listen to a few minutes of people speaking in a language you don't know and to try to pick out a particular word from the otherwise indiscernible flow. If it's several syllables long, or if it's pronounced in an accent similar to your own, you'll probably be able to do it. But if it's just one or two syllables, you'll have a very difficult time, much harder than if you were listening to the same conversation in your own language -- even if the audio quality was much worse than the other conversation, with tons of static interference and distortion -- and you were trying to latch on to a familiar word instead.

Humans can ignore an awful lot of random fluff and noise if they're able to utilize meaning in speech to compensate for errors in the sound of it. Without meaning, they're in the same state as computers: Reduced to approximations and probabilistic guessing.

No computers can use these semantic clues to steer by, and they won't be able to until they've achieved real, actual artificial intelligence; independent consciousness. It's an open question whether that will ever happen (though I'm putting my money on nope), but it's certainly true that in 50 years of trying to achieve it, computer scientists have made little to no progress. What they have been able to do is to mimic certain abilities that humans have in a way that makes them look as if they're understanding meaning. If you say a word or phrase to a speech recognition engine, it'll be able to sort through vast networks of data, in which connections between sound patterns or words are linked by how common or prominent they are compared to the rest of the network. For example, if you said something that sounded like "wenyu dottanai", it would compare thousands of short speech snippets until it found several matches for sounds that were very close (though never completely identical) to what "wenyu" sounded like in your own individual voice, in your own individual accent. It would probably come up with "when you". "Dottanai", likewise, would go through the same treatment, and the vast majority of similar matches would come up "dot an i"; it's a very common phrase. In most circumstances, it would probably be a pretty good bet.

If you were using this engine to transcribe the optometry interview I transcribed this evening, though, the answer it came up with would be completely wrong. Because this optometrist wasn't talking about dotting an i or crossing a t. He was talking about measuring the optical centers of a patient's vision, which he does by marking dots on a lens over each pupil. It wouldn't be the computer's fault for not getting that; it wouldn't have been paying attention to the conversation. Computers just figure out probabilities for each new chunk of sound. On Google, "dot an eye" gets 11,500 results, compared to 340,000 for "dot an i". Mathematically, it was a pretty safe bet. Semantically, it was completely nonsensical.

It can be hard to convince people of this principle, because a lot of times they still want to believe in the Star Trek model of speech recognition rather than the actual real-life one. So I've come up with a few brief anecdotes and illustrations to help get the message across. It's awfully late, though, so I think I'll have to leave those for Speech Recognition Part III. Here's a teaser, if you're interested:

* Belize, the passport, and the pixelated photo.
* Why hasn't microphone technology improved more in 100 years?
* Why do OCR errors still exist in paper-to-digital text conversion?
* Your physics professor decided to do her lecture in Hungarian today, but don't worry; you'll get a printed translation in two days.
* Trusting big screen open captioning to an automated system is a mistake event organizers only make once.
* Determinism versus the black box.
* The Beatboxer model of voice writing.

Wednesday, May 16, 2012

Ergonomic update

I've been thinking about ergonomics a lot since my previous post on the subject. In my onsite work, I only spend a few hours at a time in any one position; classes range from 1 to 3 hours, but there are usually breaks every hour or so. Because I'm working from home this summer, though, I've been spending up to 4 hours at a time at my desk doing CART, and then several more hours doing transcript editing, transcription work, or miscellaneous administrative tasks. Unfortunately, it's made me realize how un-ergonomic my setup really is, and how vital it is not to succumb to the temptation to just plant myself in one place and not move from it until the end of the day. My back and shoulders have been warning me that I'd better mix something up soon, or they're really going to start complaining.

I've helped solve my leg fatigue by using foam blocks as foot rests, because they can be shifted around and rolled from back to front under my feet whenever my legs start tightening up. I can also change their height by resting them on their three different edges, and if I want them even higher, I can stack one on top of the other.

One thing that's helped with the planting problem has been to move from the desk to the couch for transcription work, as I mentioned in my previous post, but also to run off of battery power initially, so that when my laptop's battery dies about 1.5 hours later, I'm forced to get up and go into the office for the charger. It might sound silly, but if I don't create those distractions for myself, I have a tendency not to move until my work is done, which is a habit I need to figure out how to break.

Here's another thing I've done, which seems to help a fair amount during the actual remote CART work itself:

As I've said many times, I adore my split-keyboard setup. The only thing that sometimes bugged me, though, is that my desk chair is a little too deep, so in order to reach the keyboard I have to either lean forward (hard on the back), bolster the seat back with several pillows (they tend to slip around and aren't that comfortable), or tilt the tripod forward and the two halves of the steno machine up, which doesn't quite work, because the main arm of the tripod still tends to get in the way. Yesterday I hit on a new solution: I took the armature from my old Gemini 2 machine, put it on a second tripod, and then unscrewed my Infinity Ergonomic from its own armature, putting one half of it on the original tripod and one half on the new one. This allows me to put one tripod on either side of my desk chair, eliminating interference from the tripod's main arm. It's working quite well so far.

I've got a feeling there's one more piece to the puzzle, though. My current desk chair was $50. I bought it at Staples last year. It's really not ideal; there's no lumbar support, it doesn't go high enough, it wobbles a lot, and it's just generally uncomfortable. Every day I spend in it makes me resent it a little bit more. I'm seriously considering buying a fancier chair, but they can be amazingly expensive. Someone on one of the captioner forums I read recommended this one:

It looks great, doesn't it? Ball jointed lumbar support. Headrest. Tons of adjustable settings. But it's $500. Yikes. Do I really want to spend that much money for a chair? Are there cheaper but still ergonomic alternatives out there? If any of you have recommendations, I'd very much like to hear them.

Monday, May 14, 2012

Speech Recognition Part I

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV

CART PROBLEM: People claim that someday very soon human captioners will be replaced by automated speech recognition engines.

I've got a lot to say on this subject, but it's already late, so I'm going to leave most of the heavy duty analysis for next week's post. For now, I just want to show you a few examples. I first posted this video in 2010:

It's actual classroom audio from MIT's open courseware project. The video is of me captioning it live, using Eclipse. After posting it, I made a post on this blog about my CART accuracy versus the accuracy of YouTube's autocaptions, which at that time had just been released, with promises of increasing accuracy as the time went on.

Here's the transcript of the original autocaptions from 2010:

for implants support and you know as I haven't said anything about biology those folks didn't really need to be educated and genetics biochemistry more about it so about the to solve those problems and that's because biology as it used to be was not a science that engineers could addressed very well because in order for engineers really analyze study quantitatively develop models at the bill technologies all for the parts there's a lot of requirements on the science that really biology that that's why %um %um the actual mechanisms a function work understood yes you could see that moving your arm requires certain force in of would where certain load we really didn't know what was going on down in the proteins and cells and tissues of the muscles on the books okay but still you could decide maybe an artificial %uh to do this %um %uh in the plan you really know the molecular compliments so how the world he actually manipulate the system he didn't even know what the molecules work they're really underlying yes you couldn't really do the chemistry on the biological all %uh it's very hard to quantify you can even though the parts of the mechanisms how could you get quantitative measurements for them develop models so there's good reason why they're never really was a biological engineering until very recently while he wasn't the science that was released soon right here in dallas so there for the world biomedical engineering bailey although the deposition prompted it just talked about that that necessarily require miles per se but that's changed the good news for you folks is biology this change it's now a science that engineers had it been that to very well

Here's the updated transcript from the new autocaptions produced by YouTube's updated and improved speech recognition engine, circa 2012:

for implants and so forth and you know as i haven't said anything about biology those folks didn't really need to be educated in genetics biochemistry molecular body cell bowed to solve those problem and that's because biology as it used to be was not a science that engineers could address very well because in order for engineers really analyze study quantitatively develop models at the bill technologies alter the parts there's a lot of requirements on the science that really biology that satisfy uh... the actual mechanisms a function work understood yes you can see that moving your arm requires certain force in whip where a certain load we really didn't know what was going on down in the proteins and self-interest use of the muscles in the box creek but still you could decide maybe an artificial do this satanic plan you really know the molecular compliments so how the world to be actually manipulate the system to continue to know what the molecules were that are really underlying s thank you could really do the chemistry and biological molecule uh... it's very hard to quantify since if you need to know the parts of the mechanisms how could you get quantitative measurements for them develop models so there's good reason why there never really was a biological engineering until very recently has filed he wasn't a science that was really suited wrench in your analysis synthesis so there for the world biomedical engineering mainly involved all these application props that i've just talked about but that necessarily require biology per se but that's changed good news for you folks is biology those changes in our science that engineers had unfair connect to very well

It's replaced "%uh" with a more appropriate "uh..." and it gets certain words correct that it got wrong in the original, but it also concocted brand-new phrases like "certain force in whip", "self-interested use of the muscles in the box creek", and "an artificial do this satanic plan" for "certain force and would bear", "cells and tissues of the muscles and the bones. Okay?" and "an artificial bone to do this. An implant".

Here's the actual transcript of the video:

Or implants and so forth. And you notice I haven't said anything about biology. Those folks didn't really need to be educated in genetics, biochemistry, molecular biology, cell biology, to solve those problems. And that's because biology, as it used to be, was not a science that engineers could address very well. Because in order for engineers to really analyze, study, quantitatively develop models, and to build technologies, alter the parts, there's a lot of requirements on the science that really biology didn't satisfy. The actual mechanisms of function weren't understood. Yes, you could see that moving your arm required a certain force, and would bear a certain load, but you really didn't know what was going on down in the proteins and cells and tissues of the muscles and the bones. Okay? But still you could design maybe an artificial bone to do this. An implant. You didn't really know the molecular components, so how in the world could you actually manipulate the system, if you didn't even know what the molecules were, that are really underlying this? Okay? You couldn't really do the chemistry on the biological molecules. It's very hard to quantify, since if you didn't even know the parts and the mechanisms, how could you get quantitative measurements for them, develop models? So there's good reason why there never really was a biological engineering until very recently, because biology wasn't a science that was really suited for engineering analysis and engineering synthesis, and so therefore the world of biomedical engineering mainly involved all these application problems that I just talked about, that didn't necessarily require biology per se. But that's changed. Okay? The good news for you folks is biology has changed. It's now a science that engineers can in fact connect to very well.

If you like, go read my original post on the difference between technical accuracy and semantic accuracy. In that post, I determined that, counting only words that the autotranscription got wrong or omitted (not penalizing for extra words added, unlike on steno certification exams), the technical accuracy rate of the autotranscription was 71.24% (213/299 words correct). Two years and much supposed improvement later, the new transcription's technical accuracy rate is... Drum roll, please...

78.59% (235/299 words correct)

Now, I think it's important to point out that this video is essentially ideal in practically every respect for an autocaptioning engine.

* The rate of speech is quite slow. Speech engines tend to fail when the rate gets above 180 WPM or so.

* The speaker has excellent diction. Mumbling and swallowed syllables can wreak havoc with a speech engine.

* The speaker has an American accent. All the most advanced speech engines are calibrated to American accents, because they're all produced by American companies. There are programs that claim to understand various dialects of non-American-accented English (e.g. Scottish, English, Australian), but they're still many generations behind the cutting edge, because they got such a late start in development.

* The speaker is male. Speech engines have a harder time understanding female voices than male ones.

* The speaker is using a high number of fairly long but not excessively uncommon words. Speech engines are better at understanding long words (like "synthesis", "artificial", or "biochemistry") than short ones (like "would" or "weren't"), because they're phonologically more distinct from one another.

* The sound quality is excellent and there is no background noise or music in the video. Humans are able to listen through noise and pick out meaning from cacophony to a degree completely unmatched by any computer software. Even a speech engine that's performing quite well will fall completely to pieces when it's forced to listen through a small amount of static or a quiet instrumental soundtrack.

So if even a video like this can only attain a 78% technical accuracy rating, after two years of high-powered development from the captioning engine produced by one of the most technologically advanced companies in the world... Are you worried that it's going to supplant my 99.67% accuracy rating in another two years? Or ten years? Or 20? And that's just talking about the technical accuracy; I haven't even begun to get into the semantic accuracy. I'll have more to say on this subject in the next installment.