Thursday, December 6, 2012

CART Problem Solving: Speech Recognition Part IV

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Test Nerves
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV

This video is only tangentially relevant to the post; I just found it adorable.

At long last, the final Speech Recognition installment.

CART PROBLEM: Speech recognition is almost always slower and less accurate than stenographic text entry, but there's a strong cultural push to use it, because it's perceived as cheaper and less complicated than hiring a qualified CART provider.

In the previous three posts, I discussed why speech recognition isn't yet capable of providing accurate text when presented with untrained multispeaker audio. I also explained why the common assumption that only a little more development time and processing power stand between us and 100% accuracy is based on a misunderstanding of how language works and how speech recognition engines try to capture it.

Just because a lizard can play that bug-squishing iPhone game, it doesn't follow that upgrading the lizard to a cat will make it a champion at Dance Dance Revolution. A bigger speech corpus, faster computers, and even a neural-network pattern-matching model still don't make up for the essential difference between human and mechanized speech recognition: Humans are able to make use of context and semantic cues; computers are not. Language is full of soundalike words and phrases, and imperfect audio is very much the rule and not the exception in most real-world situations. This means that humans will inevitably have the edge over computers in differentiating ambiguous sound patterns, that improvements in speech recognition technology will follow an asymptotic trajectory, with each new gain requiring vastly greater effort to achieve, and that the final goal of accurate, fully automated transcription will remain nearly impossible, except in controlled settings with a narrow range of speakers and vocabulary.
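To make the context point concrete, here's a toy sketch in Python. All the counts below are invented, and real engines use far larger corpora and richer models; the point is only that two word sequences which sound nearly identical can be separated by nothing except the surrounding words.

```python
# Toy sketch of context-based disambiguation. All counts are invented;
# real engines use far larger corpora and richer models.
BIGRAMS = {
    ("some", "ice"): 50, ("ice", "cream"): 200, ("cream", "please"): 30,
    ("some", "i"): 1, ("i", "scream"): 10, ("scream", "please"): 1,
}

def score(words):
    # Product of bigram counts; unseen pairs get a small smoothing value.
    total = 1.0
    for pair in zip(words, words[1:]):
        total *= BIGRAMS.get(pair, 0.1)
    return total

# "some ice cream please" and "some I scream please" are acoustically
# near-identical; only the language-model score separates them.
hypotheses = [["some", "ice", "cream", "please"],
              ["some", "i", "scream", "please"]]
print(" ".join(max(hypotheses, key=score)))  # some ice cream please
```

A human listener does this effortlessly, and with far more than two words of context; the machine only ever has its counts.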

But of course there's a huge difference between a professional voice writer and an untrained one, and an even greater difference between any kind of respeaking system and a speaker-independent speech transcription program. Despite widespread public perception, voice writing isn't actually any easier to do than CART, and in fact is usually quite a bit harder in most circumstances.

The supposedly short training period is voice writing's major selling point over steno (aside from the cost of equipment), but from what I can tell, it's not actually true. You can train someone to a moderate degree of accuracy very quickly; all they have to do is speak into the microphone slowly and clearly, and it'll get a fair number of words correct. For dictation or offline transcription, this can work well, assuming they have the stamina to speak consistently for long periods of time, because they can speak at a slow pace, stop, go back, and correct errors as they make them. Obviously, the closer a person's voice is to the standard paradigm (male, American, baritone), the better results they'll get. Many people with non-standard voices (such as this deaf female blogger) have a heck of a time getting software to understand them, even speaking as slowly and clearly as they can manage. But even for men with American accents, actual live realtime respeaking at CART levels of accuracy (ideally over 99% correct) is much, much harder than dictation.

* Short words are more difficult for the speech engine to recognize than multisyllabic words are, and are more likely to be ignored or mistranscribed.

* If the voice captioner does mostly direct-echo respeaking, meaning that they don't pronounce common words in nonstandard ways, they have to repeat multisyllabic words using the same number of syllables as in the original audio; if they try to "brief" long words by assigning a voice macro that lets them say the word in one syllable, they run up against the software's difficulty in dealing with monosyllabic words that I mentioned above.

* Because they're mostly saying words in the same amount of time as they were originally spoken (unlike in steno, where a multisyllabic word can be represented by a single split-second stroke), they don't have much "reserve speed" to make corrections if the audio is mistranscribed. They also have to verbally insert punctuation and use macros to differentiate between homonyms, which takes time and can be fatiguing.

* Compensating for the lack of reserve speed by speaking the words more quickly than they were originally spoken can be problematic, because the software is better able to transcribe words spoken with clearly delineated spaces between them, as opposed to words that are all run together.

* This means that if the software makes a mistake and the audio is fairly rapid, the voice captioner is forced to choose between taking time to delete the mistake and then catching up by paraphrasing the speaker, or keeping pace with the speaker and letting the mistake stand.

* The skill of echoing previously spoken words aloud while listening to a steady stream of incoming words can be quite tricky, especially when the audio quality is less than perfect; unlike simultaneous writing and listening, simultaneous speaking and listening can cause cross-channel interference.

This doesn't even go into the potential changes in a person's voice brought about by fatigue, allergies, colds, or minor day-to-day variations, all of which can wreak havoc with even a well-trained voice engine.
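The reserve-speed squeeze described in the list above can be put into rough numbers. Everything here is hypothetical (a 180 wpm lecture, a 190 wpm reliable respeaking ceiling, three seconds per correction), but the shape of the problem holds for any plausible figures:

```python
# Back-of-envelope numbers for "reserve speed" (all figures hypothetical).
audio_wpm = 180        # lecture pace, words per minute
respeak_wpm = 190      # fastest pace the engine still transcribes reliably
seconds_per_fix = 3.0  # time to delete a misrecognition and respeak it

# Echoing one minute of audio takes audio_wpm / respeak_wpm minutes,
# so the slack left over, in seconds per minute of audio, is:
slack_per_min = (1 - audio_wpm / respeak_wpm) * 60
fixes_per_min = slack_per_min / seconds_per_fix

print(f"slack: {slack_per_min:.1f} s/min, "
      f"corrections affordable: {fixes_per_min:.1f}/min")
```

With those numbers, the respeaker banks barely three seconds of slack per minute, enough for about one correction. A steno writer, who can represent a multisyllabic word in a split-second stroke, accumulates reserve speed on nearly every outline instead.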

Low or moderate accuracy offline voice writing = short training period; most people can do it.

Low or moderate accuracy realtime voice writing = somewhat longer training period; machine-compatible voice timbre and accent required.

CART-level accuracy realtime voice writing = extremely long training period; an enormous amount of talent and dedication required.

I want to emphasize again that none of this is meant to denigrate the real skill that well-trained voice writers have developed over their years of training. It's just to point out that while voice writer training seems on the surface to be easier and quicker than steno training, that's very seldom the case in practice, as long as appropriate accuracy standards (99% or better) are adhered to. The problem comes in when the people paying for accommodations, either due to a shortage of qualified steno or voice writers, or due to cost considerations, decide that 95% or lower accuracy is "good enough" and that deaf people should be able to "read through the mistakes".
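It's worth spelling out what those percentages mean in practice. A quick calculation, assuming an illustrative lecture rate of 180 words per minute over a 50-minute class:

```python
# What an accuracy percentage means in wrong or dropped words, assuming a
# representative lecture rate (numbers illustrative, not measured).
words_per_minute = 180
lecture_minutes = 50

for accuracy in (0.99, 0.95):
    errors_per_minute = words_per_minute * (1 - accuracy)
    total = errors_per_minute * lecture_minutes
    print(f"{accuracy:.0%} accuracy: {errors_per_minute:.1f} errors/min, "
          f"about {total:.0f} errors per lecture")
```

At 99%, that is already one or two garbled words per minute; at the "good enough" 95% threshold, it's an error roughly every seven seconds, hundreds per lecture, each one a place where the reader has to stop and guess.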

So let's talk about some other potential competitors to CART services. These fall into two general categories: Offline transcription and text expansion. I think I'll leave text expansion for a future series of posts, since it's a fairly complex subject. Offline transcription is much simpler to address.

I've seen several press releases recently from companies bragging about contracts they've secured with universities, claiming to offer verbatim captioning at rock-bottom prices. The catch is that the captioning isn't live. No university or conference organizer I know of is foolhardy enough to put completely automated captions up on a large screen in front of the entire audience for everyone to see. The mistakes made by automated engines are far too frequent and far too hilarious to get away with. But they will, it seems, let lectures be captured by automated engines, then give the rough transcripts to either in-house editors (mostly graduate students) or employees of the lecture-capture companies, to produce a clean transcript at speeds that are admittedly somewhat better than they used to be, back when making a transcript or synchronized caption file offline usually meant a qwerty typist starting from scratch.

I'm worried that this is starting to be perceived as an appropriate accommodation for students with hearing loss, because there's a crucial piece missing from the equation: Realtime access. Imagine a lecture hall filled with 250 students at a well-regarded American private university, sitting with laptops and notebooks and audio recorders, facing the PowerPoint screen, ready to learn. It's Monday morning. In walks the professor, who pulls up her slideshow and begins the lecture.

PROFESSOR: Tanulmányait a kolozsvári zenekonzervatóriumban, majd a budapesti Zeneakadémián végezte, Farkas Ferenc, Bárdos Lajos, Járdányi Pál és Veress Sándor tanítványaként. Tanulmányai elvégzése után népzenekutatással foglalkozott. Romániában ösztöndíjasként több száz erdélyi magyar népdalt gyűjtött.

After a few seconds, the students start looking at each other in confusion. They don't speak this language. What's going on? The professor continues speaking in this way for 50 minutes, then steps down from the podium and says, "The English translation of the last hour will be available within 48 hours. Please remember that there is a test on Wednesday."

These students are paying $50,000 or $60,000 a year to attend this school. They're outraged. Not only do they have less than 24 hours to study the transcript before the test, but they were unable to ask questions or to see the slides juxtaposed with the lecture material. Plus they just had to sit there for 50 minutes, bored and confused, without the slightest idea of what was going on. It wouldn't stand. The professor would be forced to conduct future lectures in English rather than Hungarian, or risk losing her job.

This is the state of affairs for deaf and hard of hearing students offered transcripts rather than live captioning. It deprives them of an equal opportunity to learn alongside their peers, and it forces them to waste hours of their lives in classes that they can't hear and therefore can't benefit from. I'm waiting for the day when the first student accommodated in this way sues their school for violating the Americans with Disabilities Act; at that point, the fast-turnaround transcript and captioning companies are going to be in a good deal of trouble.

There is the possibility of training realtime editors who might be able to keep up with the pace of mistakes and correct each error a few seconds after it's made, before the realtime is delivered to the student, but that adds yet another person to the workflow, reducing the savings the university was hoping for when it laid off its CART providers.
In some classes, a relatively untrained editor with a qwerty keyboard will be able to zap the errors and clean up the transcript in realtime, but in others -- where the professor doesn't speak Standard Male American (true for a significant and increasing number of professors in the US college system), or there's too much technical jargon, or the noise of the ventilation system interferes with the microphone, or any of a hundred other reasons -- the rate of errors made by the speech engine will outpace the corrections any human editor can make in realtime.

So what lies ahead? Yes, speech recognition engines will continue to improve. Voice writer training times might decrease somewhat, though fully accurate automated systems will stay out of reach. People don't realize that speech is an analogue system, like handwriting. Computer recognition of the printed word has improved dramatically in the past few decades, and even though transcripts produced via OCR still need to be edited, it's become a very useful technology. Recognition of handwriting has lagged far behind, because the whorls and squiggles of each handwritten letter vary drastically from individual to individual and from day to day. There's too much noise and too little unambiguous signal, apart from the meaning of the words themselves, which is what lets us decipher in context whether the grocery list reads "buy toothpaste" or "butter the pasta". Human speech is much more like handwriting than it is like print. Steno allows us to produce clear digital signals that can be interpreted and translated with perfect accuracy by any computer with the appropriate lexicon. Speech is an inextricably analogue input system; there will always be fuzz and flutter.
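The digital-versus-analogue contrast can be sketched in a few lines of Python. The chords below are invented for illustration and don't follow any real steno theory; the point is that a steno stroke is a discrete symbol, so translation is exact dictionary lookup rather than acoustic guesswork:

```python
# Toy contrast between digital and analogue input. The steno chords below
# are invented for illustration and don't follow any real steno theory.
STENO_LEXICON = {
    "STKPW": "is",
    "TKPWAOD": "good",
    "TPHUF": "enough",
}

def translate(strokes):
    # A chord either matches the lexicon exactly or it doesn't: an unknown
    # stroke is flagged loudly instead of being fuzzily guessed at.
    return [STENO_LEXICON.get(s, f"<untranslated: {s}>") for s in strokes]

print(" ".join(translate(["STKPW", "TKPWAOD", "TPHUF"])))  # is good enough
```

A clean digital signal translates perfectly every time; the worst case is a visible untranslate, never a confidently wrong soundalike.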


  1. I wish I could send my experience with my James Joyce group at Finn and Porter the other day. I was born in Norway, but have lived in the USA for 52 years. I have a light accent, so they say. My friend had an iPhone 5S (or 55); anyway, I was fascinated by it and all that it could do. So I asked to borrow it. My first question was: "Give a review of the DVD of 'As I Lay Dying'." Answer: "I don't understand." Well, I thought, maybe that was too complicated (though my friend asked the same question and got an answer). So I asked: "Give me the name of a best-seller book?" I did not even have a chance to say the first word before the iPhone said, "I don't understand." It was hilarious. The iPhone answered everybody else's questions, but at just a touch from me it immediately said, "I don't understand." I kept trying, but after a while we were all laughing so much, my friend just had to turn it off.
    Karin Knight

  2. I love steno! I am actually thinking that with all the difficulty these "speech recognition" companies are having trying to make these systems work, it actually shines a brighter light on what exactly it is that we stenos are doing... not just putting words on paper (or screens)... there is actually a lot of quick thinking and a wide knowledge base (worldly knowledge) that goes along with creating live text!! Stenos are irreplaceable! Stenos are awesome!!

  3. I just read an article written by a voice writer titled "The Cognitive Challenge of Voice Writing." It was so spot-on. It describes the same principles involved in steno, and he does mention his admiration for stenos.
    I wish more people would realize just what it takes. His article (I forget his name right now) and your articles go a long way in helping more people "get it." Thanks.