I might return to the CART Problem Series in the future, but I think the 18 posts that I've made on the subject will do for now. At least for a little while, I'm going to go back to making posts on various and sundry subjects as they occur to me, rather than just framing them as problems and solutions.
I mentioned in my Communicating Sans Steno post that my dad has had significant hearing loss for as long as I can remember. Since he was largely in denial about it for most of that time, though, I didn't learn much about how hearing loss worked or what it was like to deal with in daily life, and I'm ashamed to admit that I've been staggeringly insensitive to at least two hard of hearing people I've known over the years.
The first was when I was a teenager. Her parents were friends with my parents, so she and her brother used to come over to my house when we had parties, and sometimes I'd go to theirs. We went to different schools, so we only saw each other a few times a year, but we always had a good time when we got together. I didn't notice that she wore hearing aids until several years after we first met, when she mentioned that she'd won an essay scholarship for teenagers with hearing loss. Like the ignorant blunderer I was, I said, "Wait, you have hearing loss?" For the first time, I noticed the aids. "But you wear hearing aids." "Yes," she replied patiently. "So... why do you qualify for a scholarship if your hearing aids have already fixed the problem?" Like so many people, I'd assumed that if my eyeglasses could correct my severe myopia to normal vision, then hearing aids could do the same thing for anything short of total deafness. I had no idea until almost 20 years later that amplification often doesn't improve clarity, that some frequencies can't be brought back by amplification at all because the cochlear hair cells that detect them are permanently gone, and that hearing is an extremely complex mechanism with no easy or complete fix when any of its components malfunction. My misrefracting cornea could be completely compensated for by a piece of light-bending plastic. Even with hearing aids, my friend's hearing loss remained something she needed to reckon with.
I'd never noticed her misunderstanding me or asking me to repeat myself when we talked (the way my dad often did), and I didn't realize that the casual one-on-one conversations we had at parties were totally unlike her situation in the classroom, where she was learning new material, sat several feet away from the teacher (losing any ability to lipread, especially since the teacher faced the board most of the time), and was forced to work twice as hard as her classmates to get the same amount of information through her ears and into her brain. The fact that my friend had managed to do this all her life, getting excellent grades and becoming an extremely literate and eloquent writer, totally blew past me. I took it for granted; instead of congratulating my friend on her essay, I was rude and dismissive. I haven't seen my old friend since high school, but if I ever run into her again, I'll apologize and explain that I know a lot more now than I did then -- not that that's any excuse. If I had actually asked her to tell me more about the scholarship instead of assuming that it made no sense, I would have learned something that day, instead of having to wait 20 years to realize how much of a jerk I'd been.
The second incident is even more problematic, because I was in a position of authority. At my college, all sophomores are required to take a year of music theory, even though their degree (there's only one on offer) is in Liberal Arts. Music classes are led by professional instructors, but there are also weekly practicum classes, where students try out what they've learned in small four-to-five-person groups. Students with musical experience are chosen to lead those groups as work-study assignments, and because I'd played in the pit orchestra of a summer repertory theater, I got to be one of them. My job involved drilling the students in singing simple multipart songs and rounds, helping them to analyze counterpoint examples discussed in class, and answering any questions they had about the stuff they were studying. The emphasis was on getting an intellectual understanding of the music rather than on becoming accomplished performers, so it wasn't a problem that a few students in each practicum were tone deaf. Most of them just hadn't been exposed to much formal music training, and once I gave them a few exercises, their pitch discrimination and singing tended to improve quite a bit.
There was one student, though, who found both the music class and the practicum intensely frustrating. I noticed his hearing aids right away, because he'd decorated the earmolds in bright colors. He was forthcoming about his hearing loss, and explained that he got very little out of all the singing, analysis, and call-and-response pitch practice, because he couldn't hear any of it accurately enough to reproduce it. Again, I assumed that the hearing aids should have solved the problem, and didn't understand what his issue was. He wasn't being graded on his accuracy in singing, and he wouldn't be penalized if he wasn't able to appreciate the aesthetic nuances of the songs. All he had to do was understand the mathematics of the music on the page, so that he could speak about it in class. The singing exercises were just intended to help build first-hand experience with hearing and repeating music in realtime. I figured that his hearing loss put him in the category of the "tone deaf" students, and treated him accordingly. I didn't realize that his problem, unlike theirs, wasn't a matter of learning to distinguish the notes on an intellectual level, and that no amount of practice was going to change it. He couldn't hear the difference between pitches no matter how many times they were repeated, so he felt like he was being forced to bang his head against a wall every week in practicum. When he expressed his frustration to me, I thought he was being oversensitive, and just reassured him that it wouldn't affect his grade even if he didn't improve by the end of the semester. I didn't understand the emotional consequences of being asked, week after week, to do something in front of your peers that you're physically unable to do, and failing every time. He eventually wound up transferring to another college, and I'm afraid that my inability to understand what he was telling me played into that decision.
Like any good essay writer, I've Googlestalked both of these people as research for this post, and today they're both extremely successful and well-respected in their fields. Obviously my ignorance didn't stop them from doing what they wanted to do. But when you add my ignorance to the ignorance of everyone else they had to deal with, how much more exhausting, frustrating, annoying, infuriating did it make their educational experiences, not to mention other parts of their lives? If I hadn't gone into CART, I never would have realized the mistakes I'd made in trusting my own assumptions instead of listening to their experiences. Now I do, and I'm mortified when I think of the way I behaved. There's no easy solution to this problem. One out of every seven people in this country has some degree of hearing loss, and yet so few people actually understand how it works. It'll take a lot to educate all 312 million people about the 45 million who are Deaf, deafened, or hard of hearing, but it badly needs to be done.
Thursday, December 6, 2012
CART Problem Solving: Speech Recognition Part IV
CART Problem Solving Series
Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV
This video is only tangentially relevant to the post; I just found it adorable.
At long last, the final Speech Recognition installment.
CART PROBLEM: Speech recognition is almost always slower and less accurate than stenographic text entry, but there's a strong cultural push to use it, because it's perceived as cheaper and less complicated than hiring a qualified CART provider.
In the previous three posts, I discussed why speech recognition isn't yet capable of providing accurate text when presented with untrained multispeaker audio. I also talked about why the common assumption that a little more development time and processing power will get us to 100% accuracy is based on a misunderstanding of how language works and how speech recognition engines try to capture it.
Just because a lizard can play that bug-squishing iPhone game, it doesn't follow that upgrading the lizard to a cat will make it a champion at Dance Dance Revolution. A bigger speech corpus, faster computers, and even a neural-network pattern-matching model still don't make up for the essential difference between human and mechanized speech recognition: Humans are able to make use of context and semantic cues; computers are not. Language is full of soundalike words and phrases, and imperfect audio is very much the rule and not the exception in most real-world situations. This means that humans will inevitably have the edge over computers in differentiating ambiguous sound patterns, and that improvements in speech recognition technology will follow an asymptotic trajectory: each new gain will require vastly greater effort to achieve, and fully accurate independent transcription will remain out of reach except in controlled settings, with a narrow range of speakers and vocabulary.
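To make the soundalike problem concrete, here's a deliberately tiny sketch, not drawn from any real engine: two transcription candidates that fit the same audio about equally well, and a toy stand-in for the contextual knowledge a human listener applies without effort. The co-occurrence table and phrases are invented for illustration.

```python
# Toy illustration (not a real speech engine): two transcription candidates
# that fit the same audio about equally well, disambiguated only by context.
# The "language model" here is just a hand-made co-occurrence table -- a
# stand-in for the semantic knowledge a human listener applies effortlessly.

context_counts = {
    ("beach", "sand"): 50, ("beach", "waves"): 40,
    ("speech", "software"): 45, ("speech", "accuracy"): 30,
}

def context_score(candidate, context_words):
    """Score a candidate transcription by how often its words co-occur
    with the surrounding context in the toy table above."""
    score = 0
    for word in candidate.split():
        for ctx in context_words:
            score += context_counts.get((word, ctx), 0)
            score += context_counts.get((ctx, word), 0)
    return score

# Both candidates sound nearly identical; only context can separate them.
candidates = ["recognize speech", "wreck a nice beach"]

for context in (["software", "accuracy"], ["sand", "waves"]):
    best = max(candidates, key=lambda c: context_score(c, context))
    print(context, "->", best)
```

Swap the surrounding context and the "best" transcription flips, even though the audio never changed; that, in miniature, is the judgment call a human makes constantly and a speech engine can only approximate.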
But of course there's a huge difference between a professional voice writer and an untrained one, and an even greater difference between any kind of respeaking system and a speaker-independent speech transcription program. Despite widespread public perception, voice writing isn't actually any easier to do than CART; in most circumstances it's quite a bit harder.
The supposedly short training period is voice writing's major selling point over steno (aside from the cost of equipment), but from what I can tell, that claim doesn't hold up. You can train someone to a moderate degree of accuracy very quickly; all they have to do is speak into the microphone slowly and clearly, and it'll get a fair number of words right. For dictation or offline transcription, this can work well, assuming they have the stamina to speak consistently for long periods of time, because they can speak at a slow pace, stop, go back, and correct errors as they make them. Obviously, the closer a person's voice is to the standard paradigm (male, American, baritone), the better results they'll get. Many people with non-standard voices (such as this deaf female blogger) have a heck of a time getting software to understand them, even speaking as slowly and clearly as they can manage. But even for men with American accents, actual live realtime respeaking at CART levels of accuracy (ideally over 99% correct) is much, much harder than dictation.
* Short words are more difficult for the speech engine to recognize than multisyllabic words are, and are more likely to be ignored or mistranscribed.
* If the voice captioner does mostly direct-echo respeaking, meaning that they don't pronounce common words in nonstandard ways, they have to repeat multisyllabic words using the same number of syllables as in the original audio; if they try to "brief" long words by assigning a voice macro that lets them say the word in one syllable, they run up against the software's difficulty in dealing with monosyllabic words that I mentioned above.
* Because they're mostly saying words in the same amount of time as they were originally spoken (unlike in steno, where a multisyllabic word can be represented by a single split-second stroke), they don't have much "reserve speed" to make corrections if the audio is mistranscribed. They also have to verbally insert punctuation and use macros to differentiate between homonyms, which takes time and can be fatiguing. (A rough sketch of this reserve-speed arithmetic follows the list.)
* Compensating for the lack of reserve speed by speaking the words more quickly than they were originally spoken can be problematic, because the software is better able to transcribe words spoken with clearly delineated spaces between them, as opposed to words that are all run together.
* This means that if the software makes a mistake and the audio is fairly rapid, the voice captioner is forced to choose between taking time to delete the mistake and then catching up by paraphrasing the speaker, or keeping up with the speaker and letting the mistake stand.
* The skill of echoing previously spoken words aloud while listening to a steady stream of incoming words can be quite tricky, especially when the audio quality is less than perfect; unlike simultaneous writing and listening, simultaneous speaking and listening can cause cross-channel interference.
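To give a feel for how tight that reserve-speed margin is, here's a minimal back-of-the-envelope sketch. All of the rates are assumed round numbers for illustration, not measurements of any particular captioner or writer.

```python
# Back-of-the-envelope look at the "reserve speed" squeeze described above.
# Every figure here is an assumed round number for illustration only.

audio_wpm = 180          # assumed lecture speaking rate, words per minute

# Direct-echo respeaking: the captioner voices roughly the same syllables as
# the original speaker, so most of each minute is already spoken for.
respeak_max_wpm = 200    # assumed ceiling for clear, machine-friendly dictation
respeak_slack = respeak_max_wpm - audio_wpm

# Steno: long words often collapse into a single brief stroke, so the same
# audio consumes far less of the writer's per-minute capacity.
steno_max_wpm = 260      # assumed certified-realtime writing speed
steno_slack = steno_max_wpm - audio_wpm

print(f"Respeaker's slack for corrections: ~{respeak_slack} wpm")
print(f"Steno writer's slack for corrections: ~{steno_slack} wpm")
# With only ~20 wpm of slack, re-dictating even a short correction can eat
# several seconds, which is exactly the bind described in the list above.
```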
This doesn't even go into the potential changes in a person's voice brought about by fatigue, allergies, colds, or minor day-to-day variations, all of which can wreak havoc with even a well-trained voice engine.
Low or moderate accuracy offline voice writing = short training period; most people can do it.
Low or moderate accuracy realtime voice writing = somewhat longer training period; machine-compatible voice timbre and accent required.
CART-level accuracy realtime voice writing = extremely long training period; an enormous amount of talent and dedication required.
I want to emphasize again that none of this is meant to denigrate the real skill that well-trained voice writers have developed over their years of training. It's just to point out that while voice writer training seems on the surface to be easier and quicker than steno training, that's very seldom the case in practice, as long as appropriate accuracy standards (99% or better) are adhered to. The problem comes in when the people paying for accommodations, either due to a shortage of qualified steno or voice writers, or due to cost considerations, decide that 95% or lower accuracy is "good enough" and that deaf people should be able to "read through the mistakes".
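To put those percentages in concrete terms, here's a quick calculation; the classroom speaking rate and lecture length are assumed round figures, not data from any particular study.

```python
# What "95% is good enough" means in practice versus a 99%+ CART standard.
# The speaking rate and lecture length are assumed round figures.

speaking_rate_wpm = 150   # assumed classroom speaking rate
lecture_minutes = 50

for accuracy in (0.99, 0.95):
    errors_per_minute = speaking_rate_wpm * (1 - accuracy)
    errors_per_lecture = errors_per_minute * lecture_minutes
    print(f"{accuracy:.0%} accuracy: ~{errors_per_minute:.1f} errors per minute, "
          f"~{errors_per_lecture:.0f} per {lecture_minutes}-minute lecture")
```

Under these assumptions, the difference between 99% and 95% is the difference between roughly 75 and 375 errors in a single lecture, which is a lot of "reading through the mistakes" to ask of anyone.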
So let's talk about some other potential competitors to CART services. These fall into two general categories: Offline transcription and text expansion. I think I'll leave text expansion for a future series of posts, since it's a fairly complex subject. Offline transcription is much simpler to address.
I've seen several press releases recently from companies bragging about contracts they've secured with universities and claiming to offer verbatim captioning at rock-bottom prices. The catch is that the captioning isn't live. No university or conference organizer I know of is foolhardy enough to put completely automated captions up on a large screen in front of the entire audience for everyone to see. The mistakes made by automated engines are far too frequent and hilarious to get away with. But they will, it seems, let lectures be captured by automated engines, then give the rough transcripts to either in-house editors (mostly graduate students) or employees of the lecture-capture companies, to produce a clean transcript at speeds that are admittedly somewhat better than they used to be, back when making a transcript or synchronized caption file offline usually meant a qwerty typist starting from scratch.
I'm worried that this is starting to be perceived as an appropriate accommodation for students with hearing loss, because there's a crucial piece missing from the equation: Realtime access. Imagine a lecture hall filled with 250 students at a well-regarded American private university, sitting with laptops and notebooks and audio recorders, facing the PowerPoint screen, ready to learn. It's Monday morning. In walks the professor, who pulls up her slideshow and begins the lecture.
PROFESSOR: Tanulmányait a kolozsvári zenekonzervatóriumban, majd a budapesti Zeneakadémián végezte, Farkas Ferenc, Bárdos Lajos, Járdányi Pál és Veress Sándor tanítványaként. Tanulmányai elvégzése után népzenekutatással foglalkozott. Romániában ösztöndíjasként több száz erdélyi magyar népdalt gyűjtött.
After a few seconds, the students start looking at each other in confusion. They don't speak this language. What's going on? The professor continues speaking in this way for 50 minutes, then steps down from the podium and says, "The English translation of the last hour will be available within 48 hours. Please remember that there is a test on Wednesday."
These students are paying $50,000 or $60,000 a year to attend this school. They're outraged. Not only do they have less than 24 hours to study the transcript before the test, but they were unable to ask questions or to see the slides juxtaposed with the lecture material. Plus they just had to sit there for 50 minutes, bored and confused, without the slightest idea of what was going on. It wouldn't stand. The professor would be forced to conduct future lectures in English rather than Hungarian, or risk losing her job. This is the state of affairs for deaf and hard of hearing students offered transcripts rather than live captioning. It deprives them of an equal opportunity to learn alongside their peers, and it forces them to waste hours of their lives in classes that they can't hear and therefore can't benefit from. I'm waiting for the day when the first student accommodated in this way sues their school for violating the Americans with Disabilities Act, and at that point the fast-turnaround transcript and captioning companies are going to be in a good deal of trouble.

There is the possibility of training realtime editors who might be able to keep up with the pace of mistakes and correct each error a few seconds after it's made, before the realtime is delivered to the student, but that adds yet another person to the workflow, reducing the savings the university was hoping for when it laid off its CART providers. In some classes, a relatively untrained editor with a qwerty keyboard will be able to zap the errors and clean up the transcript in realtime, but in others -- where the professor doesn't speak Standard Male American (true for a significant and increasing number of professors in the US college system), or there's too much technical jargon, or the noise of the ventilation system interferes with the microphone, or any of a hundred other reasons -- the rate of errors made by the speech engine will outpace the corrections any human editor can make in realtime.
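To put rough numbers on how an editor gets outpaced, here's a small sketch; the speaking rate, engine accuracy, and per-correction time are all assumptions chosen for illustration.

```python
# Rough error-budget sketch of realtime editing behind a speech engine.
# All figures are assumptions for illustration, not measurements.

audio_wpm = 160             # assumed rate of a fast, jargon-heavy lecture
engine_accuracy = 0.90      # assumed engine accuracy on difficult audio
seconds_per_fix = 5         # assumed time to spot, select, and retype one error

errors_per_minute = audio_wpm * (1 - engine_accuracy)
fixes_per_minute = 60 / seconds_per_fix

print(f"Errors generated per minute: {errors_per_minute:.0f}")
print(f"Errors one editor can fix per minute: {fixes_per_minute:.0f}")
print(f"Uncorrected errors piling up each minute: "
      f"{errors_per_minute - fixes_per_minute:.0f}")
```

With these assumed numbers the engine produces 16 errors a minute while the editor can clear only 12, so the backlog grows by the minute, and that's before the editor stops to figure out what the engine actually meant to write.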
So what lies ahead in the future? Yes, speech recognition engines will continue to improve. Voice writer training times might decrease somewhat, though fully accurate automated systems will stay out of reach. People don't realize that speech is an analogue system, like handwriting. Computer recognition of the printed word has improved dramatically in the past few decades, and even though transcripts produced via OCR still need to be edited, it's become a very useful technology. Recognition of handwriting has lagged far behind, because the whorls and squiggles of each handwritten letter vary drastically from individual to individual and from day to day. There's too much noise and too little unambiguous signal, apart from the meaning of the words themselves, which is what lets us decipher in context whether the grocery list reads "buy toothpaste" or "butter the pasta". Human speech is much more like handwriting than it is like print. Steno allows us to produce clear digital signals that can be interpreted and translated with perfect accuracy by any computer with the appropriate lexicon. Speech is an inextricably analogue input system; there will always be fuzz and flutter.
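As a concrete illustration of that last point, here's a minimal sketch of why steno translation is a digital rather than analogue process: each stroke is a discrete chord of keys, and translation is an exact dictionary lookup, not a probabilistic guess. The tiny dictionary below is loosely modeled on the JSON stroke-to-text dictionaries used by open-source steno software such as Plover, but its entries are only illustrative.

```python
# Minimal sketch of deterministic steno translation: a stroke is a discrete
# chord of keys, so lookup either succeeds exactly or fails visibly; there's
# no acoustic guesswork. The entries below are illustrative only, loosely in
# the style of the JSON dictionaries used by open-source engines like Plover.

steno_dictionary = {
    "KAT": "cat",
    "TKOG": "dog",
    "-T": "the",
}

def translate(outlines):
    """Translate a sequence of stroke outlines by exact lookup.
    An unknown outline stays visibly untranslated; it is never replaced
    by a best guess the way ambiguous audio is in speech recognition."""
    words = [steno_dictionary.get(o, f"<{o}>") for o in outlines]
    return " ".join(words)

print(translate(["-T", "KAT"]))          # -> "the cat"
print(translate(["-T", "TKOG", "XYZ"]))  # -> "the dog <XYZ>"
```

The stroke either matches an entry or it doesn't; there's no fuzz or flutter between the writer's hands and the text on the screen.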