As I mentioned over on the Plover Blog, I recently went to check out the vendor exhibit hall of SpeechTek 2010. It was quite interesting and ultimately very encouraging to get a firsthand look at what current top-of-the-line commercial speech recognition technology can and cannot do. But before I get into the details, let me back up.
Two weeks ago I had lunch with an old friend from high school, Matthew Maslanka of Maslanka Music Prep. He's a fellow freelancer who moved to New York City this February to start a music engraving business, and in only a few months he's already enjoying overwhelming success. Though the parallels might not be immediately obvious, music engraving and CART actually have a lot in common. When a composer writes down music, they don't always think about the musician who later has to play that music, often with little or no opportunity to review it beforehand. Music notation programs can make the job quicker, but they're often not able to apply the complicated and somewhat subjective rules that make sheet music legible. The engraver's job is to edit and tweak the computerized notation file so that the music is crystal clear, without squished-together phrases, rhythmic ambiguities, or impractical page breaks. As Matthew says,
Rehearsals and recording sessions can cost hundreds of dollars per minute. Confusion about any aspect of the printed page makes performers concentrate on everything but making music. Composers, conductors and performers have better things to do than worry about which side of the time signature the forward-repeat bar goes. When you choose a professional, you choose to make the best music possible in the most efficient way.
So, like me with my steno machine, Matthew uses a computer to translate a composer's thoughts into perfectly formatted print in a fraction of the time it used to take for an engraver to literally engrave copper plates for the printing press. Without computers, our jobs would be impossible. Likewise, computers are unable to replace us entirely. They're simply unable to internalize and understand the countless nuances of spoken English in my case and of musical notation in his. As with voice recognition software, the default settings in musical notation software will produce something that a human editor can turn into a final product, but on their own they lack the accuracy and finesse required to be truly useful. I asked him whether anyone had tried to use the software out of the box without going through a professional engraver, and he said that some churches distributed their worship music that way, since it was fairly simple and they didn't have the budget for much else. But virtually every organization employing professional musicians was happy to pay his fees, because it saved them money to have music that was well laid out and didn't cause confusion.
Unfortunately, that's where CART and engraving diverge. Because consumers of CART often aren't in a position to pay for it themselves, they're usually obliged to get funding from their school or workplace, and because the people paying for the service aren't the ones who use it, they tend to cut costs however they can, even if it means lowering standards as well. I'm going to revisit this topic many times over the course of this blog, but this is the crux of the current state of CART and captioning, and it's not going to be easy to resolve. Companies that hire professional musicians have a financial incentive to make sure that their employees get the best transcription available. Companies that hire Deaf or hard of hearing employees and colleges that accept Deaf/HoH students often think that it's in their financial self-interest to get away with the lowest-cost accommodation, even if that means that their employee or student isn't getting truly equal access. Why do they think this way? And how can CART providers, ASL interpreters, and consumers demonstrate to them that they're wrong, that providing equal rather than substandard access is the best way to maintain efficiency and secure the success of the entire enterprise? Appealing to the letter of the law is rarely as effective as appealing to institutional self-interest. We just have to figure out the best way to make our point.
Anyway, back to SpeechTek. There were 43 exhibitors in the hall, and only one, Autonomy, offered natural language transcription. Another one, PhoneTag, offered a combination of speech recognition transcription with human editing. A third, LumenVox, said that a voice recognition dictation program along the lines of Dragon NaturallySpeaking, the current market leader, was "coming soon." All the other companies were offering software that either recognized voice commands from a presupplied list or sifted audio recordings of phone calls for words or phrases of, ideally, six to eight phonemes apiece, such as "I'm not happy" or "speak with your supervisor." I had the pleasure of speaking with representatives from a number of these companies, and all of them assured me that speaker-independent natural language transcription was not even close to being on the horizon.
The closest contender, Autonomy, whose software is reputedly used by top governmental agencies, offered a demo of its software transcribing BBC news in near-realtime (a delay of about 20 minutes, if I remember correctly). Sadly, this demo isn't available to the general public, but I watched it for quite a while, first with the sound off and then with it on, and while it was vastly better than YouTube's autocaptioning, it was still considerably below an acceptable accuracy threshold. Even with broadcast quality audio and the crisp, clear diction that BBC commentators are famous for, the profusion of mistranscriptions and semantic red herrings severely compromised the quality of the caption feed. For example, at one point the captioning randomly inserted the word "Stalin" into an otherwise innocuous sentence about a London schoolteacher. I was listening to the audio, and I'm quite sure that nothing even resembling "Stalin" had been said, but it's just not possible to predict what a computer is going to make out of any given audio snippet. This is the "black box" problem I described in Part Six of What is Steno Good For, over on the Plover Blog. When speech recognition works, it works, but when it fails, it has no ability to recognize or correct that failure, because the algorithms used to translate audio into text are not transparent.
I asked a representative from one of the phoneme-sifting companies how his software worked. He said that if you wanted to find a word or phrase in a large amount of audio, you could either dial the accuracy down, so that it gave you a number of false positives to sift through, or dial it way up, which meant that the program was less likely to waste your time with false hits but far more likely to miss the phrase you were looking for. Setting it somewhere in the middle split the difference, but that meant you potentially had to deal with both false positives and missed hits. I asked him: If you had a person who pronounced a word in a particularly distinctive way, could you search for that voice pattern specifically, rather than the normalized, averaged voice pattern of their Standard American English corpus? He said that they had plugins for several different dialects, but that individual voice patterns were not accessible in that way, so you couldn't just make up custom voice searches for particular individuals.
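The tradeoff he described can be sketched in a few lines of Python. To be clear, this is a toy illustration with made-up confidence scores, not how any vendor's software actually works: imagine the engine has scored each candidate "hit" in the audio, and a single threshold decides which candidates get reported.

```python
# Toy sketch of the false-positive vs. missed-hit tradeoff.
# Each candidate is (confidence_score, was_really_said) -- invented data.
candidates = [
    (0.95, True), (0.90, True), (0.80, False), (0.75, True),
    (0.60, False), (0.55, True), (0.40, False), (0.30, False),
]

def search(threshold):
    """Report (false_positives, misses) at a given confidence threshold."""
    false_positives = sum(1 for score, real in candidates
                          if score >= threshold and not real)
    misses = sum(1 for score, real in candidates
                 if score < threshold and real)
    return false_positives, misses

# Accuracy dialed down: every real hit found, but junk to sift through.
print(search(0.35))  # (3, 0)
# Accuracy dialed way up: no junk, but real hits are lost.
print(search(0.85))  # (0, 2)
# Somewhere in the middle: a bit of both.
print(search(0.65))  # (1, 1)
```

The point is that no threshold setting eliminates both kinds of error at once; moving it just trades one for the other, which is exactly what the representative was describing.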
The most excited representative I talked to demonstrated how he had taken the software underpinning his company's automated voice recognition engine and made it work with text messages rather than phone calls. Customers were thrilled, and accuracy increased across the board. That shouldn't be surprising, and given that nearly all of my peers prefer transacting business over the web or via text message rather than speaking either to a customer service representative or a phone robot, I'm interested to see where the voice recognition industry will be in a few years, since, going by SpeechTek at least, the majority of applications seemed to be focused on handling commercial phone interactions.
What is unquestionable is that realtime transcription for the Deaf and hard of hearing in a natural language setting without a human intermediary isn't anywhere close to viability. Voice writing using dictation software is a different matter altogether, and I'll get to that in future blog posts. But the big bugbear of stenography, that human transcription will soon be entirely replaced by software, can be conclusively put aside. The question remains whether highly trained and well-paid CART providers will be able to hold out against a general push towards lower-cost voice writers with less training, and that goes back to the discussion of standards set by direct consumers versus standards set by the ones who are obliged to pay for something they don't use themselves. I've got a lot more to say on the subject, but I'm interested to hear your thoughts. CART providers: Do you wake up from nightmares of speech recognition stealing your livelihood? CART consumers: Have you ever worked with a respeaker, CapTel/CaptionMic operator, or other voice recognition system? How do you feel about the current state of voice writing compared to CART?