Saturday, January 12, 2013

Akrasia

This isn't a post about hearing loss or steno. It might be relevant if you're a freelancer, though, or possibly even if you're not. To be perfectly honest, this post is a meta-post. See, I've got a goal in Beeminder that makes me blog once a week (on either the StenoKnight Blog or the Plover Blog), and I'm currently 4 hours from derailment. When I finish this post, it'll reset and I won't have to blog again for another 7 days, though I'm going to try not to let it skirt so close to the edge next time.

What is Beeminder? So glad you asked, since over the past year or so I've turned into a frothing convert. It's an online self-binding tool. What's self-binding? It's a technique to help guard against akrasia. What's akrasia? Ah, now we're talking. So I'm firmly of the school that believes happiness is not a state of mind, but a habit. Doing things consistently, improving at them incrementally, and eventually realizing that you've become pretty good at them and that their presence in your life is actually really satisfying. Akrasia is the force working against all of that. It's what makes you break your resolutions, dodge your commitments, succumb to entropy, and spend your entire life on the couch eating Goldfish crackers and playing video games. Self-binding is a really effective way to make you realize that you're not getting as much done as you think you are, and to keep you honest as you slowly and gradually lay the groundwork for the great things you want to achieve in the future.

I've found that having a system to nudge me when I'm not doing anything to help with my long-term goals is really helpful. Similarly, having a graph to show my history of doing the things I want to do over the long term makes me feel like I'm actually accomplishing something, and gives me more incentive to keep it up even when my initial enthusiasm for it has waned somewhat. I've found Beeminder to be a really good way to combine both the nudge and the graph. It's free to make commitments, but they get their money from people who break their commitments and then want to try again. Currently I've got eight open goals:

* Blog more often
* Eat more fruits and vegetables
* Go to the gym
* Learn Python with Codecademy
* Answer my email on a regular basis
* Eat less junk food
* Practice for the RMR Q&A (Yeah, just learned that I didn't pass my last attempt, sigh. Next time for sure, though, if I can keep those blasted nerves in check. I've definitely got the speed.)
* Sort through my email every day

I'll probably add another goal pretty soon so I can have a nice even grid of nine, but I'm not sure what it'll be yet. (Suggestions welcome!) For now, I'm managing them all pretty well. I haven't had to pay Beeminder money yet. The Gmail Zero goal is especially nice because I don't even have to enter my data manually; it just counts all the read messages that are in my inbox several times a day, and if I don't clean it out completely at least once a day, I lose. My Reply Zero goal, on the other hand, is a bit more flexible; I send it to my reply folder, and I have to get it down to zero at least once a week.
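For the tinkerers: Beeminder exposes an API for exactly this kind of automatic data entry, so you can build your own autodata pipeline for goals it doesn't track natively. Here's a rough sketch in Python of a do-it-yourself Gmail Zero: count the inbox over IMAP, then post that number as a Beeminder datapoint. The endpoint path follows Beeminder's published API; all the account names, goal slugs, and credentials are placeholders.

```python
import imaplib
import urllib.parse
import urllib.request

def inbox_count(user, password):
    """Count the messages currently sitting in the Gmail inbox, via IMAP."""
    conn = imaplib.IMAP4_SSL("imap.gmail.com")
    conn.login(user, password)
    status, data = conn.select("INBOX", readonly=True)  # data[0] is the message count
    conn.logout()
    return int(data[0])

def beeminder_datapoint(username, goal, auth_token, value):
    """Build the URL and form payload for posting one Beeminder datapoint."""
    url = ("https://www.beeminder.com/api/v1/users/"
           f"{username}/goals/{goal}/datapoints.json")
    payload = {"auth_token": auth_token,
               "value": value,
               "comment": "inbox count, posted automatically"}
    return url, payload

# Usage -- real credentials required, so this part stays commented out:
# count = inbox_count("me@gmail.com", "my-app-password")
# url, payload = beeminder_datapoint("me", "gmailzero", "MY_TOKEN", count)
# urllib.request.urlopen(url, urllib.parse.urlencode(payload).encode())
```

Run something like that from a daily cron job and the graph updates itself, no manual entry required.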

Anyone who finds that their daily habits are not quite what they should be, or that they're not achieving their long-term goals in the timeframe that they'd like, might like to try Beeminder out for a while. I don't receive any compensation from them; they've just helped me enormously with building good habits, and I like spreading the word about them. Let me know if you start up your own goals! I'm always curious about what things people are invested enough in to start tracking.

Okay! Blogging goal achieved for another week. See you guys in fewer than six days! (I hope.)

Saturday, January 5, 2013

Augmented Reality Captioning



So this December, for the third time in a row, I captioned the New York Public Library's Holiday Songbook via closed device captioning. I wrote a post about it back in 2010, talking about its pros and cons, but here's an actual candid shot from the audience, taken by a caption viewer using an iPad.

It's definitely better than nothing, but as I mentioned in the original post, it has significant drawbacks, such as having to constantly look away from the stage and adjust your eyes between far vision and near vision. In performance situations like this one, open captioning is usually far preferable, since it's on the same plane as the people on stage. For the same reason, many people who use captions in movie theaters prefer open captions to the Captiview devices that sit in a viewer's cupholder. But there's a new accommodation that's starting to be used in both live theater and cinema: caption glasses.

I've heard mixed reviews from Deaf and hard of hearing patrons who've used these glasses. On the one hand, it's nice to be able to have the captions superimposed over the picture. On the other hand, they're apparently quite heavy and bulky (as this cartoon from That Deaf Guy gets across so well), which can cause neck, nose, and ear pain, and you apparently have to hold stock-still or else the captions jump around all over the screen.

I've been interested in augmented reality for a long time, both in its application in CART and captioning specifically and in the possibility of being able to compose text (blog posts, novels, emails, etc.) while walking without bumping into things. It's been on a mostly theoretical level up 'til now; when I wrote my What Is Steno Good For: Mobile and Wearable Computing article, I said "It's a problem that still hasn't been solved to anyone's satisfaction, even after several decades of trying. They're too heavy, too fragile, too stupid-looking, too headache-inducing. But let's posit that someday soon the problem will be solved, and we'll be able to go out and buy lightweight, stylish augmented reality overlay monitors that look just like ordinary pairs of eyeglasses."

Well, the problem might not have been solved completely; there are a few AR glasses such as:

The Vuzix M100



and

Google Glass



Both companies are currently releasing prototypes to developers, with widespread commercial release expected in 2014. As you can see, they don't look just like ordinary eyeglasses, and it's unclear to what extent the eyestrain issues that seem to have been endemic in all previous AR solutions have been fixed. It will probably take a few iterations to iron out all the kinks.

Thing is, it's all gotten a lot less theoretical to me recently. I'm currently working with a first-year dental student. When he becomes a second-year dental student, he will begin working in the clinic with actual patients. Having a tablet display mounted somewhere near the chair is not a very good solution; he'll have to look back and forth between the patient and the tablet, and it will probably be both awkward and inconvenient when trying to do his job. A much better solution would be a pair of AR glasses, and I'm actively looking into my options there. I've joined the AR Glasses LinkedIn group and I've been doing a bit of research on my own, with an eye to purchasing a pair (probably as a developer, since consumer models are still at least a year away) sometime this summer. Of course, like any technology, there's that paradox where the longer you wait, the better the technology you're going to get, but the more you put it off, the less time you have to make adjustments and adaptations so that it works the way it's supposed to. I don't know whether I'll be able to make it work out of the box -- just bring up a browser window, adjust it so that it fills the lower 1/8th or so of the visual field, and have the captions displaying from StreamText without any other modifications -- or if I'll have to commission bespoke software to get my captions onto the glasses without completely obscuring my client's vision.

I'm also concerned about WiFi strength, battery life, physical comfort, and the all-important eyestrain. I wish I could just go try out both the Vuzix and the Glass head-to-head, asking questions from their manufacturers, but I suspect it'll be a bit trickier to get the information I need, and if I wind up buying one of them, it'll be at least $1,500 or so, plus any fees involved in getting customized caption-display software written. Also, dental clinics can be messy places. If the glasses get spattered with water or other less savory substances, can they be cleaned and sterilized without damaging them? Will my client's rapport with the patients be compromised by wearing this strange-looking headgear, or will it blend in with the rest of the dental equipment? There are a lot of questions still to resolve. I'll keep you all updated as I go along.

Friday, December 21, 2012

Ignorance

I might return to the CART Problem Series in the future, but I think the 18 posts that I've made on the subject will do for now. At least for a little while, I'm going to go back to making posts on various and sundry subjects as they occur to me, rather than just framing them as problems and solutions.

I mentioned in my Communicating Sans Steno post that my dad has had significant hearing loss for as long as I can remember, but since he was largely in denial about it for most of that time, I didn't learn much about how hearing loss worked or what it was like to deal with in daily life, so I'm ashamed to admit that I've been staggeringly insensitive to at least two hard of hearing people I've known over the years.

The first was a girl I knew when I was a teenager. Her parents were friends with my parents, so she and her brother used to come over to my house when we had parties, and sometimes I'd go to theirs. We went to different schools, so we only saw each other a few times a year, but we always had a good time when we got together. I didn't notice that she wore hearing aids until several years after we first met, when she mentioned that she'd won an essay scholarship for teenagers with hearing loss. Like the ignorant blunderer I was, I said, "Wait, you have hearing loss?" For the first time I noticed the aids. "But you wear hearing aids." "Yes," she replied patiently. "So... Why do you qualify for a scholarship if your hearing aids have already fixed the problem?" Like so many people, I'd assumed that if my eyeglasses were able to correct my severe myopia to normal vision, then hearing aids would be able to do the same thing for anything short of total deafness. I had no idea until almost 20 years later that amplification often doesn't improve clarity, that some frequencies can't be amplified at all due to permanent loss of specific cochlear hair cells, that hearing is an extremely complex mechanism that doesn't have an easy or complete fix when any of its components malfunction. My misrefracting cornea could be completely compensated for by a piece of light-bending plastic. Even with hearing aids, my friend's hearing loss remained something she needed to reckon with.

I'd never noticed her misunderstanding me or asking me to repeat myself when we talked (the way my dad often did), and I didn't realize that the casual one-on-one conversations we had at parties were totally unlike her situation in the classroom, where she was learning new material, sat several feet away from the teacher (losing any ability to lipread, especially since the teacher faced the board most of the time), and was forced to work twice as hard as her classmates to get the same amount of information through her ears and into her brain. The fact that my friend had managed to do this all her life, getting excellent grades and becoming an extremely literate and eloquent writer, totally blew past me. I took it for granted; instead of congratulating my friend on her essay, I was rude and dismissive. I haven't seen my old friend since high school, but if I ever run into her again, I'll apologize and explain that I know a lot more now than I did then -- not that that's any excuse. If I had actually asked her to tell me more about the scholarship instead of assuming that it made no sense, I would have learned something that day, instead of having to wait 20 years to realize how much of a jerk I'd been.

The second incident is even more problematic, because I was in a position of authority. At my college, all sophomores are required to take a year of music theory, even though their degree (there's only one on offer) is in Liberal Arts. Music classes are led by professional instructors, but there are also weekly practicum classes, where students are supposed to try out what they've learned in small 4-to-5-person groups. Students with musical experience are chosen to lead those groups as work-study assignments, and because I'd played in the pit orchestra of a summer repertory theater, I got to be one of them. My job involved drilling the students in singing simple multipart songs and rounds, helping them to analyze counterpoint examples discussed in class, and answering any questions they had about the stuff they were studying. The emphasis was on getting an intellectual understanding of the music rather than on becoming accomplished performers, so it wasn't a problem that a few students in each practicum were tone deaf. Most of them just hadn't been exposed to much formal music training, and once I gave them a few exercises, their pitch discrimination and singing tended to improve quite a bit.

There was one student, though, who found both the music class and the practicum intensely frustrating. I noticed his hearing aids right away, because he'd decorated the earmolds in bright colors. He was forthcoming about his hearing loss, and explained that he got very little out of all the singing, analysis, and call-and-response pitch practice, because he couldn't hear any of it accurately enough to duplicate. Again, I assumed that the hearing aids should have solved the problem, and didn't understand what his issue was. He wasn't being graded on his accuracy in singing, and he wouldn't be penalized if he wasn't able to appreciate the aesthetic nuances of the songs. All he had to do was understand the mathematics of the music on the page, so that he could speak about it in class. The singing exercises were just intended to help build first-hand experience with hearing and repeating music in realtime. I figured that his hearing loss put him in the category of the "tone deaf" students, and treated him accordingly. I didn't realize that, unlike theirs, his problem had nothing to do with distinguishing the notes on an intellectual level. Unlike them, he wasn't going to improve with practice. He couldn't hear the difference between pitches no matter how many times they were repeated, so he felt like he was being forced to bang his head against a wall every week in practicum. When he expressed his frustration to me, I thought he was being oversensitive, and just reassured him that it wouldn't affect his grade even if he didn't improve by the end of the semester. I didn't realize the emotional consequences of being asked to do something you weren't physically able to do every week in front of your peers, over and over, and failing every time. He eventually wound up transferring to another college, and I'm afraid that my inability to understand what he was telling me played into that decision.

Like any good essay writer, I've Googlestalked both of these people as research for this post, and today they're both extremely successful and well-respected in their fields. Obviously my ignorance didn't stop them from doing what they wanted to do. But when you add my ignorance to the ignorance of everyone else they had to deal with, how much more exhausting, frustrating, annoying, infuriating did it make their educational experiences, not to mention other parts of their lives? If I hadn't gone into CART, I never would have realized the mistakes I'd made in trusting my own assumptions instead of listening to their experiences. Now I do, and I'm mortified when I think of the way I behaved. There's no easy solution to this problem. One out of every seven people in this country has some degree of hearing loss, and yet so few people actually understand how it works. It'll take a lot to educate all 312 million people about the 45 million who are Deaf, deafened, or hard of hearing, but it badly needs to be done.

Thursday, December 6, 2012

CART Problem Solving: Speech Recognition Part IV

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV





This video is only tangentially relevant to the post; I just found it adorable.

At long last, the final Speech Recognition installment.

CART PROBLEM: Speech recognition is almost always slower and less accurate than stenographic text entry, but there's a strong cultural push to use it, because it's perceived as cheaper and less complicated than hiring a qualified CART provider.

In the previous three posts, I discussed why speech recognition isn't yet capable of providing accurate text when presented with untrained multispeaker audio. I also spoke a bit about why the common assumption that it would only take a bit more development time and processing power to get to 100% accuracy is based on a misunderstanding of how language works and how speech recognition engines try to capture it.

Just because a lizard can play that bug-squishing iPhone game, it doesn't follow that upgrading the lizard to a cat will make it a champion at Dance Dance Revolution. A bigger speech corpus, faster computers, and even a neural-network pattern matching model still don't make up for the essential difference between human and mechanized speech recognition: Humans are able to make use of context and semantic cues; computers are not. Language is full of soundalike words and phrases, and imperfect audio is very much the rule and not the exception in most real-world situations. This means that humans will inevitably have the edge over computers in differentiating ambiguous sound patterns, and the improvements in speech recognition technology will follow an asymptotic trajectory, with each new improvement requiring vastly greater effort to achieve, and the final goal of accurate independent transcription a nearly impossible one, except in controlled settings, with a narrow range of speakers and vocabulary.
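One way to picture that asymptotic trajectory: suppose, generously, that every doubling of corpus size and compute wipes out half of the remaining errors. A few lines of Python make the shape of the curve obvious (the 20% starting error rate is just an illustrative number, not a measured figure):

```python
# Each doubling of effort halves the remaining error rate: fast
# progress at first, then a long crawl that never reaches zero.
def error_after(doublings, initial_error=0.20):
    return initial_error * (0.5 ** doublings)

for d in range(0, 9, 2):
    print(f"{d} doublings of effort -> {error_after(d):.2%} error rate")
```

The first few doublings look miraculous; after that, each enormous new investment buys a sliver of improvement, and the last fraction of a percent never arrives.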

But of course there's a huge difference between a professional voice writer and an untrained one, and an even greater difference between any kind of respeaking system and a speaker-independent speech transcription program. Despite widespread public perception, voice writing isn't actually any easier to do than CART, and in fact is usually quite a bit harder in most circumstances.

The supposedly short training period is voice writing's major selling point over steno (aside from the cost of equipment), but from what I can tell, it's not actually true. You can train someone to a moderate degree of accuracy very quickly; all they have to do is speak into the microphone slowly and clearly, and it'll get a fair amount of words correct. For dictation or offline transcription, this can work well, assuming they have the stamina to speak consistently for long periods of time, because they can speak at a slow pace, stop, go back, and correct errors as they make them. Obviously, the closer a person's voice is to the standard paradigm (male, American, baritone), the better results they'll get. Many people with non-standard voices (such as this deaf female blogger) have a heck of a time getting software to understand them, even speaking as slowly and clearly as they can manage. But even for men with American accents, actual live realtime respeaking at CART levels of accuracy (ideally over 99% correct) is much, much harder than dictation.

* Short words are more difficult for the speech engine to recognize than multisyllabic words are, and are more likely to be ignored or mistranscribed.

* If the voice captioner does mostly direct-echo respeaking, meaning that they don't pronounce common words in nonstandard ways, they have to repeat multisyllabic words using the same number of syllables as in the original audio; if they try to "brief" long words by assigning a voice macro that lets them say the word in one syllable, they run up against the software's difficulty in dealing with monosyllabic words that I mentioned above.

* Because they're mostly saying words in the same amount of time as they were originally spoken (unlike in steno, where a multisyllabic word can be represented by a single split-second stroke), they don't have much "reserve speed" to make corrections if the audio is mistranscribed. They also have to verbally insert punctuation and use macros to differentiate between homonyms, which takes time and can be fatiguing.

* Compensating for the lack of reserve speed by speaking the words more quickly than they were originally spoken can be problematic, because the software is better able to transcribe words spoken with clearly delineated spaces between them, as opposed to words that are all run together.

* This means that if the software makes a mistake and the audio is fairly rapid, the voice captioner is forced to choose between taking time to delete the mistake and then catching up by paraphrasing the speaker, or to keep up with the speaker while letting the mistake stand.

* The skill of echoing previously spoken words aloud while listening to a steady stream of incoming words can be quite tricky, especially when the audio quality is less than perfect; unlike simultaneous writing and listening, simultaneous speaking and listening can cause cross-channel interference.

This doesn't even go into the potential changes in a person's voice brought about by fatigue, allergies, colds, or minor day-to-day variations, all of which can wreak havoc with even a well-trained voice engine.

Low or moderate accuracy offline voice writing = short training period; most people can do it.

Low or moderate accuracy realtime voice writing = somewhat longer training period; machine-compatible voice timbre and accent required.

CART-level accuracy realtime voice writing = extremely long training period; an enormous amount of talent and dedication required.

I want to emphasize again that none of this is meant to denigrate the real skill that well-trained voice writers have developed over their years of training. It's just to point out that while voice writer training seems on the surface to be easier and quicker than steno training, that's very seldom the case in practice, as long as appropriate accuracy standards (99% or better) are adhered to. The problem comes in when the people paying for accommodations, either due to a shortage of qualified steno or voice writers, or due to cost considerations, decide that 95% or lower accuracy is "good enough" and that deaf people should be able to "read through the mistakes".
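To put numbers on why "read through the mistakes" is such a raw deal, here's the back-of-envelope arithmetic behind those accuracy thresholds, sketched in Python (180 wpm is a typical lecture pace; the figures scale linearly with speed):

```python
# "95% accurate" sounds close to "99% accurate," but the reader's
# experience differs by a factor of five.
def errors_per_hour(wpm, accuracy):
    words_per_hour = wpm * 60
    return words_per_hour * (1 - accuracy)

for acc in (0.99, 0.95):
    print(f"{acc:.0%} accuracy at 180 wpm: "
          f"{errors_per_hour(180, acc):.0f} errors per hour")
```

At 99%, that's roughly 108 errors per hour; at 95%, 540 of them, or nine per minute, each one a little puzzle the deaf reader has to solve while the lecture keeps moving.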

So let's talk about some other potential competitors to CART services. These fall into two general categories: Offline transcription and text expansion. I think I'll leave text expansion for a future series of posts, since it's a fairly complex subject. Offline transcription is much simpler to address.

I've seen several press releases recently from companies bragging about contracts they've secured with universities, claiming to offer verbatim captioning at rock-bottom prices. The catch is that the captioning isn't live. No university or conference organizer I know of is foolhardy enough to set completely automated captions up on a large screen in front of the entire audience for everyone to see. The mistakes made by automated engines are far too frequent and hilarious to get away with. But they will, it seems, let lectures be captured by automated engines, then give the rough transcripts to either in-house editors (mostly graduate students) or employees of the lecture-capturing companies, to produce a clean transcript at speeds that are admittedly somewhat better than they used to be, back when making a transcript or synchronized caption file offline usually involved a qwerty typist starting from scratch.

I'm worried that this is starting to be perceived as an appropriate accommodation for students with hearing loss, because there's a crucial piece missing from the equation: Realtime access. Imagine a lecture hall filled with 250 students at a well-regarded American private university, sitting with laptops and notebooks and audio recorders, facing the PowerPoint screen, ready to learn. It's Monday morning. In walks the professor, who pulls up her slideshow and begins the lecture.

PROFESSOR: Tanulmányait a kolozsvári zenekonzervatóriumban, majd a budapesti Zeneakadémián végezte, Farkas Ferenc, Bárdos Lajos, Járdányi Pál és Veress Sándor tanítványaként. Tanulmányai elvégzése után népzenekutatással foglalkozott. Romániában ösztöndíjasként több száz erdélyi magyar népdalt gyűjtött.

After a few seconds, the students start looking at each other in confusion. They don't speak this language. What's going on? The professor continues speaking in this way for 50 minutes, then steps down from the podium and says, "The English translation of the last hour will be available within 48 hours. Please remember that there is a test on Wednesday."

These students are paying $50,000 or $60,000 a year to attend this school. They're outraged. Not only do they have less than 24 hours to study the transcript before the test, but they were unable to ask questions or to see the slides juxtaposed with the lecture material. Plus they just had to sit there for 50 minutes, bored and confused, without the slightest idea of what was going on. It wouldn't stand. The professor would be forced to conduct future lectures in English rather than Hungarian, or risk losing her job. This is the state of affairs for deaf and hard of hearing students offered transcripts rather than live captioning. It deprives them of an equal opportunity for learning alongside their peers, and it forces them to waste hours of their life in classes that they can't hear and therefore can't benefit from. I'm waiting for the day when the first student accommodated in this way sues their school for violating the Americans with Disabilities Act, and at that point the fast-turnaround transcript and captioning companies are going to be in a good deal of trouble. There is the possibility of training realtime editors who might be able to keep up with the pace of mistakes and correct each error a few seconds after it's made before the realtime is delivered to the student, but that adds yet another person into the workflow, reducing the savings the university was hoping to get when they laid off their CART providers.
In some classes, a relatively untrained editor with a qwerty keyboard will be able to zap the errors and clean up the transcript in realtime, but in others -- where the professor doesn't speak Standard Male American (true for a significant and increasing number of professors in the US college system), or there's too much technical jargon, or the noise of the ventilation system interferes with the microphone, or any of a hundred other reasons -- the rate of errors made by the speech engine will outpace the corrections any human editor can make in realtime.

So what lies ahead in the future? Yes, speech recognition engines will continue to improve. Voice writer training times might decrease somewhat, though fully accurate automated systems will stay out of reach. People don't realize that speech is an analogue system, like handwriting. Computer recognition of the printed word has improved dramatically in the past few decades, and even though transcripts produced via OCR still need to be edited, it's become a very useful technology. Recognition of handwriting has lagged far behind, because the whorls and squiggles of each handwritten letter varies drastically from individual to individual and from day to day. There's too much noise and too little unambiguous signal, apart from the meaning of the words themselves, which allows us to decipher in context whether the grocery list reads "buy toothpaste" or "butter the pasta". Human speech is much more like handwriting than it is like print. Steno allows us to produce clear digital signals that can be interpreted and translated with perfect accuracy by any computer with the appropriate lexicon. Speech is an inextricably analogue input system; there will always be fuzz and flutter.
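To make the digital-versus-analogue point concrete: translating a steno stroke is a plain dictionary lookup, with nothing probabilistic about it. A toy sketch in Python (these stroke spellings are illustrative, not pulled from a real Plover dictionary):

```python
# A steno stroke arrives as an unambiguous digital key, so translation
# is exact lookup -- no probabilities, no soundalikes, no fuzz.
steno_dict = {
    "KAT": "cat",
    "TKOG": "dog",
    "TPHO*ET": "note",
}

def translate(strokes):
    return " ".join(steno_dict.get(s, "[untranslated]") for s in strokes)

print(translate(["KAT", "TKOG"]))  # exact match every time: "cat dog"
```

Either the stroke is in the lexicon or it isn't; there's no in-between state for the computer to guess at, which is exactly what speech input can never offer.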

Monday, June 25, 2012

Sorry for the Radio Silence

Apologies again for the last several weeks of no posts. I had today's blog post all sketched out, but then a situation came up and I don't think I'll be able to actually write it today. I'm currently helping a family member through an ongoing crisis, and it's soaking up a lot of my posting time. Hopefully I'll be able to get back on track soon.

Monday, June 4, 2012

Taking a Mulligan

Hey, guys. I'm really sorry, but Speech Recognition IV is going to have to wait until next week. I've just got too much to do, preparing for my three-hour Steno Crash Course and Plover Programming Sprint at PyGotham this Friday and Saturday. In the mean time, I'll tide over your desire for speech recognition schadenfreude with this:

Gazpacho Soup Day for Siri

Monday, May 28, 2012

CART Problem Solving: Speech Recognition Part III

CART Problem Solving Series

Sitting Apart
Handling Slides
Classroom Videos
Latin
Superscript and Subscript
Schlepping Gear
Late Hours
Expensive Machines
Communicating Sans Steno
Cash Flow
Lag
Summer
Test Nerves
Ergonomics
Speech Recognition, Part I
Speech Recognition, Part II
Speech Recognition, Part III
Speech Recognition, Part IV



Apologies for the lack of captioning in the first few seconds of the video, but I had to post it. It's a fantastic illustration of not just how often automatic speech recognition gets things wrong, but how wrong it tends to get them. There's a whole series of these Caption Fail videos, but this is the most work safe (and, in my opinion, funniest) of the lot.

See, because computers aren't able to make up for lossy audio by filling in the gaps using semantic and contextual clues, they make mistakes that a human transcriber would never make in a million years. On the one hand, you get "displaces" instead of "this place is". That's reasonable enough, and out of context a human might have made that mistake. But when a human hears "fwooosssssh" as a man tries to imitate the sound of the ocean with his mouth, the computer continues to try to read it as speech, and translates it as "question of." Not only is it unable to differentiate between words and sound effects, but "fwooosssssh" doesn't sound anything like "question of." The algorithms that computers use to match similar sound patterns to each other are so alien to our way of thinking that, unlike mistakes made by humans, we can't even hope to read through them to figure out what the correct version should have been.

I promised you some illustrations to use when trying to explain why accurate speaker-independent automated speech recognition is not "just around the corner", despite the popular conception that it is. I think it's useful to consider your audience when trying to explain these. If you're talking to computer people, bringing in the parallels with OCR might be more effective than if you're talking to people who haven't used that sort of technology. If someone has never heard a beatboxer, my voicewriting-to-steno analogy comparing beatboxing to drumming won't mean much. Try to get an idea of the person's frame of reference first, and then construct your argument.

Belize, the Passport, and the Pixelated Photo

You know those procedural crime shows? Where the first 20 minutes is taken up with chasing seemingly disconnected clues, and then at last one of the detectives has a sudden flash of insight, puts all the pieces together, and knows where to find the suspect? Sometimes those shows do a very misleading thing. They'll get a blurry photo, something captured by a security camera or a helicopter, and the detective will say, "There! Zoom in right there!" Then the screen will shift and you'll get the same photo, except that the formerly blurry house will be nice and clear, and in the window... Is that a man with a gun? The detective will shout, "Zoom in on that flash of light there!" Again, the pixels will smear and redraw themselves, and look! Reflected in the man's glasses is another man in a Panama hat, wearing a mechanic's shirt with "Jerry" embroidered on the pocket and wielding a tire iron!

It's all very exciting, but it's also very wrong. If you take a blurry photograph and blow it up, you don't get a clearer view of its details; you just get a blurrier photograph with larger pixels. This is the essence of lossiness, but in an image rather than a sound. That visual information was lost when the photo was taken, and no amount of enhancement or sharpening tools will ever get it back. Unlike computers, humans are extremely good at inferring ways of filling in the gaps of lossy material by using lateral clues. If the hard-won blurry photo of the criminal's coffee table just before he set his house on fire depicts a guidebook, a passport, and a bottle of SPF 75 suntan lotion, a computer will either throw up its hands and say, "Not enough information found" or it will produce complete gibberish when trying to decipher its details. A human, on the other hand, will see that the letters on the guidebook, while extremely indistinct, seem to have approximately five spaces between them. The first letter is either an R, a P, or a B, and the last one is quite possibly an E. The passport shows that the criminal will be leaving the country, and the suntan lotion indicates that the destination gets lots of sun. The savvy human detective paces around a bit and says -- I've got it! Belize! It's the only country that fits the pattern! Then they hop on the next flight and catch the criminal before the final credits.
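If you want to watch the "enhance" myth fall apart in a few lines of code, here's a toy sketch -- a made-up two-by-two "photo" of brightness values, not any real image library. Blowing the image up just repeats the pixels it already has, so the set of distinct values -- the actual information -- is exactly the same before and after:

```python
# A tiny "blurry photo": a 2x2 grid of brightness values.
tiny = [
    [10, 200],
    [200, 10],
]

def upscale(img, factor):
    """Enlarge an image by repeating each pixel factor x factor times.

    This is nearest-neighbor scaling: no new detail is created,
    each original pixel just becomes a bigger square.
    """
    return [
        [img[r // factor][c // factor]
         for c in range(len(img[0]) * factor)]
        for r in range(len(img) * factor)
    ]

big = upscale(tiny, 2)

# The distinct pixel values are identical before and after "zooming in":
print(sorted(set(v for row in tiny for v in row)))  # [10, 200]
print(sorted(set(v for row in big for v in row)))   # [10, 200]
```

Fancier interpolation would smooth the edges between those squares, but it's still just averaging the values that are already there; it can't recover the man in the Panama hat.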

The humans didn't need to fill in the missing letters on the guidebook to figure out the word it spelled, because they could draw on a host of non-textual clues to make up for the lossiness. Computers programmed to recognize text can't draw on all that subsidiary information, so humans will always have an advantage in recognizing patterns, drawing inferences, and correcting errors caused by lossy input.
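For fun, here's roughly what the detective's reasoning looks like stripped down to a toy program, with a hypothetical shortlist of sunny destinations standing in for the passport-and-suntan-lotion clues. The blurry title becomes a pattern -- about six letters, starts with R, P, or B, ends with E. The pattern-matching part is trivial; it's assembling the constraints from lateral clues in the first place that computers can't do:

```python
import re

# The blurry guidebook title, as the detective read it:
# six-ish letters, first letter R, P, or B, last letter probably E.
pattern = re.compile(r"^[RPB].{4}E$", re.IGNORECASE)

# Hypothetical shortlist of sunny destinations, narrowed down
# by the passport and the SPF 75 -- the part no regex can do.
candidates = ["Belize", "Brazil", "Panama", "Peru", "Portugal", "Rwanda", "Bolivia"]

matches = [c for c in candidates if pattern.match(c)]
print(matches)  # ['Belize']
```

The one-line filter at the end is the easy part; everything encoded in `pattern` and `candidates` is lateral, contextual knowledge that had to come from a human.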

Why Hasn't Microphone Technology Improved More in 100 Years?

This leads me to the subject of speech recognition, which is a much thornier problem than text recognition. The answer to the question above is simple: it has. Listen to an old Edison wax disc record and compare it to CD-quality sound produced by today's audio engineers, and you can hardly claim that recording technology has been stagnant. But the question behind my question is this: with all this fantastic audio recording technology, why is it still close to impossible to get quality audio in a room with more than a handful of people? Make sure every speaker passes around a hand mic, or wears a lapel mic, or goes up to talk into the lectern mic, and you're fine. Put the best and most expensive omnidirectional microphone on the market in the center of a conference table with half a dozen people sitting around it, and you're sunk. Everyone sounds like they're underwater. The guy adjusting his tie near the microphone sounds like a freight train, while the guy speaking clearly and distinctly at the other end of the table sounds like he's gargling marbles. Even $4,000 hearing aids have this problem. They're simply not as good as human ears (or, more accurately, the human brain) at filtering out meaningless room noise and selectively enhancing the speech of people at a distance. That's why onsite CART is often dramatically more accurate than remote CART, no matter how much money is spent on microphones and AV equipment. When the bottleneck of sound input is a microphone, it's limited by its sensitivity, its distance from the speaker, and any interference between the two. That's the problem that still hasn't been solved, over a hundred years after the invention of recorded audio.

Having to transcribe terrible audio, guessing at omissions and listening to a five-second clip of fuzzy sound a dozen times before finally figuring out from context what it's actually about, has been a real lesson in empathy for me. The frustration I feel in my home office, clicking the foot pedal over and over to replay imperfect audio, is nothing compared to what my hard of hearing clients feel in the course of their lives every day. They don't have a foot pedal to rewind the last few seconds of conversation, and they're not even getting paid to do this endlessly unrewarding detective work. I suppose I should feel even more sorry for the poor computers, which are trying to deal with substandard audio but don't have the luxury of lateral thinking or contextual clues or the ability to differentiate between soundalike phrases semantically. I've often wanted to rent out an hour of an audiologist's time and hook the most popular commercial speech recognition software up to their test system. I'd be very interested to see how it did. It could probably recognize all the tones perfectly well. It might even be all right at the individual words. But unlike a human with hearing loss, who usually does better at guessing words in the context of sentences than at hearing them on their own, I bet the software would do considerably worse, and would probably come out with an audiogram in the range of moderate to severe hearing loss, especially if any of the tests were given with simulated noise interference mixed into the audio feed. I could be wrong, of course; I haven't yet gotten a chance to actually do this. But I'd be very interested to find out.

Well, this has run rather longer than I thought it would. I guess I'm going to have to do Speech Recognition Part IV next week. 'Til then!