UCSC Review Winter 1995

What You See is What You Hear

If an elderly friend ever tells you that he hears the television better with his glasses on, consider this before you chuckle: It turns out that what we see when we listen is key to our ability to understand spoken language.

For thirteen years, psychology professor Dominic Massaro has led a team exploring the mystery of speech comprehension by studying what is called visible speech--the way our faces move when we talk. He has developed a state-of-the-art computerized "talking head" that allows him to isolate visible speech and see how it combines with what we hear to produce the wonder of understanding.

The technology is now so advanced that Massaro's system could soon be used by the hearing impaired, by speech therapists treating patients with brain injuries, by language instructors--even by technology wizards designing a "face-to-face" form of electronic mail.

Spoken language is a universal means of communication, yet relatively little is known about how people comprehend speech. "We take it for granted, but speech comprehension is an amazing accomplishment," says Massaro. "No computer has been programmed to understand speech as well as a three-year-old child."

Auditory cues clearly contribute to speech comprehension, but hearing alone is not enough. Lipreading illustrates the value of visible speech to the hearing impaired, but it is also important to those who have not suffered hearing loss. In laboratory studies, people with normal hearing comprehend only about 25 percent of a message when they rely solely on lipreading; those who receive only audio signals in a noisy environment like a cocktail party do about as poorly. The bottom line is that comprehension soars when auditory and visual cues are combined: When the same research subjects lipread and receive audio messages, the rate of comprehension jumps to about 80 percent.

"We use visual and auditory cues simultaneously to make sense of what people are saying," explains Massaro. "Any one cue alone generally isn't enough."

Massaro's "talking head" enables researchers to isolate those visual and auditory cues to study the effects of each. Developed by Massaro and research associate Michael Cohen from earlier work by Fred Parke at the University of Utah, the three-dimensional image resembles a mannequin, with moving eyes, brows, and mouth. Massaro types in English text, and the image on the computer screen "talks." An underlying grid allows researchers to manipulate the jaw, lips, and tongue to mimic human speech. "It's like the face is a puppet and we've got 60 strings we're controlling it with," says Massaro. The latest development is texture mapping, which allows Massaro's team to wrap any still video picture over the framework to produce a more natural image.
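The article doesn't go into the mechanics, but the "puppet with 60 strings" idea can be sketched in a few lines of code: each string is a numeric control parameter, each speech sound has target values for those parameters, and animation is smooth interpolation from one set of targets to the next. Everything below--the parameter names, the three-entry phoneme table, the target values--is an illustrative assumption, not Massaro and Cohen's actual software.

```python
# A minimal sketch of parametric facial animation in the spirit of
# Parke-style models. All parameter names and target values here are
# hypothetical, chosen only to show the control scheme.

from dataclasses import dataclass

@dataclass
class FaceState:
    """Three of the roughly 60 'strings'; a real model tracks many more."""
    jaw_open: float = 0.0       # 0 = closed, 1 = fully open
    lip_round: float = 0.0      # 0 = spread, 1 = fully rounded
    tongue_height: float = 0.0  # 0 = low in the mouth, 1 = against the palate

# Hypothetical per-phoneme targets for the three parameters above.
PHONEME_TARGETS = {
    "b": FaceState(jaw_open=0.05, lip_round=0.3, tongue_height=0.2),
    "a": FaceState(jaw_open=0.8, lip_round=0.1, tongue_height=0.1),
    "l": FaceState(jaw_open=0.3, lip_round=0.2, tongue_height=0.9),
}

def interpolate(start: FaceState, end: FaceState, t: float) -> FaceState:
    """Blend two parameter sets; animating is just gliding between targets."""
    return FaceState(
        jaw_open=start.jaw_open + (end.jaw_open - start.jaw_open) * t,
        lip_round=start.lip_round + (end.lip_round - start.lip_round) * t,
        tongue_height=start.tongue_height
        + (end.tongue_height - start.tongue_height) * t,
    )

def animate(phonemes, frames_per_phoneme=4):
    """Yield one FaceState per video frame as the face moves through the sounds."""
    state = FaceState()
    for p in phonemes:
        target = PHONEME_TARGETS[p]
        for f in range(1, frames_per_phoneme + 1):
            yield interpolate(state, target, f / frames_per_phoneme)
        state = target

for i, frame in enumerate(animate(["b", "a", "l"])):  # roughly "ball"
    print(f"frame {i:2d}: jaw={frame.jaw_open:.2f} "
          f"lips={frame.lip_round:.2f} tongue={frame.tongue_height:.2f}")
```

In a real system, each frame's parameter values would deform a three-dimensional polygon mesh of the face; texture mapping then pastes a video image of a real face onto that moving mesh.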

Researchers can also mix and match audio and video cues. In one experiment, they programmed the talking head to say "doll," and dubbed it with an auditory recording of the word "ball." The result? Most people thought the head said "wall." Similarly, an auditory recording of the nonsense sentence, "My bab pop me poo brive," dubbed onto a video of the talking head saying "My gag kok me koo grive," makes viewers hear "My dad taught me to drive."

In probing how listeners make sense of such gibberish, Massaro says his team is uncovering fundamental rules about the way the mind assimilates language.

Massaro's theory begins with the assumption that speech comprehension is not an innate skill. Unlike those who believe that speech is highly specialized and originates in a particular section of the brain, Massaro asserts that people learn speech just as they learn other skills. "Humans use multiple sources of information to understand spoken language," he says. "In that way, speech is like other forms of pattern recognition and categorization."

Massaro explains that humans learn to recognize objects by looking for a variety of attributes, rather than by forming a specific mental template. For example, although there is tremendous variety in the types of telephones available, people recognize the objects as phones by identifying features such as the number pad and the receiver. That skill enables people to identify telephones of different sizes, shapes, or colors even when they appear out of context. It is this process of combining attributes that Massaro believes is analogous to how humans comprehend speech--by combining visual and auditory cues.

That same ability helps researchers understand why people believe the talking head is saying "wall," when the visual image is pronouncing "doll" and the sound is "ball." "People are always trying to impose meaning at the highest possible level, even when they're given conflicting or ambiguous information," explains Massaro. "Although you might expect people to ignore either the sound or the visible speech, in fact they will try to put all of the pieces together in the way that makes the most sense."
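The article describes this result in words; one way to make the "put all the pieces together" idea concrete is a sketch in which each cue assigns every candidate word a degree of support, and the percept is the candidate with the highest combined support. The support numbers below are invented for illustration--they are assumptions, not measurements from Massaro's experiments.

```python
# A minimal sketch of multiplicative cue integration. The candidate words
# come from the experiment described above; every support value is a
# made-up illustration, not experimental data.

# Degree to which each cue supports each candidate, on a 0-to-1 scale.
visual_support = {"ball": 0.1, "doll": 0.8, "wall": 0.6}  # the lips look like "doll"
audio_support = {"ball": 0.8, "doll": 0.3, "wall": 0.6}   # the soundtrack says "ball"

# Combine the cues multiplicatively, then normalize into probabilities.
combined = {w: visual_support[w] * audio_support[w] for w in visual_support}
total = sum(combined.values())
probabilities = {w: s / total for w, s in combined.items()}

for word, p in sorted(probabilities.items(), key=lambda kv: -kv[1]):
    print(f"{word}: {p:.2f}")
```

With these illustrative numbers, "wall" wins: it is the one candidate that neither cue strongly rules out, so its combined support beats both "doll" (undermined by the sound) and "ball" (undermined by the lips).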

Massaro's lab is one of only a handful of facilities around the world that are using facial animation to explore speech comprehension. His scientific contributions are matched by the potential for exciting practical applications in several arenas.

Massaro envisions his talking head being used to translate printed books into visible speech for the hearing impaired. In its transparent form, the head could show language students and patients in speech rehabilitation the precise position of the tongue during the formation of sounds they don't have in their native language or that they've lost through injury. Eventually, Massaro predicts, computerized talking heads will be commonplace at home and work.

Although no one has shown that people comprehend and retain more of what they learn from reading than from listening, society's bias in favor of the written word could slow the widespread adoption of talking-head technology. But other pressures could also come into play, notes Massaro.

"Society may get to the point where universal literacy is abandoned because it's considered too expensive," he says. "I can imagine a day when people might conclude that reading is not a natural act because it requires formal instruction--unlike spoken language, which infants develop at a very young age. In that case, this technology could provide an alternative to books. People could watch and listen to novels, instead of reading them.

--Jennifer McNulty