
Press Releases

September 9, 1994 Contact: Jennifer McNulty (408/459-2495)

UC SANTA CRUZ TEAM PIONEERS SPEECH COMPREHENSION RESEARCH

Development of computerized "talking head" allows researchers to explore the role of visible cues, which combine with auditory elements to facilitate recognition of spoken language

FOR IMMEDIATE RELEASE

SANTA CRUZ, CA--Spoken language is a universal means of communication, yet relatively little is known about how people comprehend speech. Auditory cues clearly play a role in speech comprehension, with variables such as pitch, duration, and loudness contributing to the understanding of verbal messages.

But a team of researchers at the University of California, Santa Cruz, is exploring another aspect of verbal communication: visible speech. Led by psychology professor Dominic Massaro, the team has investigated the critical role played by what we see when we listen, as well as what we hear.

The ability of the hearing impaired to augment their hearing with lip reading is an everyday indication of the value of visible speech, which is also important to those who have not suffered hearing loss. "You've probably heard elderly friends say that they hear the television better with their glasses on," says Massaro. "That's because visual cues are important to our ability to understand speech. We're constantly processing signals we pick up visually, as well as those we pick up aurally."

Research reveals that listeners who rely solely on lip reading have a comprehension rate of about 25 percent; those who receive only audio signals in a noisy environment like a cocktail party have a similar rate of comprehension. However, when the same listeners lip read and receive audio messages, the rate of comprehension jumps to about 80 percent.

"We take it for granted, but speech comprehension is an amazing accomplishment," says Massaro, who began exploring the link between auditory and visible speech about thirteen years ago. "No computer has been programmed to understand speech as well as a three-year-old child."

Massaro's team has developed sophisticated computer technology to study how people perceive and recognize speech by eye and how they combine these perceptions with what they hear. Massaro and research associate Michael Cohen have created a computerized "talking head" that produces synthetic speech, enabling researchers to isolate visual and auditory cues received by listeners.

The three-dimensional computerized image resembles a mannequin, with moving eyes, brows, and mouth. In full color, the face is shaded to look more realistic and the features move in real time. The underlying grid allows researchers to control about 60 parameters to animate the face and create the movements of speech.
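The release does not describe the animation software itself, but the idea of a polygon face driven by roughly 60 control parameters can be pictured with a minimal sketch. The mesh size, parameter meanings, and linear deformation rule below are illustrative assumptions, not the UCSC code.

    import numpy as np

    # Minimal sketch of a parameter-driven polygon face, in the spirit of the
    # roughly 60 control parameters mentioned above. Mesh size, parameter
    # meanings, and the additive deformation rule are invented for illustration.
    N_VERTICES = 900   # vertices in the polygon mesh (arbitrary)
    N_PARAMS = 60      # e.g., jaw rotation, lip protrusion, brow raise, ...

    neutral_face = np.zeros((N_VERTICES, 3))                        # resting vertex positions
    displacement = np.random.randn(N_PARAMS, N_VERTICES, 3) * 0.01  # one displacement field per parameter

    def animate(params):
        """Return deformed vertex positions for a vector of 60 parameter values."""
        # Each parameter scales its own displacement field; summing them gives one frame.
        return neutral_face + np.tensordot(params, displacement, axes=1)

    # One frame of speech: open the jaw slightly (parameter 0) and round the lips (parameter 7).
    frame = np.zeros(N_PARAMS)
    frame[0], frame[7] = 0.4, 0.8
    vertices = animate(frame)   # shape (900, 3), ready to render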

Using the computer to produce auditory synthetic speech gives the researchers control that is not always possible with natural speech. The animated face allows them to manipulate precise movements of the face--including the jaw, lips, and tongue--that make up the visible components of speech. Synthetic speech also allows researchers to produce novel sounds or ambiguous syllables--precisely halfway between "ba" and "da," for example--which aid them in their investigations. For instance, researchers can program the talking head to say "doll," and dub it with an auditory recording of the word "ball." The result? Most people watching the talking head on a television monitor will hear "wall." Similarly, if a researcher makes an auditory recording of the nonsense sentence, "My bab pop me poo brive," and dubs it onto a video of the talking head saying "My gag kok me koo grive," viewers will hear "My dad taught me to drive."
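One simple way to picture how a syllable "precisely halfway" between "ba" and "da" can be created is to interpolate the facial control parameters between the two endpoints. The sketch below illustrates that idea only; the parameter values are invented, and this is not the lab's actual procedure.

    import numpy as np

    # Hypothetical control-parameter settings at syllable onset for "ba" (lips closed)
    # and "da" (tongue tip raised). The values are made up for illustration.
    ba_params = np.array([1.0, 0.0, 0.2])   # [lip closure, tongue-tip raise, jaw opening]
    da_params = np.array([0.0, 1.0, 0.4])

    def continuum(steps=5):
        """Parameter settings evenly spaced from "ba" to "da", including the
        ambiguous midpoints that a human speaker cannot produce."""
        return [ba_params + t * (da_params - ba_params)
                for t in np.linspace(0.0, 1.0, steps)]

    for i, p in enumerate(continuum()):
        print(f"step {i}: lip={p[0]:.2f}  tongue={p[1]:.2f}  jaw={p[2]:.2f}")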

Only a handful of labs around the world are studying facial animation, and even fewer are using it to probe the mystery of speech comprehension. Massaro's lab is one of the few facilities to combine a talking head with synthetic voices. Massaro's state-of-the-art system, known as a text-to-speech system, allows researchers to type in English text, which the computer produces as spoken language, complete with corresponding facial movements. "We think we're uncovering fundamental rules about the way the mind works with language," says Massaro.
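The stages of such a text-to-speech pipeline can be pictured roughly as follows; the function names and the toy rules inside them are placeholders standing in for real synthesis components, not the actual system's interface.

    # Rough sketch of a text-to-visible-speech flow: text -> phonemes -> audio + face track.
    # Every function body here is a stand-in for a real synthesis component.

    def text_to_phonemes(text):
        """Convert English text to a phoneme sequence (toy rule: one symbol per letter)."""
        return [ch for ch in text.lower() if ch.isalpha()]

    def phonemes_to_audio(phonemes):
        """Synthesize the auditory speech waveform (placeholder)."""
        return b"".join(p.encode() for p in phonemes)

    def phonemes_to_face_track(phonemes):
        """Produce a time-aligned schedule of facial control parameters (placeholder)."""
        return [{"phoneme": p, "jaw": 0.3, "lips": 0.5} for p in phonemes]

    def speak(text):
        phonemes = text_to_phonemes(text)
        return phonemes_to_audio(phonemes), phonemes_to_face_track(phonemes)

    audio, face_track = speak("My dad taught me to drive")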

Massaro envisions his "talking head" being used to provide the hearing impaired with the same visual cues that are produced during natural speech. Massaro sees the potential for translating printed books automatically into visual speech for the hearing impaired, just as books are translated by auditory speech synthesizers for the visually impaired. "Once we understand enough about how we recognize speech, we should be able to program visible speech synthesizers to translate an auditory message into one that shows a talking head, which would be accessible to those who can't hear," says Massaro, whose work has been supported in part by the National Institute on Deafness and Other Communication Disorders.

Ultimately, the technology may also be used in second-language learning, in speech therapy for patients with brain injuries, and in the next generation of computer communication--as a "face-to-face" form of electronic mail, for example. "We can make the head transparent so students and patients can see through the cheeks to see the precise position of the tongue during the formation of sounds they don't have in their native language or that they've lost through injury," says Massaro. "Eventually, I could see computerized talking heads capable of expressing emotion becoming commonplace at home and work."

In the lab, Massaro and his team have developed experiments to test the interaction of visible and auditory speech, and they have probed how it is that listeners make sense of gibberish sentences, such as "My bab pop me poo brive." Massaro's theory begins with the assumption that speech comprehension is not an innate skill. Departing from those who believe that speech is highly specialized and originates in a particular section of the brain, Massaro asserts that people learn speech just as they learn other skills. "We believe that humans use multiple sources of information to understand spoken language," he says. "In that way, speech is like other forms of pattern recognition and categorization, which integrate multiple sources of information. It appears to be a natural function of human endeavor."

Massaro explains that humans learn to recognize objects by looking for a variety of attributes, rather than by forming a specific mental template. For example, although there is tremendous variety in the types of telephones available, people recognize the objects by identifying features such as the number pad and the receiver. That skill enables people to recognize telephones of different sizes, shapes, or colors--even when they are out of context. It is this process of combining attributes that Massaro believes is analogous to how humans comprehend speech--by combining visual and auditory cues.

That same ability helps researchers understand why people believe the talking head is saying "wall," when the visual image is pronouncing "doll" and the sound is "ball," says Massaro. "People are always trying to impose meaning at the highest possible level, even when they're given conflicting information," he explains. "Although you might expect people to ignore either the sound or the visible speech, in fact they use all the evidence and come up with the best solution. When there is inconsistent or ambiguous information, people will try to put all of the pieces together in the way that makes the most sense."
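That idea of weighing all the evidence can be pictured with a generic integration rule: give each candidate word a degree of support from the sound and from the visible mouth movements, multiply the two, and normalize. The numbers and the multiplicative rule below are illustrative assumptions, not data or a model taken from Massaro's experiments.

    candidates = ["ball", "doll", "wall"]

    # Made-up support values (0 to 1) for a talking head mouthing "doll" dubbed with
    # the sound "ball": each source strongly contradicts one candidate, while "wall"
    # is moderately consistent with both.
    auditory_support = {"ball": 0.8, "doll": 0.1, "wall": 0.6}
    visual_support   = {"ball": 0.1, "doll": 0.8, "wall": 0.6}

    def integrate(aud, vis):
        """Multiply the support from each source, then normalize across candidates."""
        combined = {c: aud[c] * vis[c] for c in candidates}
        total = sum(combined.values())
        return {c: value / total for c, value in combined.items()}

    print(integrate(auditory_support, visual_support))
    # Roughly {'ball': 0.15, 'doll': 0.15, 'wall': 0.69}: "wall" wins because it is
    # the best compromise between the conflicting auditory and visual evidence.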

Massaro acknowledges society's bias in favor of the written word, although he is quick to challenge it. "Although people would like to say otherwise, nobody has shown that people comprehend and retain more of what they learn from reading than listening," says Massaro. "Literature made civilization because of its permanence and the ability to pass on knowledge. If electronics had evolved rather than stone tablets, I don't think there'd be much difference."

####

Editor's note: Professor Massaro can be reached at (408) 459-2330. A videotape of Massaro's "talking head" is available through the UCSC Public Information Office; call Jennifer McNulty at (408) 459-2495 for more information.

(This release is also available on UC NewsWire, the University of California's electronic news service. To access by modem, dial 1-209-244-6971.)

Figure 1: Polygons make up the framework (left) of the "talking head" developed by psychology professor Dominic Massaro and research associate Michael Cohen. The photo at right shows the head as it is used in research studies.

Figure 2: The above series illustrates the face at the onset of pronunciation of the syllables "ba" and "da." The three images in the center represent the ambiguous syllables in the continuum between "ba" and "da" that humans are unable to produce.


