November 29, 2004 | By Michael Stroh, SUN STAFF
Shiva Sundaram spends his days listening to his computer laugh at him. Someday, you may know how it feels.
The University of Southern California engineer is one of a growing number of researchers trying to crack the next barrier in computer speech synthesis - emotion. In labs around the world, computers are starting to laugh and sigh, express joy and anger, and even hesitate with natural ums and ahs.
Called expressive speech synthesis, "it's the hot area" in the field today, says Ellen Eide of IBM's T.J. Watson Research Center in Yorktown Heights, N.Y., which plans to introduce a version of its commercial speech synthesizer that incorporates the new technology.
It is also one of the hardest problems to solve, says Sundaram, who has spent months tweaking his laugh synthesizer. And the sound? Mirthful, but still machine-made.
"Laughter," he says, "is a very, very complex process."
The quest for expressive speech synthesis - melding acoustics, psychology, linguistics and computer science - is driven primarily by a grim fact of electronic life: The computers that millions of us talk to every day as we look up phone numbers, check portfolio balances or book airline flights might be convenient but, boy, can they be annoying.
Commercial voice synthesizers speak in the same perpetually upbeat tone whether they're announcing the time of day or telling you that your retirement account has just tanked. David Nahamoo, overseer of voice synthesis research at IBM, says businesses are concerned that as the technology spreads, customers will be turned off. "We all go crazy when we get some chipper voice telling us bad news," he says.
And so, in the coming months, IBM plans to roll out a new commercial speech synthesizer that feels your pain. The Expressive Text-to-Speech Engine took two years to develop and is designed to strike the appropriate tone when delivering good and bad news.
The goal, says Nahamoo, is "to really show there is some sort of feeling there." To make it sound more natural, the system is also capable of clearing its throat, coughing and pausing for a breath.
Scientist Juergen Schroeter, who oversees speech synthesis research at AT&T Labs, says his organization wants not only to generate emotional speech but to detect it, too.
"Everybody wants to be able to recognize anger and frustration automatically," says Julia Hirschberg, a former AT&T researcher now at Columbia University in New York.
For example, an automated system that senses stress or anger in a caller's voice could automatically transfer a customer to a human for help, she says. The technology also could power a smart voice mail system that prioritizes messages based on how urgent they sound.
Hirschberg is developing tutoring software that can recognize frustration and stress in a student's voice and react by adopting a more soothing tone or by restating a problem. "Sometimes, just by addressing the emotion, it makes people feel better," says Hirschberg, who is collaborating with researchers at the University of Pittsburgh.
So, how do you make a machine sound emotional?
Nick Campbell, a speech synthesis researcher at the Advanced Telecommunications Research Institute in Kyoto, Japan, says it first helps to understand how the speech synthesis technology most people encounter today is created.
The technique, known as "concatenative synthesis," works like this: Engineers hire human actors to read into a microphone for several hours. Then they dice the recording into short segments. Measured in milliseconds, each segment is often barely the length of a single vowel.
When it's time to talk, the computer picks through this audio database for the right vocal elements and stitches them together, digitally smoothing any rough transitions.
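For readers who think in code, the pick-and-stitch idea can be sketched in a few lines of Python. This is a toy illustration only, not any lab's actual engine: the names `DATABASE`, `crossfade_join` and `synthesize` are invented for the example, the "recordings" are made-up lists of numbers, and a simple linear crossfade stands in for the digital smoothing. Real systems search far larger databases and choose among many candidate segments.

```python
from typing import Dict, List

# Hypothetical audio database: unit label -> recorded samples (toy values).
DATABASE: Dict[str, List[float]] = {
    "a": [1.0, 1.0, 1.0, 1.0],
    "b": [0.0, 0.0, 0.0, 0.0],
}


def crossfade_join(left: List[float], right: List[float], overlap: int) -> List[float]:
    """Join two segments, blending `overlap` samples at the seam."""
    overlap = min(overlap, len(left), len(right))
    if overlap == 0:
        return left + right
    blended = []
    for i in range(overlap):
        w = (i + 1) / (overlap + 1)  # weight shifts from left segment to right
        blended.append((1 - w) * left[len(left) - overlap + i] + w * right[i])
    return left[: len(left) - overlap] + blended + right[overlap:]


def synthesize(units: List[str], database: Dict[str, List[float]], overlap: int = 2) -> List[float]:
    """Look up each requested unit and stitch the segments together in order."""
    output: List[float] = []
    for label in units:
        segment = database[label]  # raises KeyError if the unit was never recorded
        output = crossfade_join(output, segment, overlap) if output else list(segment)
    return output


waveform = synthesize(["a", "b"], DATABASE)
```

In this sketch the two four-sample segments overlap by two samples, so the stitched result is six samples long and slides from the first segment's level to the second's instead of jumping.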
Commercialized in the 1990s, concatenative synthesis has greatly improved the quality of computer speech, says Campbell. And some companies, such as IBM, are going back to the studio and creating new databases of emotional speech from which to work.
But not Campbell.
"We wanted real happiness, real fear, real anger, not an actor in the studio," he says.
So, under a government-funded project, he has spent the past four years recording Japanese volunteers as they go about their daily lives.
"It's like people donating their organs to science," he says.
His audio archive, with about 5,000 hours of recorded speech, holds samples of subjects experiencing everything from earthquakes to childbirth, from arguments to friendly phone chat. The next step will be using those sounds in a software-based concatenative speech engine.
If he succeeds, the first customers are likely to be Japanese auto and toy makers, who want to make their cars, robots and other gadgets more expressive. As Campbell puts it, "Instead of saying, 'You've exceeded the speed limit,' they want the car to go, 'Oy! Watch it!'"