In a few years, when you curse at your computer, don't be surprised if it tries to obey your command.
After decades of work, computer scientists are coming closer to realizing one of the discipline's most sought-after goals: creating computers that respond to natural human speech.
There are already a few products that let computers "understand" the spoken word. Most of them understand only one or two people who have taken time to "train" the system to understand their style of speech. But more advanced systems, ones that can understand anyone speaking to them without specific training, appear on the verge of moving from laboratory to market.
AT&T announced early this month that it planned to replace thousands of telephone operators over the next few years with computers that can understand the instructions of people trying to place a collect or person-to-person call.
For the past several weeks, Apple Computer has been publicly demonstrating a Macintosh-based speech recognition system dubbed Casper, after the friendly cartoon ghost, that lets a Macintosh personal computer accept spoken commands from any user.
The system, which the company claims has a vocabulary of 40,000 to 50,000 words that can activate pre-written command "scripts," reportedly may be offered for sale as early as the end of this year.
And researchers at International Business Machines Corp. are working on advanced speech recognition systems for personal computers and work stations to allow them, among other things, to take dictation. SRI International in Menlo Park, Calif., is developing its own system that could be used to take a wide variety of dictation.
Speech recognition, along with pen control, is seen by experts as playing a crucial role in the development of personal digital assistants, hand-held communications and information-retrieval devices. These devices could respond to spoken commands and would render keyboards largely unnecessary.
With advances like these, it's easy to see why many people might believe we're only a few years away from the computers of science fiction, like HAL of "2001," which routinely conversed with its human companions. But researchers say that, given the complexities of the problem, such a scenario is still decades away.
"Speech recognition capability is coming along, as far as discernment of sounds," said David Roe, who's been involved in speech recognition research for a decade at AT&T's Bell Laboratories. "What's hard is language understanding, what sentence is likely to be made out of those sounds."
Speech recognition has much in common with two other vexing problems in computer science: handwriting recognition and computer vision systems. All three fields require computers to analyze huge amounts of data at high speed, searching for meaningful patterns in the enormous digital clutter.
The difficulty in achieving the results of science fiction stories is that even the most powerful computers, which can calculate far faster than any human, are a poor match for the human brain when it comes to the complex task of making sense out of the constant stream of patterns in speech.
The first step in speech recognition is what researchers call "feature extraction." Essentially, it involves converting the sound waves entering a microphone into digital form, compressing the raw stream of bits into a more compact, manageable size, and extracting meaningful spoken sounds while rejecting background noises.
"There's so much information that's irrelevant," said David Nahamoo, manager of speech recognition modeling at IBM's Thomas J. Watson Research Center in Hawthorne, N.Y. "Pitch, for example, doesn't mean anything in our current system."
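For readers who want a concrete picture, the feature-extraction step can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the method used by IBM or any other laboratory named here: the digitized sound is cut into short fixed-size frames, each frame is reduced to a single energy value, and frames quieter than a threshold are discarded as background noise. The function names, frame size and threshold are all invented for the example.

```python
def frame_energies(samples, frame_size):
    """Cut the samples into fixed-size frames and return each frame's mean energy."""
    energies = []
    for start in range(0, len(samples) - frame_size + 1, frame_size):
        frame = samples[start:start + frame_size]
        # Mean of the squared amplitudes: one number stands in for the whole frame.
        energies.append(sum(s * s for s in frame) / frame_size)
    return energies


def speech_frames(samples, frame_size=4, threshold=0.01):
    """Return the indices of frames loud enough to be treated as speech.

    The threshold is an arbitrary illustrative value, not a tuned one.
    """
    energies = frame_energies(samples, frame_size)
    return [i for i, energy in enumerate(energies) if energy > threshold]


# Synthetic signal: one quiet frame, two loud frames, one quiet frame.
quiet = [0.01, -0.01, 0.01, -0.01]
loud = [0.5, -0.5, 0.5, -0.5]
signal = quiet + loud + loud + quiet

print(speech_frames(signal))  # [1, 2]
```

A working system would extract far richer features than raw loudness, but the shape of the task is the same: shrink the stream, keep what matters, drop the rest.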
Once the computer extracts meaningful sounds, it must then assemble them into "phonemes," the basic components of English speech from which all words are built. This is generally done by matching successions of sounds with a library of phonemes.
While the English language consists of 46 to 48 phonemes, depending on how they are defined, individual differences in pronunciation, from subtle variations to strong accents, complicate matters. Speech recognition systems built to understand many speakers may actually have to store thousands of phoneme variations in their libraries to handle these differences.
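The library-matching idea can be shown with a small Python sketch. It assumes, purely for illustration, that each sound has already been boiled down to a pair of numbers and that the library stores a few such pairs for every phoneme, one per pronunciation variant. The labels, the numbers and the function name are hypothetical; real libraries are far larger and the matching far more elaborate.

```python
import math

# Hypothetical library: each phoneme label maps to several stored variants,
# standing in for different speakers' pronunciations.
PHONEME_LIBRARY = {
    "AA": [(0.9, 0.1), (0.8, 0.2)],
    "IY": [(0.1, 0.9), (0.2, 0.8)],
    "S": [(0.5, 0.5)],
}


def closest_phoneme(features, library=PHONEME_LIBRARY):
    """Return the label of the stored variant nearest to the given features."""
    best_label, best_distance = None, math.inf
    for label, variants in library.items():
        for variant in variants:
            distance = math.dist(features, variant)  # straight-line distance
            if distance < best_distance:
                best_label, best_distance = label, distance
    return best_label


print(closest_phoneme((0.85, 0.15)))  # AA
print(closest_phoneme((0.15, 0.85)))  # IY
```

Storing several variants per phoneme is what lets one library serve many speakers, and it is also why such libraries can grow into the thousands of entries.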
Once the phonemes are extracted, the computer begins its most difficult process: assembling them into precise English words and sentences the computer can then act upon according to pre-determined rules. This "grammar processing," Mr. Roe said, involves one of speech recognition's most persistent problems, creating a working mathematical model of language.
"That is a tough nut to crack," he said. The language model, for example, must know what words are likely to follow others in English, to help differentiate among similar sounds -- but it's still an imprecise art.
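One simple way to capture which words are likely to follow others is to count word pairs in sample text and prefer the pair seen most often. The Python sketch below does just that. It is an assumption-laden toy, not the model Mr. Roe's group uses: the four training sentences are invented, and the function names are made up for the example.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count how often each word follows each other word in the sentences."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for previous, following in zip(words, words[1:]):
            counts[previous][following] += 1
    return counts


def pick_word(counts, previous, candidates):
    """Choose the candidate seen most often after the previous word.

    On a tie, the candidate listed first wins.
    """
    return max(candidates, key=lambda word: counts[previous][word])


# Invented training text for illustration only.
corpus = [
    "place a collect call",
    "place a call to boston",
    "make a call",
    "the coal is black",
]
model = train_bigrams(corpus)

# "call" and "coal" sound alike; the preceding word helps decide.
print(pick_word(model, "a", ["coal", "call"]))    # call
print(pick_word(model, "the", ["coal", "call"]))  # coal
```

With only four sentences the counts are crude, which hints at why building a dependable model of the whole language remains so hard.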