URBANA, Ill. -- A complex computer system at the University of Illinois malfunctions because of bugs, and right away people start blaming Ravi K. Iyer and his colleagues, which is somewhat embarrassing for the director of the university's computer reliability center.
It is true that Mr. Iyer and his associates do deliberately corrupt data and otherwise introduce bugs into computers, but they don't do it perniciously. They are trying to learn how to monitor computers, predict abnormalities and find the quickest way to recovery.
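The kind of deliberate corruption the researchers practice is often called fault injection. A minimal sketch of the idea, with all names and the parity-based detector chosen here purely for illustration (the article does not describe the center's actual tools), might look like this:

```python
import random

def inject_bit_flip(data: bytearray, rng: random.Random) -> int:
    """Flip one random bit in place, simulating a transient memory fault."""
    bit = rng.randrange(len(data) * 8)
    data[bit // 8] ^= 1 << (bit % 8)
    return bit

def parity_ok(data: bytes, expected_parity: int) -> bool:
    """Toy monitor: compare the XOR of all bytes against a stored parity."""
    parity = 0
    for b in data:
        parity ^= b
    return parity == expected_parity

rng = random.Random(42)
data = bytearray(b"sensor telemetry record")
parity = 0
for b in data:
    parity ^= b            # record the healthy state first

inject_bit_flip(data, rng)                      # corrupt the data deliberately
detected = not parity_ok(bytes(data), parity)   # did the monitor notice?
print(detected)                                 # True: a single flipped bit always changes the parity
```

By repeating such experiments many times and logging what the monitor sees, a researcher can measure how quickly and reliably faults are caught, which is the spirit of the work described here.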
Any bugs afflicting the university's computer system aren't part of Mr. Iyer's studies, he has assured sometimes skeptical colleagues. "We put monitors on the system, and people immediately thought our monitors were causing problems," he said. "But we took our monitors out of the system, and it's failing without them."
Failing computer systems, the bane of high technology, threaten to become even more daunting as computer chips get smaller, more compact and more ubiquitous. With such a prospect, the work of the university's Center for Reliable and High Performance Computing grows in significance.
Established by the National Aeronautics and Space Administration to help assure the reliability of NASA's spaceflight computers, the center's work is turning more and more to aiding private enterprise.
A general goal is to develop means of diagnosing computer system weaknesses and to point to cost-effective ways of boosting reliability. Also, center researchers look for indicators in the mountain of data generated by computers that might show when failure is likely.
As they describe their work, the researchers often fall into medical analogies. Just as physicians look closely at a person's medical history before making a diagnosis, computer engineers look at a machine's history and patterns of failure when trying to determine what is most likely wrong with a system.
For NASA, the computer reliability problem was clear from the start. Engineers had to test computers on the ground so they would be reliable in space. But even when reliability is paramount, it is never assured. NASA's system of redundant computers for guiding the space shuttle proved an embarrassment on the Columbia's maiden voyage a decade ago.
A fluky malfunction that put the computers' timing functions a fraction of a second out of sync meant the machines couldn't communicate with each other and the blastoff was scrubbed.
After checking hardware and software, NASA engineers were stymied. The solution proved amazingly simple: turn off the machines and turn them on again. Synchrony was restored and the mission commenced.
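The failure mode and its fix can be sketched abstractly: redundant machines agree only while their clocks stay within a tolerance, and a restart re-initializes every clock to the same point. The tolerance value and function names below are hypothetical, chosen only to illustrate the logic:

```python
SYNC_TOLERANCE_S = 0.001  # hypothetical: maximum skew before the machines stop agreeing

def in_sync(clocks: list[float]) -> bool:
    """Redundant computers can communicate only if all clocks agree within tolerance."""
    return max(clocks) - min(clocks) <= SYNC_TOLERANCE_S

def power_cycle(clocks: list[float]) -> list[float]:
    """Turning the machines off and on starts every clock from the same epoch."""
    return [0.0 for _ in clocks]

clocks = [0.000, 0.000, 0.040, 0.000]  # one machine a fraction of a second off
print(in_sync(clocks))                 # False: the launch is scrubbed
clocks = power_cycle(clocks)
print(in_sync(clocks))                 # True: synchrony restored, mission commences
```

The toy model also shows why the bug was so hard to find: nothing is broken in any single machine, only in the relationship between them.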
"We still face such problems," Mr. Iyer said. "How does the buyer of a fault-tolerant system validate its reliability?"
Using a fault-tolerant computer donated by Tandem Computers Inc., one designed to keep running even when some components fail, researchers at Mr. Iyer's center are doing pioneering work using Unix open-system software.
Many reliability features are straightforward. The mainframe has three fans to maintain its proper temperature. When one is removed, the other two automatically speed up to take up the slack. When the fan is back in service, its partners resume normal speed.
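The fan behavior is a small, concrete example of graceful degradation: the surviving units share the load so total cooling stays constant. A sketch of that control rule, with the RPM figures invented for illustration (the article gives no actual specifications), could read:

```python
NOMINAL_RPM = 3000
REQUIRED_AIRFLOW = 3 * NOMINAL_RPM  # assume airflow scales with total fan speed

def fan_speeds(working_fans: int) -> list[int]:
    """Divide the required airflow evenly among the fans still in service."""
    if working_fans == 0:
        raise RuntimeError("no cooling available")
    return [REQUIRED_AIRFLOW // working_fans] * working_fans

print(fan_speeds(3))  # [3000, 3000, 3000] with all fans in service
print(fan_speeds(2))  # [4500, 4500] when one fan is removed
```

When the third fan returns, the controller simply recomputes the even split and its partners drop back to nominal speed.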
Also, it is possible to pull a logic circuit board out of the computer without taking the system out of service.
"This is a complex issue because the software has to be aware when a board goes offline," said Robert Horst, a Tandem systems architect who is doing graduate study at the university. "When you put a new board in, it has to be brought up to speed with the others on the fly. That is a tough problem."
Though software logic glitches are always a problem with computers, hardware problems are becoming more prominent as well, Mr. Iyer said. Packing more electronics into a smaller, tighter space on a single chip makes the system more vulnerable to such esoteric threats as cosmic rays, he said.
Also, the environment in which computer chips must function is becoming more harsh. Automobiles are being controlled more and more by chips that must regularly withstand extremes of heat and cold. Automated devices on factory floors may operate where flying sparks are part of the landscape.
"Power losses and power surges are a frequent cause of problems," said Mr. Horst. "Our systems carry backup batteries to take over in case of a power loss."
University researchers are developing computer systems to let computer designers simulate how their machines might work and help them spot weak points.