Computers enlisted to decipher census data

Method: The Census Bureau has high expectations for the handwriting recognition software that has replaced the human eye.

April 24, 2000 | By Ian Austen, New York Times News Service

In his 30-year career, Richard Taylor, an information technology architect with Lockheed Martin Mission Systems, has taken on some extremely complex jobs.

He has helped turn the FBI's fingerprint files into an electronic database, designed systems for airlines that make sure there is a gate free for every arriving flight, and developed an archive of high-quality digital images for the National Gallery of Art.

But nothing that Taylor has done approaches his current task: equipping computers to make sense of the scribbles on 1.5 billion pages from U.S. census forms, which were sent to more than 121 million households.

The 2000 census will be the first to use computer software rather than human eyes to decipher and record information. Taylor called it "the biggest data-capture project in history," the Super Bowl of handwriting recognition.

Because the Census Bureau must report its findings by the end of the year, Taylor's system has about 100 days to do its job, starting with the official Census Day, April 1.

When you consider that many people cannot even read their own handwriting, the assignment seems that much more daunting.

In the last two censuses, in 1980 and 1990, getting those words off paper and into computers started with automated cameras built by the Census Bureau.

Mechanical arms in the cameras opened the forms and smoothed the pages flat so they could be photographed for microfilm. Computers then scanned the microfilm to make note of which boxes had been checked on the forms. Seven thousand clerks, working around the clock at seven processing centers, looked at the forms and entered the remaining handwritten information into computers.

When the 1990 census ended, the Census Bureau was determined to have computers do more of the job themselves, said J. Gary Doyle, the bureau's systems integration manager.

"We were pretty sure it would work," Doyle said. "It was just a question of how well it would work. But even if we got, say, 50 percent with character recognition, that's 50 percent less keyers needed."

But the Census Bureau will not be satisfied with a 50 percent accuracy rate from Lockheed Martin Mission Systems, a unit of Lockheed Martin Corp., which won the contract in 1997. The bureau wanted 98 percent accuracy, about 3 percentage points higher than the best accuracy rate people can achieve when typing in the information.

In 1997, Taylor had his choice of about seven handwriting recognition programs. All of them operated in about the same way, by using mathematical probabilities and pattern matching to figure out whether, for example, an individual letter was a C or an O that had not been quite closed.

He settled on recognition software produced by a German company, CGK Computer Gesellschaft Konstanz, partly because Lockheed Martin's engineers had used it in other projects.

But to reach the Census Bureau's accuracy target, character recognition alone was not going to be enough. Taylor's group decided that it had to develop custom software to identify whole words.

Eventually, the software engineers took three approaches. The simplest one was to develop vast dictionaries of place names, street names and occupations from past census data and from sources like the U.S. Postal Service.
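The dictionary approach can be illustrated with a short sketch. The article does not describe how Lockheed Martin's software matched recognized text against its dictionaries, so everything here is an assumption for illustration: the tiny street-name list, the function name correct_with_dictionary, the 0.75 similarity cutoff, and the use of Python's standard difflib module for approximate matching.

```python
import difflib

# Hypothetical mini-dictionary. The real lists were far larger and were built
# from past census data and sources like the U.S. Postal Service.
STREET_NAMES = ["MAIN", "MAPLE", "MARKET", "OAK", "WASHINGTON"]


def correct_with_dictionary(raw, dictionary, cutoff=0.75):
    """Return the dictionary entry closest to raw, or None if nothing is close.

    The cutoff is an assumed similarity threshold between 0 and 1.
    """
    matches = difflib.get_close_matches(raw.upper(), dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# A recognizer that misreads the "L" in MAPLE as an "I" still lands on MAPLE.
print(correct_with_dictionary("MAPIE", STREET_NAMES))
# A string that resembles no entry is left unresolved.
print(correct_with_dictionary("ZKXQ", STREET_NAMES))
```

A string with no close match returns None rather than a forced guess, which in a system like this would presumably be passed along for a person to key in.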

A second piece of software cross-checks data that appear more than once. For example, it makes sure that the age given corresponds to the date of birth entered elsewhere on the form.
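A cross-check of this kind might look like the following sketch. The function names, the tolerance parameter and the choice of Census Day as the reference date are assumptions made for illustration, not details of the actual census software.

```python
from datetime import date

# Official Census Day for the 2000 census, used here as the assumed reference date.
CENSUS_DAY = date(2000, 4, 1)


def age_on(birth, on=CENSUS_DAY):
    """Age in whole years on the given day."""
    years = on.year - birth.year
    # Subtract one year if the birthday has not yet occurred in the reference year.
    if (on.month, on.day) < (birth.month, birth.day):
        years -= 1
    return years


def age_matches_birth_date(reported_age, birth, tolerance=0):
    """True if the reported age agrees with the date of birth.

    tolerance is a hypothetical allowance, in years, for forms filled in
    shortly before or after Census Day.
    """
    return abs(age_on(birth) - reported_age) <= tolerance


print(age_matches_birth_date(30, date(1970, 4, 1)))  # fields agree
print(age_matches_birth_date(80, date(1970, 4, 1)))  # likely a misread digit
```

When the two fields disagree, at least one of them was probably misread, so a check like this gives the system a reason to lower its confidence in both.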

The final piece of word-reading software uses a more arcane approach called trigram analysis. The software team created a list of every possible three-letter combination that can be made from the alphabet. Those triplets were then compared with a database containing all the words in a large English-language dictionary.

From that comparison came tables that indicate the likelihood of any particular letter combination appearing at the beginning, in the middle or at the end of a word. Using that, the software compares the triple-letter combinations within words (identified as words by the other recognition software) and adjusts the accuracy confidence ratings accordingly.

For example, Taylor said that if the software scanned his surname, it would increase its assessment of the recognition software's accuracy after finding that "tay" was followed by "ayl." By contrast, the system might reject any combination of letters that produced the result "zkx."
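The trigram idea can be sketched in a few lines. This is a simplified illustration, not the census software: the seven-word list stands in for a large English-language dictionary, the function names are invented, and the score is simply the share of a candidate's three-letter combinations that have been seen before in the same position (start, middle or end of a word).

```python
from collections import Counter

# Stand-in for a large English-language dictionary (assumed for illustration).
WORDS = ["taylor", "tailor", "baylor", "naylor", "sailor", "layer", "player"]


def trigrams(word):
    """Yield (position, trigram) pairs for each three-letter window in word."""
    w = word.lower()
    last = len(w) - 3
    for i in range(last + 1):
        if i == 0:
            position = "start"
        elif i == last:
            position = "end"
        else:
            position = "middle"
        yield position, w[i:i + 3]


def build_table(words):
    """Count how often each trigram appears in each position."""
    table = Counter()
    for word in words:
        table.update(trigrams(word))
    return table


def plausibility(candidate, table):
    """Fraction of the candidate's trigrams already seen in the same position."""
    grams = list(trigrams(candidate))
    if not grams:
        return 0.0
    seen = sum(1 for gram in grams if table[gram] > 0)
    return seen / len(grams)


TABLE = build_table(WORDS)
print(plausibility("taylor", TABLE))  # "tay" then "ayl" are both known
print(plausibility("tazkxr", TABLE))  # contains the implausible "zkx"
```

A real system would work with likelihoods rather than a simple seen-or-unseen test, and would fold the score into the recognition software's existing confidence rating rather than replace it.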

While the software group was working, another team developed a system to minimize the amount of paper handling at the four data processing centers built by Lockheed Martin Mission Systems and the Census Bureau. At their busiest, the centers will each handle about 17 tractor-trailer loads of forms daily.

Each form is run across a bar-code reader to confirm that it was returned by the household that received it, examined by people for things like blobs of peanut butter that might gum up the system, and then fed into oven-size scanners made by Eastman Kodak. Each scanner handles 23,400 pages a day.
