Sites left unseen with Net searches

Engines: A study finds that even the best keep track of only a fraction of the World Wide Web's content.

July 12, 1999 | By Ashley Dunn, LOS ANGELES TIMES

If searching the World Wide Web for that one nugget of information already seems like a bad trip into a quagmire of data, Internet researchers have a bit of bad news for you -- the situation is only getting worse.

Even the most comprehensive search engine today is aware of no more than 16 percent of the estimated 800 million pages on the Web, according to a study published last week in the scientific journal Nature. Moreover, the gap between what is posted on the Web and what is retrievable by the search engines is widening fast.

"The amount of information being indexed [by commonly used search engines] is increasing, but it's not increasing as fast as the amount of information that's being put on the Web," said Steve Lawrence, a researcher at NEC Research Institute in Princeton, N.J., one of the study's authors.

The findings are important because they raise the specter that the Internet might lead to a backward step in the distribution of knowledge at a time of technological revolution: The breakneck pace at which information is added to the Web might actually mean that more information is lost to easy public view than made available.

The study also underscores a little-understood feature of the Internet. While many users believe that Web pages are automatically available to the search programs employed by such sites as Yahoo, Excite, and AltaVista, the truth is that finding, identifying and categorizing new Web pages requires a great expenditure of time, money and technology.

Lawrence and his co-author, fellow NEC researcher C. Lee Giles, found that most of the major search engines index less than 10 percent of the Web. Even combined, the major search engines have indexed only 42 percent of the Web, they found.

The rest of the Web -- trillions of bytes of data ranging from scientific papers to family photo albums -- exists in a kind of black hole of information, impenetrable by Web surfers unless they have the exact address of a given Web site. Even the pages that do end up indexed take an average of six months to be discovered by the search engines, Lawrence and Giles found.

The pace of indexing marks a striking decline from that found in a similar study conducted by the same researchers a year and a half ago.

At that time, they estimated the number of Web pages in the world at about 320 million. The most thorough search engine in that study, HotBot, covered about a third of all Web pages. Combined, the six leading search engines they surveyed covered about 60 percent of the Web.
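The two studies' figures can be put side by side with a little arithmetic. The sketch below is illustrative only: it uses the numbers reported in this article, treats "about a third" as exactly one-third, and uses hypothetical variable and function names that do not come from the study itself.

```python
# Figures as reported in the article; all are rounded estimates.
WEB_EARLIER = 320_000_000   # estimated Web pages in the earlier study
WEB_NOW = 800_000_000       # estimated Web pages in the new study


def pages_indexed(web_size, coverage):
    """Approximate number of pages covered, given a coverage fraction."""
    return round(web_size * coverage)


# Best single engine: "about a third" then (assumed to be 1/3), 16 percent now.
best_then = pages_indexed(WEB_EARLIER, 1 / 3)
best_now = pages_indexed(WEB_NOW, 0.16)

# All surveyed engines combined: about 60 percent then, 42 percent now.
combined_then = pages_indexed(WEB_EARLIER, 0.60)
combined_now = pages_indexed(WEB_NOW, 0.42)

# Pages that no surveyed engine had indexed.
unindexed_then = WEB_EARLIER - combined_then
unindexed_now = WEB_NOW - combined_now

print(f"Best engine: {best_then:,} pages then, {best_now:,} now")
print(f"Combined:    {combined_then:,} pages then, {combined_now:,} now")
print(f"Unindexed:   {unindexed_then:,} pages then, {unindexed_now:,} now")
```

Under these assumptions the absolute number of indexed pages rises between the two studies, while the number of unindexed pages rises much faster. That is consistent with Lawrence's remark that indexing is growing but not keeping pace with the Web.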

While Web surfers often complain about retrieving too much information from search engines, said Oren Etzioni, chief technology officer of the portal Go2net and a professor of computer science at the University of Washington, failing to capture the full scope of the Web would be to surrender one of the most powerful parts of the digital revolution -- the ability to seek and share diverse information across the globe.

Etzioni said the mushrooming size of the Web's audience makes the gulf between what is on the Web and what is retrievable increasingly important.

"There is a real price to be paid if you are not comprehensive," he said. "There may be something that is important to only 1 percent of the people. Well, you're talking about maybe 100,000 people."

For search engine companies, the findings of the report were unsurprising.

Kris Carpenter, director of search products and services for Excite, the third most popular search engine, said her company purposely ignores a large part of the Web not so much because of weak technology as because of a lack of consumer interest.

"Most consumers are overwhelmed with just the information that is out there," she said. "It's hard to fathom the hundreds of millions of pages. How do you get your head around that?"

Kevin Brown, director of marketing for Inktomi, whose search engine is used by the popular search sites HotBot, Snap and Yahoo, said that search companies have long been aware that they are indexing less and less of the Web. But he argued that users are seeking quality information, not merely quantity.

"There is a point of diminishing returns," he said. "If you want to find the best Thai food and there are 14,000 results, the question isn't how many returns you got, but what are the top 10."

Excite's Carpenter said the future of search engines lies not in bigger indexes but in more specialized ones, in which everything on a given subject, such as baseball, could be indexed and displayed to viewers.

"You may be covering a huge percentage of the Web, but you're presenting it in smaller slices," she said. "Lumping everything into one big, be-everything index would be incredibly overwhelming."

Pub Date: 07/12/99

Baltimore Sun Articles