Trolling the DEEP WEB

Most users only skim the surface of the Internet -- that's all they can access

May 28, 2001 | By Jackie Loohauis, KNIGHT RIDDER/TRIBUNE

This spring, a New York man became ill and signed on to his computer for some help.

He hunted on the Net for a site that would explain his symptoms. He found none on his first search-engine try. Using other search techniques, he found a medical site with an answer: He needed to get to the hospital immediately. He did, and had an emergency bypass operation minutes after his arrival.

The coronary patient was lucky that he could find what he needed in his Internet search without much delay, because the notion that the Net is a vast encyclopedia just waiting to be opened by search engines and directories such as AltaVista or Yahoo! is a myth.

Traditional search engines have access to only a fraction of 1 percent of what exists on the Web, according to BrightPlanet, an Internet search company, noting that as many as 550 billion pieces of content are hidden from most search engine scrutiny. These documents make up what is known as "The Deep Web."

Undercover and under-covered, the vast reservoir of the Deep Web is estimated to be 500 times larger than the "surface" World Wide Web. And, according to BrightPlanet, the Deep Web is the fastest-growing category of new information on the Net.

"There's a huge amount of information you can't find entirely or easily via a search engine," says Net search guru Gary Price, a librarian at George Washington University and co-author of the upcoming book "The Invisible Web" (CyberAge Books, $29.95). "The material on the Web is unorganized, very ephemeral. There's no rhyme or reason, no language control. The Web is a huge directory that's very hard to get at."

The biggest part of this invisible Web is information stored in databases - massive libraries of Web content unsearchable through such tools as Yahoo! and Google. You have to know they exist before you can search them.

One such database is the Government Printing Office listings at www.access.gpo.gov/sudocs/aces/aaces002.html. There are thousands more.

Other aspects of the Net remain hidden in deep waters, too.

"There are tons of things out there," says Tara Calishain of Researchbuzz.com, an online Internet guide. "Pay-content sources, lots of genealogy sources. The Library of Congress [www.loc.gov] has fabulous collections you can't find on AltaVista."

Several types of information are most elusive for search engines: bibliographies, multimedia files and information that comes in .pdf files (Adobe's portable document format). "News is dreadful," says Calishain. "Search engines don't cover it. It's tough to find breaking news."

Some sites, such as Amazon.com, have sections so far from the surface of their home pages that they, too, can be classified as Deep Web, says David Crane, a spokesman for search engine Google (www.google.com). An example, says Crane, is "the section that specifically offers a `portable compact disc player by Sony.' "

But the deepest Deep Web drop-off is in the category of government, and it's getting deeper.

"More and more city and county governments are putting their offerings on the Web. The State of Pennsylvania has a new crime reporting database [http://ucr.psp.state.pa.us/UCR/ComMain.asp], and more and more of that kind of thing is coming up now," says Calishain.

There are other reasons why these types of pages seemingly wear camouflage. For example, consider the confusion that everyday English can cause on a "natural language" search tool like Ask Jeeves (www.aj.com). Ask "How tall is a giraffe?" and you might get an answer. Ask "A giraffe is how tall?" and the search engine will see a different, perhaps unanswerable, question. Also, "You call it a `bubbler.' I call it a `water fountain,' " says Price.

Sometimes, portions of the Web remain invisible because they only surface for money.

John December, president of Milwaukee-based December Communications, says, "There is proprietary, for-sale content, or parts of the Web that are accessible by subscription like LexisNexis. People don't realize that not everything is free on the Web. They're shocked when they find out."

But the depths of the Web remain invisible largely because of the way search engines work. They get their information two ways.

First, a smattering of sites are indexed because authors submit their own Web pages for listings.

But search engines find most of their material by "crawling" or "spidering" the documents, following one hypertext link to another, like ripples in a pond. These ripples can obscure the waters for a searcher by providing too many indiscriminate results. Some Web designers even manipulate the system simply by invisibly coding one word over and over again on a page to get better play in the search-results listing.
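The link-following idea can be illustrated with a small sketch. This is not how any particular search engine is built; it is a toy breadth-first crawl over a made-up, in-memory "web," and every page name below is hypothetical. It shows why a page that nothing links to, such as a database record reached only by typing a query into a form, never turns up.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the pages it links to.
TOY_WEB = {
    "home.example": ["news.example", "shop.example"],
    "news.example": ["home.example", "archive.example"],
    "shop.example": ["shop.example/cd-players"],
    "shop.example/cd-players": [],
    "archive.example": [],
    # Assumption: this record is produced by a form query, so no page links to it.
    "records.example/query?id=42": [],
}

def crawl(web, seeds):
    """Breadth-first crawl: follow links outward from the seed pages."""
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        page = queue.popleft()
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen

indexed = crawl(TOY_WEB, ["home.example"])
hidden = sorted(set(TOY_WEB) - indexed)  # pages the crawl never reached
print("indexed:", sorted(indexed))
print("hidden:", hidden)
```

In this toy example the crawl reaches the five linked pages and misses the database record, which is the situation the Deep Web estimates describe on a far larger scale.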

The age of the document is also a factor. New documents are found from links with older documents, and those older pages with a larger number of references have a far greater chance of being indexed by a search engine.
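A rough way to picture the role of references is to count how many other pages point at each page. The sketch below uses a hypothetical set of pages and assumes, purely for illustration, that a page with more inbound links is more likely to be stumbled upon by a crawler; real search engines weigh many other factors.

```python
from collections import Counter

# Hypothetical link map: each page lists the pages it points to.
LINKS = {
    "old-portal.example": ["old-guide.example", "new-report.example"],
    "old-guide.example": ["old-portal.example"],
    "directory.example": ["old-portal.example", "old-guide.example"],
    "new-report.example": [],
    "new-orphan.example": [],
}

def inbound_counts(links):
    """Count how many other pages link to each page."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in targets:
            if target != source:  # ignore a page linking to itself
                counts[target] += 1
    return counts

counts = inbound_counts(LINKS)
# Most-referenced pages first; ties are broken alphabetically.
ranked = sorted(counts, key=lambda page: (-counts[page], page))
for page in ranked:
    print(page, counts[page])
```

Here the two older, well-referenced pages each have two inbound links, the new report has one, and the new page with no references has none, so a crawler following links has no path to it.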

Baltimore Sun Articles