Spiders of 'Google' spin wide web of information

Results: The popular search engine uses software programs that "crawl" along the Internet's many trails in quest of pertinent documents.

August 26, 2004 | By Robert S. Boyd, KNIGHT RIDDER/TRIBUNE

WASHINGTON - What computer magic makes it possible for Google to pick out, in a fraction of a second, the information you want from the incredible mass of material heaped on the Web?

To answer users' queries, the system founded six years ago by two Stanford University graduate students has scanned and stored nearly 4.3 billion Web pages. If all those documents were printed, they'd make a stack of paper 300 miles high.

Some details of Google's methods are closely held trade secrets, but the broad outlines of how Google and its competitors work are well known to computer scientists.

In computer jargon, Google's "search engines" use robotic "spiders" - special software programs - that "crawl" continuously along the myriad trails of the World Wide Web, "harvesting" documents as they go. A separate piece of software builds an index of every word the spiders find.

When a user submits a query - such as "Mount Everest" or "Bill Clinton" - the search engine checks the index, fetches each document that contains those words, sorts them by relevance and returns the most pertinent ones first.
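The lookup step described above can be pictured with a short sketch. It assumes a tiny hand-made index in which each word maps to the pages that contain it; the index contents, the page names and the function name `find_pages` are invented for illustration and are not Google's code.

```python
# Toy index: each word maps to the set of pages that contain it.
# The words and page names are made up for this example.
INDEX = {
    "mount": {"everest.html", "rushmore.html", "climbing.html"},
    "everest": {"everest.html", "climbing.html", "nepal.html"},
}

def find_pages(query, index):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    # Start with the pages for the first word, then keep only those
    # that also appear in the list for each remaining word.
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

matches = find_pages("Mount Everest", INDEX)
print(sorted(matches))
```

Only pages that appear on the lists for both "mount" and "everest" survive; a real engine would then sort those survivors by relevance.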

"For Google, the major operations are crawling, indexing and sorting," the system's founders, Sergey Brin and Lawrence Page, wrote in their original paper describing the system.

To improve the results, Google uses a patented method called "PageRank," a sort of popularity contest that tries to determine which documents are likely to be most valuable to the user.

For each page, the PageRank system counts the number of other pages that are linked, or connected, to it. In essence, Google interprets a link from Page A to Page B as a "vote" by Page A for Page B.

In addition to the number of votes a page receives, the system analyzes the status of the pages that cast the votes. Popular pages weigh more heavily in the calculation.
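The voting idea can be sketched in a few lines. This is a simplified, textbook-style version of the calculation, not Google's production method: the six-page link map, the function name `page_rank`, the damping value of 0.85 and the number of rounds are all assumptions chosen for illustration.

```python
def page_rank(links, damping=0.85, rounds=50):
    """Estimate a score per page; links maps a page to the pages it links to."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    pages = sorted(pages)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # everyone starts equal
    for _ in range(rounds):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its vote evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank

# Hypothetical link map: A and C both link to B, B links to D, E links to F.
LINKS = {"A": ["B"], "C": ["B"], "B": ["D"], "E": ["F"]}
ranks = page_rank(LINKS)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy map, B collects two votes and so outscores F, which collects one. D and F each collect a single vote, but D's comes from the well-cited B, so D ends up ahead of F.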

"Pages that are well cited from many places around the Web are worth looking at," Brin and Page wrote.

Google uses other tricks as well to determine a document's ranking. Words in a special typeface, bold, underlined or all capitals, get extra credit. Words occurring close together - such as "George" and "Bush" - count for more than those that are far apart. Finally, Google returns the documents that match a user's query, ranked in order of relevance as determined by their PageRank and these other signals.
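The closeness rule lends itself to a small sketch. The scoring formula here is made up purely to show the idea (a bonus of 1.0 when the two words sit side by side, shrinking as the gap grows); Google's actual weights are not public, and the function names are hypothetical.

```python
def min_distance(words, first, second):
    """Smallest gap, in words, between any occurrence of the two terms."""
    first_pos = [i for i, w in enumerate(words) if w == first]
    second_pos = [i for i, w in enumerate(words) if w == second]
    if not first_pos or not second_pos:
        return None
    return min(abs(a - b) for a in first_pos for b in second_pos)

def proximity_bonus(text, first, second):
    """Made-up bonus: 1.0 for adjacent terms, smaller as they drift apart."""
    words = text.lower().split()
    gap = min_distance(words, first.lower(), second.lower())
    if gap is None:
        return 0.0  # one of the terms is missing entirely
    return 1.0 / max(gap, 1)

close = "president george bush spoke today"
far = "george washington crossed the river long before any bush was planted"
print(proximity_bonus(close, "George", "Bush"))
print(proximity_bonus(far, "George", "Bush"))
```

The first sentence, where the two words are neighbors, earns a larger bonus than the second, where eight words separate them.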

Here's how Google's stable of spiders, known as GoogleBots, go about their business:

A spider visits every Web page that isn't marked private, reads it and stores it in compressed form. The spider looks for any links that the page might contain to other pages. It follows those links to pages it hasn't seen before and continues the process until there are no more links to visit.
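A spider's route can be imitated on a miniature, in-memory "Web." The sketch below is an assumption-laden toy: the six pages, their links, the private flag and the function name `crawl` are invented, and a real spider would fetch pages over the network rather than look them up in a dictionary.

```python
from collections import deque

# Toy stand-in for the Web: page name -> (is_private, links on that page).
WEB = {
    "home": (False, ["news", "sports", "diary"]),
    "news": (False, ["home", "weather"]),
    "sports": (False, ["news"]),
    "weather": (False, []),
    "diary": (True, ["secrets"]),
    "secrets": (False, []),
}

def crawl(start, web):
    """Visit pages breadth-first, following links to pages not yet seen."""
    seen = {start}
    queue = deque([start])
    stored = []
    while queue:
        page = queue.popleft()
        private, links = web[page]
        if private:
            continue  # pages marked private are neither read nor followed
        stored.append(page)  # stands in for reading and storing the page
        for link in links:
            if link in web and link not in seen:
                seen.add(link)
                queue.append(link)
    return stored

print(crawl("home", WEB))
```

The crawl stops on its own once the queue of unvisited links is empty. Because "diary" is marked private, the spider never reads it and so never discovers "secrets," which is reachable only from there.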

While the spider is chugging along, an "indexer" is creating a catalog or dictionary of every word it encounters, except for short words such as "the," "in" or "where." For each word, the system keeps a list of all the pages in which that word appears.
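The indexer's catalog can be sketched as a dictionary from each word to the set of pages containing it. The two sample pages, the three-word skip list (taken from the examples above) and the function name `build_index` are illustrative assumptions, not a description of Google's actual indexer.

```python
from collections import defaultdict

# Short words the indexer skips; only the three examples named above.
STOP_WORDS = {"the", "in", "where"}

def build_index(pages):
    """Map each word to the set of pages in which it appears."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            word = word.strip('.,;:!?"()')  # drop surrounding punctuation
            if word and word not in STOP_WORDS:
                index[word].add(name)
    return index

# Made-up sample pages.
PAGES = {
    "page1": "The carnival is in town.",
    "page2": "Where is the carnival?",
}
index = build_index(PAGES)
print(sorted(index["carnival"]))
```

Looking up a word then costs a single dictionary access, no matter how many pages were scanned to build the catalog.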

The lists can be extremely long, because some words appear in millions of documents. A search for "carnival" returns 5.6 million entries, far more than anyone could possibly use. The combination of "George" and "Bush" gets 7.4 million hits.

HOW GOOGLE WORKS

How this popular search engine finds what you are looking for on the Internet:

1. Spider software "crawls" the Web; finds and fetches pages; follows links to other pages.

2. Indexer sorts words on every page spider finds; stores index of words in huge database.

WHEN YOU SUBMIT A QUERY

Sample search: Mount Everest

3. Search engine checks index, gets each page that contains "Mount Everest."

4. Sorts pages using "PageRank"; decides which are likely to be most valuable. Returns most pertinent pages first.

SOURCE: KNIGHT RIDDER WASHINGTON BUREAU, KNIGHT RIDDER/TRIBUNE

Baltimore Sun Articles