April 17, 1998

Sifting through the jumble: A Cornell researcher finds a new way of retrieving just the right information from the web

The World Wide Web is an endless source of information, but with literally millions of pages posted by everyone from governments, universities and corporations to sixth-graders and conspiracy theorists, it's getting harder and harder to find precisely the right information.

Now a Cornell University researcher has come up with a method of searching the web that can return a list of the most valuable sites on a given topic, as well as a list of sites that index the subject. Early tests of the method have produced highly focused lists of sites on many topics, often comparable to lists carefully compiled by web search experts.

The method was developed by Jon Kleinberg, Cornell professor of computer science. An evaluation of the method was presented at the seventh International World Wide Web Conference held April 14-18 in Brisbane, Australia, in a paper by Kleinberg, David Gibson of the Department of Computer Science, University of California at Berkeley, and several IBM researchers.

Popular web-searching tools, known as engines, such as Yahoo! and AltaVista, work by hunting for keywords in the text of web pages. On some topics this can return hundreds or even thousands of pages. The algorithm (a set of rules specifying how to solve the problem) developed by Kleinberg instead works by analyzing the way web pages are linked to one another. The assumption behind this is that the most authoritative pages on a given subject will be those that are most often pointed to by other pages.

The web is annotated with "precisely the type of human judgment we need to identify authority," Kleinberg explains. "It almost says something about the way the web has evolved. I think it's about the way people link information in general, not just on the web."

Kleinberg's method does more than just identify pages with useful information about a topic, which he calls "authorities." The method also looks for pages that contain many links to pages with useful information on the topic, which he calls "hubs."

The best authorities, Kleinberg says, will be those that point to the best hubs, and the best hubs will be the ones that point to the best authorities. Kleinberg prevents this from becoming a circular definition by recalculating the relationship several times, each time moving closer to the ideal result.

He has written a search program using this technique called HITS (for Hyperlink-Induced Topic Search). HITS begins by conducting an ordinary text-based search on a topic using a search engine such as AltaVista. This collects a "root set" of about 200 pages that contain the entered keywords. It then expands the set to include all the pages linked to by pages in the root set. The expanded set might include from 1,000 to 3,000 pages.

From there on, text is ignored, and the application only looks at the way pages in the expanded set are linked to one another. The first time through, it identifies the pages that are pointed to most often by other pages, and assigns them a score, or "weight," indicating that they are more likely to be authorities. At the same time it notes the pages that contain more links to other pages and gives them more weight as hubs.

This calculation is repeated several times. Each time the program gives more authority weight to sites that link to sites with more hub weight, and more hub weight to sites that link to sites with more authority weight. Ten repetitions, Kleinberg says, are enough to return surprisingly focused lists of authorities and hubs.

The system overcomes several of the problems frequently identified with text-based searches. For example, at one time a text-based search for "Gates" didn't return the Microsoft Corp. home page because Microsoft chairman Bill Gates wasn't mentioned on the opening page. (He still isn't, but now his biography can be found by following the link "About Microsoft.") A search for "jaguar" returns a jumble of pages about cars, animals, the Jacksonville Jaguars NFL team, and the obsolete but still much-discussed Atari Jaguar computer.

In a case where a word represents more than one topic, Kleinberg's method automatically separates sites into "communities" of hubs and authorities, each representing one of the possible topics. Thus a HITS search on "jaguar" lists first a community of sites related to the Jaguar computer, because the number of web sites on this subject predominate. Further down, it listed communities relating to the football team and the car. Finally it finds sparse information relating to the animal, because this topic is simply not well represented on the web, Kleinberg says.

Communities also form when a topic is polarized: A search on "abortion" returns separate communities of pro-life and pro-choice sites, because the sites within each community link more densely to one other than to sites advocating an opposing view.

One disadvantage of the method, Kleinberg says, is that it doesn't always work for sharply focused queries. A search for "Netscape 4.04," for example, returns a general list of sites about web browsers.

The paper being presented in Brisbane is titled "Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text." Another paper by Kleinberg, "Authoritative Sources in a Hyperlinked Environment," was published in the Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. A related paper, "Inferring Web Communities from Link Topology," by Kleinberg, Gibson and Prabhakar Raghavan of the IBM Almaden Research Center, appears in the Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 1998.

Kleinberg developed the method while working as a visiting scientist at IBM's Almaden Research Center, on leave from Cornell. IBM has applied for a patent on the algorithm.