The World Wide Web is an endless source of information, but with literally millions of pages posted by everyone from governments, universities and corporations to sixth-graders and conspiracy theorists, it's getting harder and harder to find precisely the right information.
Now a Cornell researcher has come up with a method of searching the web that can return a list of the most valuable sites on a given topic, as well as a list of sites that index the subject. Early tests have produced lists comparable to those carefully compiled by human experts.
The method was developed by Jon Kleinberg, Cornell professor of computer science. An evaluation was presented at the seventh International World Wide Web Conference held April 14-18 in Australia.
Popular web-searching tools, known as engines, such as Yahoo! and AltaVista, work by hunting for keywords in the text of web pages. On some topics this can return hundreds or even thousands of pages. The algorithm (a set of rules specifying how to solve the problem) developed by Kleinberg instead works by analyzing the way web pages are linked to one another. The assumption behind this is that the most authoritative pages on a given subject will be those that are most often pointed to by other pages.
The web is annotated with "precisely the type of human judgment we need to identify authority," Kleinberg explained. "It almost says something about the way the web has evolved. I think it's about the way people link information in general, not just on the web."
Kleinberg's method locates pages with useful information about a topic, which he calls "authorities," and pages that contain many links to pages with useful information on the topic, which he calls "hubs."
The best authorities, Kleinberg said, will be those that point to the best hubs, and the best hubs will be the ones that point to the best authorities. Kleinberg prevents this from becoming a circular definition by recalculating the relationship several times, each time moving closer to the ideal result.
He has written a search program called HITS (for Hyperlink-Induced Topic Search). HITS begins by conducting an ordinary text-based search using a search engine such as AltaVista. This collects a "root set" of about 200 pages that contain the entered keywords. It then expands the set to include all the pages linked to by pages in the root set. The expanded set might include from 1,000 to 3,000 pages.
From there on, text is ignored, and the application only looks at the way pages in the expanded set are linked to one another. The first time through, it identifies the pages that are pointed to most often by other pages and assigns them a score, or "weight," indicating that they are more likely to be authorities. At the same time it notes the pages that contain more links to other pages and gives them more weight as hubs.
This calculation is repeated several times. Each time the program gives more authority weight to sites that link to sites with more hub weight, and more hub weight to sites that link to sites with more authority weight. Ten repetitions, Kleinberg said, are enough to return surprisingly focused lists of authorities and hubs.
The system overcomes some special problems of text-based searches. For example, at one time a text-based search for "Gates" didn't return the Microsoft Corp. home page because Microsoft chairman Bill Gates wasn't mentioned on the opening page. (He still isn't, but now his biography can be found by following the link "About Microsoft.") A search for "jaguar" returns a jumble of pages about cars, animals, the Jacksonville Jaguars NFL team and the obsolete but still much-discussed Atari Jaguar computer.
Where a word represents more than one topic, Kleinberg's method automatically separates sites into "communities" of hubs and authorities on each of the possible topics. A HITS search on "jaguar" lists first a community of sites related to the Jaguar computer, because the number of web sites on this subject predominate. Further down, it listed communities relating to the football team, and the car and the animal.
Communities also form when a topic is polarized: A search on "abortion" returns separate communities of pro-life and pro-choice sites, because the sites within each community link more densely to one other than to sites advocating an opposing view.
One disadvantage of the method, Kleinberg said, is that it doesn't always work for sharply focused queries. A search for "Netscape 4.04," for example, returns a general list of sites about web browsers.
The paper presented in Brisbane is titled "Automatic Resource List Compilation by Analyzing Hyperlink Structure and Associated Text." Another paper by Kleinberg, "Authoritative Sources in a Hyperlinked Environment," was published in the Proceedings of the 9th ACM-SIAM Symposium on Discrete Algorithms, 1998. A related paper, "Inferring Web Communities from Link Topology," by Kleinberg, Gibson and Prabhakar Raghavan of the IBM Almaden Research Center, appears in the Proceedings of the 9th ACM Conference on Hypertext and Hypermedia, 1998.
The texts of these papers can be found on Kleinberg's web page at http://www.cs.cornell.edu/home/kleinber/.
Kleinberg developed the method while working as a visiting scientist at IBM's Almaden Research Center, on leave from Cornell. IBM has applied for a patent on the algorithm.
| Cornell Chronicle Front Page | | Table of Contents | | Cornell News Service Home Page |