
Jon Kleinberg: Buzzwords of history show the way to Web searches

By Bill Steele

In the years after the American Revolution, U.S. presidents talked a lot about the British, and then about militias, France and Spain. In the mid-19th century, words like "emancipation," "slaves" and "rebellion" popped up in their speeches. In the early 20th century, presidents started using a lot of business-expansion words, soon to be replaced by "depression."

A couple of decades later they spoke of atoms and communism. By the 1990s, buzzwords prevailed.

Jon Kleinberg, a professor of computer science at Cornell, has developed a method for a computer to find the topics that dominate a discussion at a particular time by scanning large collections of documents for sudden, rapid bursts of words. Among other tests of the method, he scanned presidential State of the Union addresses from 1790 to the present and created a list of words that eerily reflects historical trends. The technique, he suggests, could have many "data mining" applications, including searching the Web or studying trends in society as reflected in Web pages.

Kleinberg emphasized the Web applications of his searching technique in a talk, "Web Structure and the Design of Search Algorithms," at the annual meeting of the American Association for the Advancement of Science in Denver, Feb. 18. He was taking part in a symposium on "Modeling the Internet and the World Wide Web."

Kleinberg got the idea of searching over time while trying to deal with his own flood of incoming e-mail. He reasoned that when an important topic comes up for discussion, key words related to the topic would show a sudden increase in frequency. Searching for words that suddenly appear more often might, he theorized, provide a way to categorize messages.

He devised a search algorithm that looks for "burstiness," measuring not just the number of times words appear, but the rate of increase in those numbers over time. Programs based on his algorithm can scan text that varies with time and flag the most "bursty" words. "The method is motivated by probability models used to analyze the behavior of communication networks, where burstiness occurs in the traffic due to congestion and hot spots," he explained.
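The idea of scoring words by how fast their use rises, rather than by raw counts, can be illustrated with a rough sketch. This is not Kleinberg's algorithm, which is based on a probability model; it is a simplified stand-in that compares each word's share of the text in one time period with its share in the period before. The function names `burst_scores` and `top_bursty` and the toy mailbox data are assumptions made up for this illustration.

```python
from collections import Counter


def burst_scores(docs_by_period):
    """Score each word by its rise in relative frequency between periods.

    docs_by_period: list of token lists, one per time period, oldest first.
    Returns one dict per period after the first, mapping word -> increase
    in that word's share of all tokens compared with the previous period.
    Simplified illustration only, not Kleinberg's probability model.
    """
    rates = []
    for tokens in docs_by_period:
        counts = Counter(tokens)
        total = sum(counts.values()) or 1  # avoid dividing by zero
        rates.append({word: n / total for word, n in counts.items()})

    scores = []
    for prev, curr in zip(rates, rates[1:]):
        scores.append({word: rate - prev.get(word, 0.0)
                       for word, rate in curr.items()})
    return scores


def top_bursty(docs_by_period, k=3):
    """Return the k words with the largest rise for each period after the first."""
    return [sorted(s, key=lambda w: (-s[w], w))[:k]
            for s in burst_scores(docs_by_period)]


if __name__ == "__main__":
    # Invented toy data: two weeks of student e-mail, already split into words.
    weeks = [
        ["exam", "notes", "notes", "lecture"],
        ["prelim", "prelim", "notes", "lecture"],
    ]
    print(top_bursty(weeks, k=1))
```

A word that is common in every period, such as "lecture" in the toy data, scores near zero here, while a word that appears suddenly scores high. That is the distinction between frequency and burstiness the article describes.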

In his own e-mail -- largely from other computer scientists -- he quickly found key words relating to hot topics. In mail from students he found bursts in the word "prelim" shortly before each midterm exam. Later, he tried the same technique on the texts of State of the Union addresses, all of which are available on the Web, from Washington in 1790 through George W. Bush in 2002. From these speeches he produced a long list of words that summarizes American politics from early revolutionary fervor up to the age of the modern speechwriter.

For searching the Web, Kleinberg suggested, such a technique could help zero in on what a searcher wants by recognizing the time context of such material as news stories. For instance, he said, a person searching for the word "sniper" today is likely to be looking for information about the recent attacks around the nation's capital -- but the same search nearly four decades ago might have come from someone interested in the Kennedy assassination.
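The article does not say how a search engine would use time context, so the following is only a hypothetical sketch of one simple approach: among documents that match the query, prefer those published closest to the date the searcher cares about. The function `rank_by_time_context`, the tuple layout and the placeholder documents are all assumptions for illustration.

```python
from datetime import date


def rank_by_time_context(docs, query, when):
    """Return titles of matching documents, nearest in time to `when` first.

    docs: list of (title, published_date, text) tuples (hypothetical layout).
    query: a single search word, matched case-insensitively as a substring.
    when: the date that defines the searcher's time context.
    """
    hits = [d for d in docs if query.lower() in d[2].lower()]
    # Smaller gap in days between publication and `when` ranks higher.
    hits.sort(key=lambda d: abs((d[1] - when).days))
    return [d[0] for d in hits]


if __name__ == "__main__":
    # Placeholder documents, not real news stories.
    docs = [
        ("story-1963", date(1963, 11, 23), "Report on the sniper in Dallas"),
        ("story-2002", date(2002, 10, 10), "Sniper attacks near Washington"),
        ("story-other", date(2002, 10, 11), "Weather forecast"),
    ]
    print(rank_by_time_context(docs, "sniper", date(2002, 10, 20)))
    print(rank_by_time_context(docs, "sniper", date(1964, 1, 1)))
```

The same query returns the same documents in both calls; only the order changes with the date supplied.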

February 20, 2003
