Cornell researchers receive $2 million federal grant for computational social sciences project using Web archive

A team of Cornell University researchers has been awarded a $2 million National Science Foundation (NSF) grant to develop advanced Web tools for social sciences research.

Ultimately intended to assist in the detailed statistical and observational study of social and information networks, the project will involve a team of computer scientists and social scientists developing the means -- dubbed "cybertools" -- to extract and analyze information from vast collections of data.

The project's primary source of data will be the Internet Archive (http://www.archive.org), which is supported by the NSF and the Library of Congress, among others. One of the first steps in the project, which is funded through 2007, will be to transfer 30 percent, or 200 terabytes, of the massive archive to a computer server at Cornell for use by researchers.

Developed by Brewster Kahle in 1996 and based at the Presidio in San Francisco, the archive comprises more than 40 billion Web pages. "This archive is the only copy that has been saved of how the Web has developed over the years," Cornell computer scientist William Arms said. In addition to archived Web pages, the archive includes text, audio, moving images and software.

"Faculty in computer science and the social sciences have been working together for many years at Cornell," said Michael W. Macy, sociology department chair and the project's principal investigator. "Cornell has the potential to be one of the leaders in computational social science; we have all of the pieces of the puzzle here."

Other principals in the cybertools project are sociologist David Strang and computer scientists Dan Huttenlocher and Jon Kleinberg, who was recently awarded a MacArthur Foundation Fellowship.

The Cornell project was among the finalists for funding when Huttenlocher made the cybertools presentation to the NSF in Washington on Aug. 1. Macy, who was in Japan at the time, also participated via speakerphone. The project proposal's official title is "Very Large Semi-Structured Datasets for Social Science Research."

"The Web is this amazing potential resource for data for social sciences work, but that takes some social scientists willing to be kind of guinea pigs and computer scientists willing to set aside their own interests," said Huttenlocher, who teaches technology management in the Johnson Graduate School of Management.

The computational social sciences research will include studies of the diffusion of innovation, which covers the spread of new technologies, social and business practices, markets, fads and fashions, as well as norms, opinions and urban legends.

"In 1972, the NSF began the General Social Survey, which became a mainstay of social science research," Macy said. "It is a very powerful tool. We see the tools we are building as having a similar impact in that they will open up to social scientists a wide array of ways to study social life we've never had access to in the past."

Web logs (personal online diaries also known as "blogs") on services such as LiveJournal and interactive community databases including the student directory Facebook also will provide data because, unlike in non-virtual communities, every interaction is recorded.

"Social life is remarkably difficult to study," Macy said. "We have reams and reams of statistics, but what we don't have -- and what it has been hard to get access to -- is interaction between the participants."

Professor of communication Geri Gay, who recently joined the cybertools team, has two undergraduate communication students who have already begun to collect data from LiveJournal.

"It's not only tracking what everybody posts, but information about the poster -- age, gender, interests, lists of all their friends," Macy said. "Of course, we don't know how truthful people are being, but we do know how others in the network are perceiving these demographic profiles, and that is also going to be very interesting to study as we map the opinion dynamics over time."

Among the areas of study the cybertools project will touch on are the evolution of social norms and polarization of opinion in evolving networks -- "seeing how network structure affects opinions among friends and enemies and how opinions in turn shape an evolving network structure," Macy said.

The cybertools research is part of "Getting Connected: Social Science in the Age of Networks," the 2005-08 interdisciplinary theme project of Cornell's Institute for the Social Sciences (ISS). Theme projects such as the current "Evolving Family" effort involve research projects, courses, events such as lectures by guest speakers and the engagement of constituencies both on and off campus.

"The NSF said they really did like the idea that we were making a commitment to studying networks, and that this was an interdisciplinary project over a long period of time," said David Harris, ISS executive director.

Macy also helped to write the networks proposal chosen for the ISS theme project and is the leader of its 10-member team, which involves scholars in disciplines including sociology, economics, mathematics, psychology and communication.

"We really tried to maximize the interdisciplinary nature of the group, as well as schools they were in, the kinds of things they were studying and the quality of the research they brought in," said ISS Director Elizabeth Mannix, who is in charge of the networks project.

"In the intersection of the social sciences community and the information sciences community, there's a very technical side and a very social side that really need to start talking to each other," Mannix said. "We are in a unique position at Cornell to do that."
