By Bill Steele
The National Science Foundation (NSF) has awarded Cornell $1.8 million to develop a computer system to store and manipulate massive amounts of data. Initially, the system will support three data-intensive research projects in astronomy, computer graphics and the Internet.
Eventually the system will hold 1 petabyte of data on an array of over 1,000 hard disk drives. A petabyte is 1,000 terabytes, and a terabyte is 1,000 gigabytes. A modern desktop computer's hard disk typically holds around 20 to 40 gigabytes. It's estimated that the entire contents of the Library of Congress could be stored in about 10 terabytes.
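To put those figures side by side, here is a minimal sketch of the arithmetic, assuming the decimal units given above (1 petabyte = 1,000 terabytes, 1 terabyte = 1,000 gigabytes). The function names are illustrative and are not part of the Cornell system.

```python
# Decimal units as stated in the article.
GB_PER_TB = 1_000
TB_PER_PB = 1_000


def petabytes_to_gigabytes(petabytes):
    """Convert petabytes to gigabytes using decimal units."""
    return petabytes * TB_PER_PB * GB_PER_TB


def average_gb_per_drive(total_pb, drive_count):
    """Average capacity each drive must supply to reach the total."""
    return petabytes_to_gigabytes(total_pb) / drive_count


def desktop_disks_equivalent(total_pb, desktop_gb):
    """How many desktop disks of a given size equal the total."""
    return petabytes_to_gigabytes(total_pb) / desktop_gb


def library_of_congress_copies(total_pb, loc_tb=10):
    """How many 10-terabyte Library of Congress estimates fit in the total."""
    return total_pb * TB_PER_PB / loc_tb


if __name__ == "__main__":
    print(petabytes_to_gigabytes(1))          # gigabytes in 1 petabyte
    print(average_gb_per_drive(1, 1_000))     # GB per drive with exactly 1,000 drives
    print(desktop_disks_equivalent(1, 40))    # 40 GB desktop disks
    print(desktop_disks_equivalent(1, 20))    # 20 GB desktop disks
    print(library_of_congress_copies(1))      # copies of the 10 TB estimate
```

By this reckoning, 1 petabyte equals 25,000 to 50,000 typical desktop disks, or about 100 copies of the Library of Congress estimate. Because the array will have more than 1,000 drives, the actual average per drive would be below the 1,000 gigabytes computed for exactly 1,000 drives.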
The hardware will be housed and maintained by the Cornell Theory Center (CTC). Microsoft, Unisys and Intel are contributing to the project. Additional storage and networking upgrades will be purchased in subsequent years until the total storage is more than a petabyte.
"It goes beyond storage space," the project's principal investigator Alan Demers, Cornell professor of computer science, explained. "It's storage tightly coupled to something that can do serious computing on it. A storage system of this size is still put together out of independent disks, but there are many, many more of them, so you have to provide for multiple high-speed paths to the data."
The project will begin by supporting three research projects:
A team of astronomers led by James Cordes, Cornell professor of astronomy, will analyze data from the Arecibo radio telescope to find pulsars, which are fast-spinning neutron stars, and other distant objects. The team is working with Johannes Gehrke, Cornell assistant professor of computer science, and others in the computer science department's database group. Arecibo is the world's largest radio telescope in terms of collecting area and thus can conduct the most sensitive surveys for pulsars. A new multi-beam feed system has increased the survey speed of the observatory by a factor of seven.
Arecibo is acquiring data on the order of a terabyte a day. This is too much to be sent over the Internet, at least with the connections available from Puerto Rico. Instead, the observatory ships the data on high-capacity disk packs. "Part of the work here is simply being able to acquire data at that rate," Demers says.
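A rough calculation shows why shipping disks is attractive. The sketch below, assuming decimal units (1 terabyte = 10^12 bytes) and a perfectly efficient link, estimates the sustained bandwidth needed to keep up with a terabyte a day. The 10-megabit-per-second link in the usage example is a hypothetical figure chosen for illustration; the article does not state the observatory's actual connection speed.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BITS_PER_TERABYTE = 1e12 * 8  # decimal terabyte, 8 bits per byte


def required_megabits_per_second(terabytes_per_day):
    """Sustained rate needed to move each day's data within one day."""
    bits_per_day = terabytes_per_day * BITS_PER_TERABYTE
    return bits_per_day / SECONDS_PER_DAY / 1e6


def days_to_transfer(terabytes, link_megabits_per_second):
    """Days needed to send the data over a link of the given speed."""
    seconds = terabytes * BITS_PER_TERABYTE / (link_megabits_per_second * 1e6)
    return seconds / SECONDS_PER_DAY


if __name__ == "__main__":
    # Rate needed to keep pace with 1 terabyte per day.
    print(round(required_megabits_per_second(1), 1))
    # Hypothetical 10 Mbit/s link: days needed to send one day's data.
    print(round(days_to_transfer(1, 10), 1))
```

Keeping pace with a terabyte a day requires a little over 90 megabits per second around the clock, before any protocol overhead. On a slower link the backlog grows every day, which is why physical disk packs are the practical choice.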
The proposed surveys for pulsars include searching the entire galactic plane of the Milky Way visible with Arecibo and searching outside the galactic plane in a shallower survey to find millisecond pulsars and binary pulsars. "We're basically in a discovery phase," Cordes says. "We know how to find the things we already know about [like pulsars]. We also expect there are objects we don't know about."
The pulsar surveys will be the deepest (that is, reaching to the greatest distances) ever undertaken and are expected to yield about 1,000 new pulsars, Cordes says. They also are expected to turn up other exotic objects, including millisecond pulsars spinning near the break-up speed of a neutron star, neutron stars in compact binaries with orbital periods of a few hours or less, and companion stars to pulsars that are also neutron stars, or perhaps black holes.
The work will draw on expertise in data mining of the database group in the Department of Computer Science. "This is the first step in developing collaborations between astronomy and computer science," Cordes says.
Also generating huge amounts of data every day is a project by assistant professors of computer science Steve Marschner and Kavita Bala aimed at creating more realistic computer graphics by studying in detail the way light is reflected from real objects.
The researchers have mounted a camera on a spherical gantry. Complex three-dimensional objects, as well as materials such as skin, hair and cloth, will be illuminated from thousands of directions, and the reflected light will be measured by aiming the camera from thousands more directions. "Each image taken by the camera is 1 to 6 megabytes of data," Marschner explains. Photographing a single object from thousands of angles will generate approximately 50 terabytes of data.
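The figures quoted above imply a very large number of images per object. The following sketch, assuming decimal units (1 megabyte = 10^6 bytes, 1 terabyte = 10^12 bytes), works out how many images of a given size add up to 50 terabytes. The function name is illustrative, and the calculation is only a back-of-the-envelope check, not the researchers' own accounting.

```python
BYTES_PER_MEGABYTE = 10**6
BYTES_PER_TERABYTE = 10**12


def images_in_dataset(dataset_terabytes, image_megabytes):
    """Whole number of images of the given size that fit in the dataset."""
    dataset_bytes = dataset_terabytes * BYTES_PER_TERABYTE
    image_bytes = image_megabytes * BYTES_PER_MEGABYTE
    return dataset_bytes // image_bytes


if __name__ == "__main__":
    # Range implied by images of 1 to 6 megabytes in a 50-terabyte dataset.
    print(images_in_dataset(50, 6))  # largest images, fewest of them
    print(images_in_dataset(50, 1))  # smallest images, most of them
```

At 1 to 6 megabytes per image, 50 terabytes corresponds to roughly 8 million to 50 million images. That is consistent with pairing thousands of lighting directions with thousands of camera directions, since the number of combinations is the product of the two.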
The data can be used to model the actual object computationally, as well as in research on the fundamental properties of materials and on how to represent complex objects efficiently and realistically. "What we get from this project is a place to store all this stuff, and a lot of computing power on a machine right next to the data," Marschner says.
Detailed imaging of real objects is of interest to archaeologists and librarians, among others, he says. One application, he suggests, is a virtual museum. "Rare artifacts can be digitized into representations that are accurate enough to produce highly realistic views from any reasonable distance and under any kind of lighting. The digitized images could create a real-time, fully realistic, immersive experience for a museum collection that could never be assembled in reality."
Another place to find a lot of data is the independent, nonprofit Internet Archive, which since 1996 has stored as much of the Web as possible in snapshots taken every three months.
Jon Kleinberg, associate professor of computer science, and Dan Huttenlocher, the John P. and Rilla Neafsey Professor of Computing, Information Science and Business at the Johnson Graduate School of Management, will download many of those snapshots, then analyze and compare them to develop precise models for how the Web evolves.
"Most contemporary research uses a current snapshot," Huttenlocher says. "Both of us have gotten interested in temporal models." The researchers will, for example, look at the traffic patterns of visitors to popular Web sites and how they change with time. Traffic patterns might change, Huttenlocher surmises, because of new hyperlinks or increased interest in a topic. Kleinberg has developed techniques for scanning text to determine what topics are "hot" in a given time period, and the researchers will try to correlate this with the behavior of individual Web sites.
"You might call it quantitative sociology," Huttenlocher says. "We hope to be discovering things about the structure of the Web that can be used to provide people with feedback, and so change the system."
The researchers also will study the "Deep Web" -- information stored in databases and other forms that are not immediately available to casual Web surfers -- and the related areas of scientific publishing and digital libraries.