Oct. 23, 2014

A fast track for data between Ithaca and New York City

A spin-off from research aimed at improving the performance of the Internet will give Cornell researchers a faster way to transfer large files between the Ithaca campus and Weill Cornell Medical College (WCMC) in New York City, starting with massive files of genomics data.

If the project, dubbed the Cornell Open Science Network (COSciN), is successful, it could help researchers everywhere. For that reason it is supported by a $1 million grant from the National Science Foundation (NSF), drawn from a special fund devoted to upgrading the cyberinfrastructure of university research. The techniques might also be applied to move data around in large data centers (“The Cloud”).

Principal investigators are Nate Foster and Hakim Weatherspoon, assistant professors of computer science, and Jason Mezey, associate professor of biological statistics and computational biology. They will partner with technicians from Cornell Information Technologies and from WCMC’s information technology staff. Foster also is working on an NSF-funded project to improve overall traffic flow on the Internet.

Mezey will be the first user of the new system. He has a joint appointment in the Department of Genetic Medicine at WCMC, and his research group spans both campuses. Researchers at WCMC are sequencing entire human genomes using next-generation technologies, and Mezey analyzes the results on a high-performance computer cluster in Ithaca. With these analyses, the researchers determine the genetic ancestry of individuals and identify genes important for diseases and other complicated aspects of human physiology.

Recently, for example, by analyzing whole genomes of people recruited as part of a joint study between the WCMC and Weill Cornell-Qatar campuses, they have found evidence of a continuous population on the Arabian Peninsula since humans migrated out of Africa.

In another collaborative study, by comparing cells from the lungs of smokers and nonsmokers, his group identified genes that are expressed differently when the cells are exposed to low levels of tobacco smoke. One important conclusion: Even low levels of second-hand smoke can affect a healthy lung.

Sequencing data and analysis files associated with one human genome can occupy up to a terabyte of storage space (1 trillion bytes, or about one-tenth of the Library of Congress), and Mezey also works with data sets provided by the National Institutes of Health (NIH) and other institutions that can take up hundreds of terabytes. Transferring a data set from New York to Ithaca electronically can sometimes take up to a week, even with Cornell’s dedicated 10-gigabit-per-second connection. When they need to share data with collaborators, Mezey’s team mostly puts the data on hard drives and mails them, which sometimes takes two weeks. NIH and others, meanwhile, refuse to mail hard drives, so the only way to get their data is by wire, however slow that may be.
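A rough back-of-the-envelope check shows why a week-long transfer points to a bottleneck other than the raw link speed. The sketch below assumes the link is rated at 10 gigabits per second (the usual unit for network links) and uses a hypothetical 100-terabyte data set as a stand-in for "hundreds of terabytes"; the function names are illustrative only.

```python
def transfer_hours(size_terabytes, link_gbps, efficiency=1.0):
    """Ideal time, in hours, to move a data set over a link.

    size_terabytes: data set size in terabytes (1 TB = 1e12 bytes)
    link_gbps:      rated link speed in gigabits per second
    efficiency:     fraction of the rated speed actually achieved (assumed)
    """
    bits = size_terabytes * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


def effective_gbps(size_terabytes, days):
    """Average throughput implied by moving a data set in a given number of days."""
    bits = size_terabytes * 1e12 * 8
    return bits / (days * 86400) / 1e9


# Hypothetical 100 TB data set on a fully used 10 Gb/s link.
print(f"ideal time: {transfer_hours(100, 10):.1f} hours")
# The same data set taking a full week implies a much lower average rate.
print(f"implied rate: {effective_gbps(100, 7):.2f} Gb/s")
```

Under these assumptions the ideal transfer takes under a day, so a transfer that takes a week is averaging only a fraction of the rated speed. That gap is the kind of slowdown (competing traffic, filtering, poorly chosen routes) the project is aimed at.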

Huge files of genomics data are becoming common, the researchers noted, because the cost and complexity of sequencing a genome are rapidly decreasing. And modern electronic data collection is generating huge files in many disciplines. Cornell has at least 10 laboratories that work with data sets as large as Mezey’s.

The team will deploy software-defined networking (SDN), which allows network hardware to be controlled remotely, to configure routers and switches between New York and Ithaca so that the big data packages get preferred status. Technicians will add programmable hardware where it has not been installed before. The SDN commands, in turn, will be issued by a program written in a language called Frenetic, developed by Foster to allow programmers to write networking programs without having to understand the electrical signals network devices use.
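The idea of giving preferred status to certain traffic can be illustrated with a toy model of prioritized match-action rules, the general style of rule a central SDN controller installs on switches. This is plain Python, not Frenetic and not any real switch API; the host names, queue names and the `Rule` and `classify` helpers are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    priority: int  # when several rules match, the highest priority wins
    match: dict    # header fields that must all equal the packet's values
    queue: str     # action: the output queue the packet is sent to


def classify(packet, rules):
    """Return the queue chosen by the highest-priority matching rule."""
    matching = [
        rule for rule in rules
        if all(packet.get(field) == value for field, value in rule.match.items())
    ]
    if not matching:
        return "default"
    return max(matching, key=lambda rule: rule.priority).queue


# Hypothetical rule set: genomics transfers between the campuses get a
# preferred queue; everything else falls through to best effort.
rules = [
    Rule(100, {"src": "wcmc-genomics", "dst": "ithaca-cluster"}, "bulk-priority"),
    Rule(0, {}, "best-effort"),
]

print(classify({"src": "wcmc-genomics", "dst": "ithaca-cluster"}, rules))
print(classify({"src": "office-pc", "dst": "ithaca-cluster"}, rules))
```

The point of the sketch is only that the policy lives in a small program rather than in hand-configured devices, which is the convenience a high-level language such as Frenetic is meant to provide.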

Weatherspoon will provide further control with SoNIC (SOftware defined Network InterfaCe), a technology that allows direct observation and manipulation of the pulses of light passing through a fiber-optic system. These observations can show the available bandwidth along various network paths and identify outages and other problems, allowing the COSciN system to direct big data traffic to the best route.
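As a simplified illustration of how such measurements could drive routing, the sketch below picks the path with the most available bandwidth and skips paths reporting an outage. The path names and numbers are made up, `best_path` is a hypothetical helper, and nothing here reflects how SoNIC or COSciN is actually implemented.

```python
def best_path(measurements):
    """Pick the path with the most available bandwidth.

    measurements maps a path name to its available bandwidth in Gb/s,
    or to None when the path is reporting an outage.
    Returns the chosen path name, or None if no path is usable.
    """
    usable = {
        path: bandwidth
        for path, bandwidth in measurements.items()
        if bandwidth is not None and bandwidth > 0
    }
    if not usable:
        return None
    return max(usable, key=usable.get)


# Hypothetical measurements for three routes between New York and Ithaca.
measurements = {
    "path-a": 2.5,   # congested
    "path-b": None,  # outage
    "path-c": 8.0,   # mostly idle
}
print(best_path(measurements))
```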

A special problem is that WCMC deals with confidential medical information protected by federal law, so its networks are configured to filter for such data, which can delay the transfer. The new project will include the creation of new, separate network paths into WCMC for the transfer of large databases.

If successful, the project will be extended to Weill Cornell-Qatar and to pathways to other universities and government agencies.