New service helps researchers manage, share data

When you have to manage a lot of information, who you gonna call?

At Cornell, you want the Research Data Management Service Group (RDMSG), co-sponsored by Cornell University Library and the Office of the Vice Provost for Research. The RDMSG helps researchers plan how to organize and document their data, find a place to keep it and, perhaps most important, deal with the data management requirements of funding agencies.

The group draws on the resources of Cornell Library, the Center for Advanced Computing (CAC), Cornell Information Technologies (CIT), the Cornell Institute for Social and Economic Research (CISER) and Weill Cornell Medical College Information Technologies and Services. An important stimulus for its creation was the announcement that funding proposals submitted to the National Science Foundation (NSF) must include a "data management plan" to describe how data generated by research will be shared. The National Institutes of Health and the National Endowment for the Humanities are adding similar requirements, and some Cornell researchers have data management needs that go even beyond agency requirements.

"Various groups on campus have been talking about this brave new world of data-driven science, and how do we support it," said Gail Steinhart, research data and environmental sciences librarian at Albert R. Mann Library, an RDMSG team ember who helped design the program. "We are in the first wave of researchers needing help. Most scientists when faced with the choice between documenting their data or doing more science are going to choose more science."

Science is "open source." Researchers publish descriptions of their work so that others can repeat or extend it. Likewise, when they make the data they collect public, others can try analyzing it in new ways. A "data set" could consist of readouts from the Large Hadron Collider or reports to Cornell's Lab of Ornithology from backyard bird-feeder observers. Some data sets can be huge: Astronomers collect terabytes of data scanning the skies, and social scientists tabulate the activities of millions of Facebook and Twitter users.

"A lot of our consultations include recommending places where [researchers] could place their data," Steinhart said At Cornell, CISER, CIT, CAC and the library's eCommons repository offer places to store and share data, and scientific journals and professional societies offer other options..

But planning to share data is not just a matter of finding space. Each site may use different data formats, from an Excel spreadsheet to a multitable mainframe database. Some data, such as medical records or student grades, must be kept confidential. Data sets also must include metadata -- "data about data" -- that records how the data was collected, where and by whom, and how it is organized. Current computer formats may become outdated, so systems must be established to copy and recopy files to new formats. Researchers also must plan to pass on responsibility for the data if the entity that created it ceases to exist. (Catch-22: Preservation costs may endure long after the expiration of research funding.) Finally, there may be intellectual property issues: Does the data set contain copyrighted material?
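
For readers who want to see what "data about data" might look like in practice, here is a minimal sketch in Python. It is not an RDMSG template or any funding agency's required format; the class name, field names and sample values are all hypothetical, chosen only to mirror the items listed above (how, where and by whom the data was collected, how it is organized, and whether it is confidential).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class DatasetMetadata:
    """Hypothetical metadata record; field names are illustrative only."""
    title: str
    collected_by: str       # who gathered the data
    collected_where: str    # where it was gathered
    method: str             # how it was gathered
    organization: str       # how the files or tables are laid out
    confidential: bool = False  # True for, e.g., medical records or grades


def to_json(meta: DatasetMetadata) -> str:
    """Serialize the record as plain-text JSON that can sit beside the data files."""
    return json.dumps(asdict(meta), indent=2, sort_keys=True)


# Made-up sample values, not a real data set.
example = DatasetMetadata(
    title="Backyard feeder counts (made-up example)",
    collected_by="Volunteer observers",
    collected_where="Ithaca, N.Y.",
    method="Weekly counts reported by observers",
    organization="One row per observation: date, species, count",
)
print(to_json(example))
```

Keeping such a record in a plain-text file next to the data is one common choice, since the description then travels with the data set when it is copied to a new home.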

The data management plans required by funding agencies must describe all this and more. RDMSG experts advise both on how to accomplish it and how to describe it. Their website at http://data.research.cornell.edu is the place to start.