Under Construction: The Data Interstate System


We live in boom times for data infrastructure. As more research fields embrace computational methods and new technologies generate larger and larger streams of data, universities, governments, disciplines, and companies are investing heavily in repositories, storage facilities, and other data resources. But as all of these efforts scale up independently, they carry the risk of spreading out important knowledge over hundreds or thousands of localities, creating obstacles and headaches for individuals and collaborations seeking data they need to advance discovery and education.
A new initiative, the National Data Service, seeks to build up the connective tissue between these swelling stores of data, ensuring that progress won’t be slowed by sequestration. Just as the construction of the U.S. interstate system in the 1950s connected the country’s cities, to the enormous benefit of the nation’s economy, the NDS hopes to construct a unified framework for the country’s scientific resources, where scientists can publish, discover, and collaboratively build upon the data needed for tomorrow’s great discoveries.
Led by the National Center for Supercomputing Applications (NCSA) and a consortium including the University of Chicago, the University of Illinois Urbana-Champaign, and the University of Texas at Austin, the NDS is currently honing its mission and recruiting partners from government, academia, and industry. Among the founding members is the Computation Institute’s Globus, which shares the NDS vision for accelerating discovery by making it easier to work with research data.
“The National Data Service is trying to catalyze a national community of people who would all join together and establish the necessary connections and services to greatly accelerate the amount of data that is stored, shared, and used in science,” said Ian Foster, director of the Computation Institute and principal investigator of Globus. “This is an exciting, broad-based initiative developing, and we’re looking for others to join in and participate.”
In its white paper and on its website, the NDS proposes building its framework around five principles: Integration, Discovery, Security, Publication, and Identifiers. By linking diverse data sources into a unified system, the NDS would let users search the entire universe of scientific data and, with identity services provided by Globus, access it from the proper source.
On the other end, the NDS can provide an easy system for scientists to publish and catalog their data, then link to it from associated journal articles, so that others can find, acquire, and build upon their work. Here too, Globus services can play a key role.
“We think the work we’re doing on Globus Publication and data discovery services is going to be an important element of NDS,” Foster said. “We hope to create a set of data services in which one can publish data very easily, store it at one of these locations, and discover data that has been published by different people, access it, and use it in research.”
To playfully illustrate how a scientist might use this framework, the NCSA prepared a video introduction to the NDS, showing how it would act as a one-stop shop for everything from gathering data to moving it for analysis at supercomputing centers to publishing the findings built upon it.

For such an ambitious idea to gain traction, its backers must first demonstrate these capabilities on a smaller scale. Last June, the consortium announced the establishment of a Materials Data Facility, supporting the White House Office of Science and Technology Policy’s Materials Genome Initiative (MGI). A massive, $250 million materials science research effort spanning multiple institutions and federal agencies, the MGI is an ideal demo for the NDS concept, with data- and computation-heavy science underway simultaneously at dozens of far-flung locations.
“This will be the first online facility to build on the objectives of the National Data Service by providing open access to as broad a range of materials science data,” said Ed Seidel, director of the NCSA, in that announcement. “This is a terrific opportunity to accelerate materials discovery and advance manufacturing, by deeply connecting research, data and publication activities.”
Other major scientific challenges, such as climate change, the human brain, and dark energy, may offer additional targets for early NDS pilots. In the meantime, the founders are also working to broaden the circle of participants to include national organizations, data and cyberinfrastructure providers, and scientific publishers.
“We have several publishers that we’re working with as part of the NDS consortium and pilot with the goal of driving connections between data publication and publication of scientific papers,” Foster said. “In principle, when you publish a scientific paper, you can include references in it to data that is stored in NDS facilities, and then people can find and access that data.”
In the end, the NDS hopes to create a user-friendly system for scientists, teachers, and the public to find and use scientific data and associated tools, without having to navigate the complex technical and legal obstacles of today’s scattered resources. As libraries of data grow around the country, the framework of the National Data Service is meant to keep that growth from slowing the pace of scientific discovery.
