
eScience Reports, Day 2

By Rob Mitchum // October 11, 2012

The 2012 IEEE International Conference on eScience is taking place in Chicago this year, and we’ll be there Wednesday through Friday to report on talks about the latest in computational research. We’ll update the blog throughout the conference (subject to wifi and electrical outlet availability), and will tweet from the talks @Comp_Inst using the hashtag #eScience.

How to Get to All That Data (and When Do the Robots Take Over) (1:00 – 3:00)

A lot of information was shared this afternoon at the conference about the voting habits of people living in Melbourne, Australia. Two different but related projects from Down Under — the Australian Urban Research Infrastructure Network and esocialscience.org — demonstrated their web-based portals for sharing datasets collected about the country, and both chose to map the distribution of voters for the two major parties in Australia, the Labor Party and the Liberals (who are actually conservative, we learned). The presentations, by Gerson Galang of the University of Melbourne and Nigel Ward of the University of Queensland, showed both the mountains of data available to researchers with a few clicks in their browser and the very complicated machinery “under the hood” that makes such voluminous information — along with the analysis and visualization tools those researchers often need — so easily accessible.

Perhaps the most ambitious infrastructure presented was the Earth System Grid Federation, an international service for sharing software and data on topics such as climate change. The data challenges of this field are immense, said presenter Luca Cinquini: archives are currently at the petabyte scale and expected to reach exascale levels within the next five years, with the next generation of satellites generating terabytes of new data each day. The ESGF prepared for this information flood by moving to a peer-to-peer infrastructure, in which multiple “nodes” around the world store and serve the data. Cinquini gave a case study of a project — predicting sea surface temperature under different emission scenarios over the next 20 years — that would have taken multiple students months to complete just five years ago, but takes mere minutes today using the ESGF web interface.

But what about when the sheer amount of data being collected becomes too much for humans to sort through? That problem is already here for astronomy, where multiple telescope surveys are bringing in one-tenth of a terabyte of new data every night. That might not sound like much compared to the above example, but astronomy data is very complex, with many rare and diverse events potentially hidden within each image of the visible universe. Some citizen science projects have launched to help professional astronomers go through images that may hold interesting information, but the flood of data now — never mind in the next decade, when the stream grows to over 1 TB per night — is already too much for volunteers to keep up with. “There isn’t enough human attention in the world to look at all the data we are getting now, let alone years from now,” said S. George Djorgovski in his afternoon talk in the Data Mining and Machine Learning session.

So Djorgovski, from Caltech, talked about machine learning techniques that put the workload onto the shoulders of computers. Using machine learning algorithms, computers can be taught not only to detect whether something interesting is in a given image, but also to start classifying that interesting object, running down a hierarchical list of classifiers (e.g. “Is that a supernova, or not a supernova?”). The program can then spit out a list of high-priority objects for human follow-up, and even recommend the proper methods to best examine the object of interest.
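The cascade Djorgovski described — one yes/no question at a time, with survivors passed down the hierarchy for human follow-up — can be sketched as a toy triage pipeline. The feature names and thresholds below are invented for illustration, not taken from his actual system:

```python
# Toy sketch of a hierarchical classification cascade for astronomical
# transients: each stage answers one yes/no question, and events that
# survive every stage are flagged for human follow-up.
# Features ("snr", "brightness_change", "moving") and thresholds are
# hypothetical, chosen only to make the cascade's logic visible.

def is_real_event(event):
    # Stage 1: reject instrumental artifacts and noise.
    return event["snr"] > 5.0

def is_supernova_candidate(event):
    # Stage 2: "Is that a supernova, or not a supernova?"
    return event["brightness_change"] > 1.5 and not event["moving"]

def triage(events):
    """Run the cascade and return a prioritized follow-up list."""
    candidates = []
    for event in events:
        if not is_real_event(event):
            continue  # classified as noise; no human attention needed
        if is_supernova_candidate(event):
            candidates.append((event["id"], "supernova?"))
        else:
            candidates.append((event["id"], "other transient"))
    return candidates

events = [
    {"id": "T1", "snr": 12.0, "brightness_change": 2.1, "moving": False},
    {"id": "T2", "snr": 3.0,  "brightness_change": 0.2, "moving": False},
    {"id": "T3", "snr": 8.0,  "brightness_change": 0.4, "moving": True},
]
print(triage(events))  # → [('T1', 'supernova?'), ('T3', 'other transient')]
```

A production pipeline would replace these hand-set thresholds with trained classifiers at each node of the hierarchy, but the flow — filter, classify, prioritize — is the same.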

“The idea is robots talking to robots,” Djorgovski said. “My goal is to wake up, have an espresso, and find out what my robots discovered last night.”

What Computation Brings to Biology (10:30 – 12:00)

An interesting quality of computational science is how flexible it is, particularly in how it can form a layer over any existing scientific field and enlarge the possibilities within. One of the late-morning sessions today focused on biological applications of computation, with examples of how using algorithms to draw predictions from complex datasets can save life scientists time and money.

Much of this early crossover is currently taking place in the world of genetics, where new technologies are enabling faster, cheaper, and easier data collection. Taghrid Samak, from the University of California, Berkeley, presented her work with scientists studying gene synthesis, researchers who want to create genes to do things like break down plant cellulose for biofuels. Samak works with a laboratory that is looking in the cow stomach for such enzymes, but is faced with a challenge: there are many, many enzymes to choose from, and not all of them will be useful. Specifically, if an enzyme is insoluble, it would be a waste of time to do the extensive biological work it takes to determine its function. So Samak and her collaborators are working on a model that takes the gene sequence and other primary features of a given enzyme and predicts its solubility, using machine learning techniques that test the value of each feature for making an accurate prediction. The best models they’ve created achieve 90% accuracy, a valuable tool for pointing scientists to the most fruitful enzymes and away from those that would waste resources.
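The feature-testing step can be sketched in miniature: score each sequence-derived feature by how well a rule based on that feature alone predicts solubility. Everything below — the feature names, the threshold choice, and the data — is invented for illustration; Samak's actual model combines many features with far more sophisticated machine learning:

```python
# Toy sketch of scoring each feature's predictive value on its own.
# A one-feature rule predicts "soluble" when the value is above the
# midpoint of the feature's range; since the relationship may run in
# either direction, we keep the better of the rule and its inverse.

def single_feature_accuracy(examples, feature):
    """Best accuracy of a one-feature threshold rule (either direction)."""
    values = [e[feature] for e in examples]
    threshold = (min(values) + max(values)) / 2
    hits = sum((e[feature] > threshold) == e["soluble"] for e in examples)
    acc = hits / len(examples)
    return max(acc, 1 - acc)  # the rule may point either way

# Hypothetical labeled enzymes; "hydrophobicity" and "gc_content" are
# stand-ins for the sequence-derived features a real model would use.
enzymes = [
    {"hydrophobicity": 0.2, "gc_content": 0.6, "soluble": True},
    {"hydrophobicity": 0.8, "gc_content": 0.6, "soluble": False},
    {"hydrophobicity": 0.3, "gc_content": 0.4, "soluble": True},
    {"hydrophobicity": 0.9, "gc_content": 0.4, "soluble": False},
]

print(single_feature_accuracy(enzymes, "hydrophobicity"))  # → 1.0
print(single_feature_accuracy(enzymes, "gc_content"))      # → 0.5
```

A score near 0.5 means the feature carries no information on its own; scores well above it mark the features worth feeding into the full model.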

Mina Cintho from the University of São Paulo talked about a similar project that focuses on a different question: how can you predict whether a person with HIV will be resistant to one of the treatment options currently available? Drug resistance is usually caused by genetic mutations in the virus that allow it to evade the drug’s activity, and scientists have cataloged several common examples of these mutations. Cintho’s work involves looking for clusters of mutations, comparing them to available data about drug resistance to find their clinical phenotype (if known), then working back to create rules that highlight specific amino acids that are often mutated in drug-resistant patients. That knowledge can help doctors who treat HIV patients streamline each patient’s treatment — avoiding those drugs that they know in advance will not work.
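The rule-extraction idea can be illustrated with a toy version: given virus samples labeled resistant or susceptible to a drug, keep the mutations that appear often and almost always in the resistant samples. The mutation names below mimic standard notation (wild-type amino acid, position, mutant amino acid), but the data, thresholds, and method are invented and far simpler than Cintho's approach:

```python
# Toy sketch of working back from labeled mutation sets to rules:
# count how often each mutation co-occurs with resistance, and keep
# those that are both common enough (support) and seen mostly in
# resistant viruses (precision). Data and cutoffs are hypothetical.

from collections import Counter

def resistance_rules(samples, min_support=2, min_precision=0.8):
    in_resistant = Counter()
    total = Counter()
    for mutations, resistant in samples:
        for m in mutations:
            total[m] += 1
            if resistant:
                in_resistant[m] += 1
    rules = []
    for m, n in total.items():
        if n >= min_support and in_resistant[m] / n >= min_precision:
            rules.append(m)
    return sorted(rules)

# (mutation set, resistant-to-drug?) pairs for one hypothetical drug
samples = [
    ({"M184V", "K103N"}, True),
    ({"M184V"},          True),
    ({"K103N"},          False),
    ({"L90M"},           False),
]
print(resistance_rules(samples))  # → ['M184V']
```

In practice the clusters, phenotype data, and statistics are much richer, but the shape of the output is the same: a short list of mutations a doctor can check a patient's virus against before choosing a drug.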

Reproducibility in Computational Science (Keynote 8:30 – 10:00)

Computation is often described as an entirely new kind of science that will change the way that research is conducted across all fields. But this new science must still be governed by the centuries-old rules of the scientific method, such as the need for reproducibility. When results of an experiment are published, the authors also include a detailed methods section so that another scientist could recreate the experiment in their own laboratory and confirm the truth of the findings. While this may be relatively straightforward for the traditional “wet” laboratory, computational science makes things considerably more complicated, said Carole Goble of the University of Manchester in her keynote talk.

In some ways, reproducibility should be easier in computational science. Because the experiments inherently take place on a computer, archiving the digital data and the software codes used as methods should simply be a matter of recording and sharing those steps and results for anyone who wants to re-do the work themselves. But while this convenience should make computational science “the pinnacle of reproducibility,” it simply isn’t, Goble said. For one, the same outdated publication system is still used by computational scientists, “a 19th century way of reporting of 21st century science, where PDFs are printed on to bits of tree,” she said. Most scientists are also not in the habit of tracking their activities in the digital domain like they are with the classic paper lab notebook. And there’s the perennial paranoia about completely sharing data and software — Goble called the dominant frame of mind “data flirting,” when scientists describe enough about their data and code for you to be excited about it, but not enough for you to actually use it.

Fortunately, Goble described a flourishing ecosystem of tools that can help scientists keep a virtual lab notebook, share their data and code, and enable other scientists to reproduce their work if they wish. Some of the most exciting possibilities lie in the realm of “active publications,” interactive scientific findings with embedded software that allows readers to repeat the analyses that were originally used to generate the results. But in another sense, the proliferation of computational tools available to scientists can also be an obstacle to reproducibility. In the computational world, the “laboratory” is made up of the various programs used by the scientist in conducting his or her experiment, and exactly repeating those results would require using the same array of sometimes expensive, often customized and evolving software. When considering the complicated workflow of large computational collaborations, the scale of this problem is magnified several times over.
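The "virtual lab notebook" idea can be made concrete with a bare-bones sketch: wrap each analysis step so that its name, its parameters, and a digest of its output are appended to a machine-readable log that could be published alongside the paper. Real provenance tools of the kind Goble surveyed record far more (software versions, environments, data lineage); this only illustrates the principle, and every name in it is hypothetical:

```python
# Minimal sketch of automatic activity logging for reproducibility:
# a decorator records each analysis step's name, arguments, and a
# SHA-256 digest of its result, so a reader can verify that rerunning
# the step produces the same output.

import hashlib
import json

log = []

def recorded(func):
    def wrapper(*args, **kwargs):
        result = func(*args, **kwargs)
        digest = hashlib.sha256(
            json.dumps(result, sort_keys=True).encode()
        ).hexdigest()
        log.append({
            "step": func.__name__,
            "args": args,
            "kwargs": kwargs,
            "result_sha256": digest[:12],  # shortened for readability
        })
        return result
    return wrapper

@recorded
def normalize(values, scale=1.0):
    # A stand-in for any analysis step in the workflow.
    return [v * scale / max(values) for v in values]

normalize([2, 4, 8], scale=100)
print(json.dumps(log, default=list, indent=2))
```

Publishing such a log does not by itself make the work rerunnable, but it is exactly the kind of readable activity record that, as Goble notes below, can reassure readers the work was performed properly.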

Goble reassured her audience that sometimes partial reproducibility may be enough. Just the publication of an activity log that can be read by outside observers may be enough to reassure readers that the work was performed properly and the results can be trusted (“we’re just going to have to describe the s— out of everything,” was the typically plainspoken way Goble put it). Scientists in all fields also need incentives to make their work reproducible, putting some of the burden on journals and institutions to write and enforce rules about sharing data and code for an accepted publication. While it might not be the sexiest mission for a scientific community, carving out a new way of making science reproducible will not only make for stronger computational science, but could prop up the field as a model for the broader scientific community. “We can do better than those wet lab guys, not worse,” Goble said.