Inside the Discovery Cloud: Deep Text Mining for Cancer & Disease


Text mining is often discussed in the context of humanities research or marketing, where an enormous pool of text can be computationally sifted for new insight or targeted advertising. But text mining is also gaining a foothold in biology and medicine, as researchers increasingly realize that the millions upon millions of journal articles published in these fields may hold previously undiscovered insights for understanding and treating disease. In fact, the massive corpus of scientific literature may be a gold mine for scientists of both the social and life variety, allowing historians to reconstruct the bumpy path of science, providing new perspective on the current landscape of research, and suggesting future directions that may be most fruitful and cost-efficient.

For the final Inside the Discovery Cloud event of the 2014-15 academic year, two CI researchers covered this whole timeline, with a focus on how biological text mining creates a promising new approach to finding effective treatments for cancer. The topic, "Deep Text Mining for Cancer & Disease," stems from a unique, DARPA-funded collaboration between two CI research centers: Knowledge Lab and the Conte Center for Computational Neuropsychiatric Genomics. James Evans, director of Knowledge Lab, and Ishanu Chattopadhyay of the Conte Center and the Institute for Genomics and Systems Biology, talked about how their two research agendas found common ground in the fight against one of the world's deadliest diseases.

Evans provided an overview of the methods Knowledge Lab uses to extract knowledge from scientific papers, grant applications, insurance claims, and other sources. From these millions of documents, Evans' group constructed a global map of research priorities by country, finding essentially no correlation between disease burden and research attention, and created a Health Research Opportunity Index (or Health ROI) that identifies "overstudied" and "understudied" diseases. Another study extracted chemicals, methods, and other elements from papers to build systems that represent how research in a field generates new hypotheses to test -- and may recommend the experiments more likely to be successful in the future.

Some of these networks mirror the work of the Conte Center in its search for new cancer treatments in the deep well of scientific literature. Chattopadhyay talked about how networks of molecular interactions and gene expression derived from text mining give scientists new models for finding potential new drug targets -- or combinations of targets -- to slow or disrupt cancer cells. Built on a combination of network theory, molecular biology, and genomics, these models provide predictions about effective therapies, including drug cocktails, that are then tested in the laboratory, generating new data that in turn can help improve the model.

This event was the final installment of our 2014-2015 Inside the Discovery Cloud series on "Catalyzing Collaboration." If you missed any of the talks, or would like to revisit those that you attended, they are archived here. Stay tuned for the announcement of next year's series. 
