A Second Life for Newspapers in the Text Mine

By Rob Mitchum // December 7, 2012

Newspapers don’t always have the most exciting afterlife. A day or two after printing, most newspapers retire to a secondary role as kindling for the fireplace, stuffing for fragile items or a disposable surface for house-training pets. But the content of newspaper articles can have value long after publication for researchers interested in the daily, local pulse of a particular subject. Traditionally, information was extracted from old newspaper clippings by arduously crawling through endless microfiche files or (more recently) web pages. But new methods for text mining offer a fast, automated way to turn old newspaper articles into valuable information, which can then be poured into even more ambitious projects.

Those methods were the backbone of a talk at the Computation Institute by John T. Murphy of the CI and Argonne National Laboratory’s Decision and Information Sciences Division. An anthropologist, Murphy is interested in the ways that towns in the American west handle water management — a utility that many of us take for granted, but which can be a bitter political battlefield. To sum up these disputes, Murphy referenced a quote often attributed, albeit probably falsely, to Mark Twain: “Whiskey is for drinking, water is for fighting over.”

Murphy wanted to examine these “water wars” in four western communities: Las Vegas, Colorado’s Grand Junction and Arizona’s Flagstaff and Tucson. The distinct histories of these towns have led to vastly different challenges and battles today over the water supply, from the rising demands of the fast-growing Las Vegas area to the conservationist priorities of Flagstaff. A social scientist would typically need to spend a large amount of time and resources conducting ethnographic interviews and research, just to survey each unique situation and discover the large cast of characters (the government agencies, private companies, scientists and citizen groups) involved in each setting’s water management. Murphy worked with a group of collaborators: CI fellow Jonathan Ozik, Mark Altaweel (University College London), Lil Alessa and Andy Kliskey (University of Alaska, Anchorage), Richard Lammers (University of New Hampshire) and Nick Collier (Argonne and CI). Together they decided to take a different approach, letting newspaper articles do the work.

“We are trying to data mine essentially not what people are saying out loud, but what they are writing to each other to try to find out something about how they’re managing water,” Murphy said. “We’re trying to get a window into how people are reacting to the problems water can cause by figuring out how they perceive the condition of their water resource.”

So for the four communities of interest, the research team designed software that visited the websites of town newspapers and downloaded articles covering several-year spans. In some cases it downloaded every available article, captured by searching for the word “the.” This chaotic ocean of text is then organized by software called UIMA (Unstructured Information Management Architecture), which can categorize pieces of text as nouns, verbs, proper nouns and other forms of language. Murphy and his colleagues further trained the UIMA content analysis to recognize words related to water: not just “water” itself, but also related terms such as flood, irrigation or downpour. Each article is then given a score reflecting its number of water-related topics, so that researchers can quickly find the most relevant articles and conduct more complex searches (such as articles dealing with both “water” and “climate”).
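To make the scoring step concrete, here is a minimal Python sketch of the idea. It is not the team's UIMA pipeline: it simply counts exact matches against a short term list, and the function names (`water_score`, `rank_articles`) and sample articles are hypothetical. Only the four terms named above come from the talk.

```python
import re
from collections import Counter

# Illustrative term list: only the words named in the article.
# The real pipeline was trained on a broader vocabulary.
WATER_TERMS = {"water", "flood", "irrigation", "downpour"}


def tokenize(text):
    """Split text into lowercase alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def water_score(text, terms=WATER_TERMS):
    """Count how many tokens in the text exactly match a water-related term."""
    counts = Counter(tokenize(text))
    return sum(counts[term] for term in terms)


def rank_articles(articles, required=()):
    """Return (score, title) pairs, highest score first.

    Only articles containing every word in `required` are kept,
    which mimics a combined search such as "water" and "climate".
    """
    ranked = []
    for title, text in articles.items():
        tokens = set(tokenize(text))
        if all(word in tokens for word in required):
            ranked.append((water_score(text), title))
    return sorted(ranked, reverse=True)


# Made-up sample articles, for demonstration only.
articles = {
    "a": "Flood water covered the irrigation canal. Water rose.",
    "b": "The city council met.",
    "c": "Climate change may reduce water supply.",
}
top_articles = rank_articles(articles)
water_and_climate = rank_articles(articles, required=("water", "climate"))
```

Because the match is exact, a plural such as “floods” would be missed. A part-of-speech-aware system of the kind described in the talk would presumably handle such variants.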

While some of these tasks might be possible with expert search engine skills, the text-mining pipeline can also be tapped to filter out the primary “actors” in the water management of each town. The research team coded UIMA to locate water authorities within the documents, looking for various elements overlapping with water-relatedness that indicate an organization: an acronym, a preceding “the,” proper nouns. Apply these rules to the full database of text from a given newspaper, and the system can produce a list of potential water authorities ranked by their likely relevance. In the examples Murphy showed, the system still needed some tuning — in addition to water suppliers, government agencies and hydrologic research groups, it also captured laws and places related to water issues. But a human observer, armed with Google, could quickly sort out the hits from the misses.
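The rules described above can be approximated with a few lines of Python. This is a rough sketch under stated assumptions, not the team's UIMA code: it uses a regular expression in place of real part-of-speech tagging, treats any run of two or more capitalized words as a proper-noun phrase, and gives one point each for a preceding “the,” a trailing acronym in parentheses, and a water-related word in the name. The pattern, the scoring weights and the sample text are all hypothetical.

```python
import re
from collections import Counter

# Illustrative term list: only the words named in the article.
WATER_TERMS = {"water", "flood", "irrigation", "downpour"}

# An optional "the", then two or more capitalized words,
# then an optional acronym in parentheses, e.g. "(SNWA)".
CANDIDATE = re.compile(
    r"(?P<the>\b[Tt]he\s+)?"
    r"(?P<name>\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)"
    r"(?P<acro>\s+\([A-Z]{2,}\))?"
)


def find_candidates(text):
    """Return (name, score) pairs for possible organizations, best first."""
    scores = Counter()
    for match in CANDIDATE.finditer(text):
        name = match.group("name")
        score = 0
        if match.group("the"):
            score += 1  # preceded by "the"
        if match.group("acro"):
            score += 1  # followed by an acronym
        if WATER_TERMS & {word.lower() for word in name.split()}:
            score += 1  # the name itself is water-related
        scores[name] += score  # repeated mentions accumulate
    return scores.most_common()


# Made-up sample text, for demonstration only.
sample = (
    "The Southern Nevada Water Authority (SNWA) approved the plan. "
    "Officials in Las Vegas agreed. "
    "The Southern Nevada Water Authority will vote again."
)
candidates = find_candidates(sample)
```

As in the examples Murphy showed, simple rules like these also pick up places and other capitalized phrases, so the ranked list still needs a human pass.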

Once the players are determined for each region, Murphy hopes to use additional text-mining to detect the relationships between those groups, eventually producing the network of connections that lies beneath an area’s water management. Such a network could then be adapted for use in agent-based modeling, to simulate how different systems perform when tested with different stresses. For example, in a region with multiple water suppliers (as is the case in Grand Junction), the researchers can simulate what happens when one of them goes out of business. Ideally, the text-mining pipeline will also be robust enough to be applied to analyze any town’s water management…so long as the town still has a newspaper.
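The supplier-failure scenario can be illustrated with a toy stress test in Python. This is far simpler than a real agent-based model and is not drawn from the team's work: suppliers have a fixed capacity, each community draws from the suppliers it is linked to in order, and the output is the demand left unmet. The supplier names, capacities, links and demands are invented for illustration.

```python
def simulate(capacity, links, demand, failed=()):
    """Return unmet demand per community after a simple greedy allocation.

    capacity: supplier -> units of water available
    links:    community -> ordered list of suppliers it can draw from
    demand:   community -> units of water needed
    failed:   suppliers removed from the network for this scenario
    """
    remaining = {s: c for s, c in capacity.items() if s not in failed}
    unmet = {}
    for community, need in demand.items():
        for supplier in links.get(community, []):
            if supplier not in remaining:
                continue  # this supplier has gone out of business
            take = min(need, remaining[supplier])
            remaining[supplier] -= take
            need -= take
        unmet[community] = need
    return unmet


# Hypothetical two-supplier, two-community network.
capacity = {"A": 50, "B": 30}
links = {"north": ["A", "B"], "south": ["B"]}
demand = {"north": 40, "south": 25}

baseline = simulate(capacity, links, demand)
without_a = simulate(capacity, links, demand, failed=("A",))
without_b = simulate(capacity, links, demand, failed=("B",))
```

In this made-up network, losing supplier A hurts both communities, because the north falls back on B and leaves nothing for the south. Spillover effects of that kind are what a network-based simulation is meant to expose.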