This week, some 25 cities around the world are hosting events online and offline as part of Big Data Week, described by its organizers as a "global community and festival of data." The Chicago portion of the event features several people from the Computation Institute, including two panels on Thursday: "Data Complexity in the Sciences: The Computation Institute" featuring Ian Foster, Charlie Catlett, Rayid Ghani and Bob George, and "Science Session with the Open Cloud Consortium" featuring Robert Grossman and his collaborators. Both events are in downtown Chicago, free, and you can register at the above links.
But the CI's participation in Big Data Week started with two webcast presentations on Tuesday and Wednesday that demonstrated the broad scope of the topic. The biggest data of all is being produced by simulations on the world's fastest supercomputers, including Argonne's Mira, the fourth-fastest machine in the world. Mira boasts the ability to perform 10 quadrillion floating point operations per second, but how do you make sense of the terabytes of data such powerful computation produces on a daily basis?
In his talk "Big Vis," Joseph Insley of Argonne and the CI explained how he and his team have developed equally impressive visualization technology to keep pace with Mira's data firehose. Tukey, a 96-node visualization cluster, is Mira's sidekick, sharing the same software and file systems with its big sibling to more easily take in data and transform it into images. Insley demonstrated how visualization was instrumental in two major simulations conducted on Mira: one studying arterial blood flow and aneurysm rupture in the brain, and another on nothing less than the evolution of the entire universe.
That simulation, called HACC for Hardware/Hybrid Accelerated Cosmology Code, is still ongoing. At its peak, the 1.1-trillion-particle simulation uses two-thirds of Mira's 768,000 computing cores, Insley said, and completing the project will require roughly a billion computing hours. A single time step can produce 4 terabytes of data, roughly equivalent to the hard drive space on four high-end laptops. Insley described the parallelization and rendering strategies his team used to turn that literal universe of data into the jaw-droppingly detailed images seen below.
Rayid Ghani's live-stream presentation on Wednesday dealt with an entirely different sort of big data, though one with an even more direct real-world impact. Ghani, who just recently joined the CI, served as the chief scientist for the Obama campaign's data analytics team, and described that team's efforts to use online social networks for achieving campaign goals. While anyone can post a message to their Facebook page and hope for friends to "Like" and share the information, Ghani described how his team developed a method of "targeted sharing," to maximize the campaign's online engagement.
The tool worked by accessing a volunteer's "social graph" on Facebook (with their permission) and matching the data about their friends to the campaign's own massive database of voting age adults. The team could then rate those friends based on how likely they were to vote for Obama, how likely they were to be registered and how persuadable they were by their peers. A list of people who fell into the sweet spot of "likely Obama voter" and "likely to vote" was then sent back to the volunteer so that they could directly and personally contact them with information about registering to vote, campaign materials or volunteering possibilities.
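For readers who think in code, the matching-and-rating step might look something like the sketch below. It is not the campaign's actual tool: the `Friend` class, the score names, the 0.6 thresholds, the choice to list the most persuadable friends first and the sample data are all assumptions made up for illustration.

```python
from dataclasses import dataclass

@dataclass
class Friend:
    name: str
    support: float      # hypothetical score: chance of backing the candidate
    turnout: float      # hypothetical score: chance of voting at all
    persuadable: float  # hypothetical score: responsiveness to a peer's request

def match_friends(friend_names, voter_scores):
    """Keep only friends who can be found in the voter database (a dict here)."""
    return [Friend(name, *voter_scores[name])
            for name in friend_names if name in voter_scores]

def suggest_contacts(friends, min_support=0.6, min_turnout=0.6, limit=3):
    """Filter to the sweet spot, then list the most persuadable friends first."""
    sweet_spot = [f for f in friends
                  if f.support >= min_support and f.turnout >= min_turnout]
    sweet_spot.sort(key=lambda f: f.persuadable, reverse=True)
    return [f.name for f in sweet_spot[:limit]]

# Made-up data for illustration only: (support, turnout, persuadable).
voter_scores = {
    "Ana": (0.9, 0.8, 0.4),
    "Ben": (0.7, 0.9, 0.7),
    "Cal": (0.3, 0.9, 0.9),  # unlikely supporter, filtered out
    "Dee": (0.8, 0.2, 0.8),  # unlikely to vote, filtered out
}
friends = match_friends(["Ana", "Ben", "Cal", "Dee", "Eve"], voter_scores)
print(suggest_contacts(friends))
```

In this toy run, "Eve" drops out because she cannot be matched to the database, and only friends who clear both thresholds make the list that goes back to the volunteer.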
The main goal, Ghani said, was to explicitly define the amount of influence a person has on their social media network and harness it for campaign goals. Every time the team tried an algorithm for suggesting potential "targets" to volunteers on Facebook, they compared its performance against a random selection of friends in their network to conduct an experiment on the fly that could help improve their methods for the next round.
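The experiment Ghani describes, an algorithm's picks measured against a random selection, can be sketched in a few lines. This is only an illustration under stated assumptions: the function names, the 50/50 assignment of volunteers to each arm and the outcome data are hypothetical, not details from the talk.

```python
import random

def assign_arm(rng):
    """Randomly place a volunteer in the algorithm arm or the random arm."""
    return "algorithm" if rng.random() < 0.5 else "random"

def pick_friends(friends, arm, ranker, k, rng):
    """Suggest k friends: the ranker's top picks, or a uniform random sample."""
    if arm == "algorithm":
        return ranker(friends)[:k]
    return rng.sample(friends, min(k, len(friends)))

def action_rate(outcomes):
    """Share of contacted friends who took the requested action."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def lift(algorithm_outcomes, random_outcomes):
    """How much better the algorithm's picks did than the random baseline."""
    return action_rate(algorithm_outcomes) - action_rate(random_outcomes)

# Made-up outcomes: True means the contacted friend did what was asked.
algorithm_outcomes = [True, True, False, True]
random_outcomes = [True, False, False, False]
print(lift(algorithm_outcomes, random_outcomes))
```

A positive lift suggests the algorithm is choosing better targets than chance would; a real analysis would also need enough volunteers in each arm to tell a true difference from noise.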
"Experimentation was a key strategy for us to not only do something better, but really get to a point at the end where we were confident in what we were doing," Ghani said. "When you talk to typical people doing social media, they talk about influence as a high-level abstract concept. For the campaign, influence was very concrete -- if a person asks you to do something, how likely are you to do that?"
You can watch both Insley's and Ghani's talks on demand by registering (for free) and signing into Big Data Week's video player. Check out the rest of the week's talks at their Chicago site or the main page for the event.