March 13, 2011
Pick a corpus, any corpus
A few weeks ago I participated in a brainstorming session exploring the kinds of academic research projects the WikiLeaks archives might generate. Beyond the substantive specifics of the leaked cables, the media coverage of Cablegate, and their impact on geopolitics, a central concern we recognized is the challenge of transforming torrents of qualitative data into narratives, arguments, and evidence.
The impact that technology is having on what’s knowable and how we go about knowing is a theme I have been chewing on for years – one that goes well beyond journalism and cuts across the social sciences, law, education, etc. There is an urgency to this problem, since the tools and techniques involved in these analyses are unevenly distributed. High-end corporate law firms, marketing agencies, and political parties are all embracing new approaches to making sense of petabytes. Unfortunately, impact law firms, social scientists, and journalists often don’t even know these tools exist, never mind how to use them. This is part of what I call the organizational digital divide.
During our brainstorming I formulated a new twist on a possible research agenda. I realized how daunting it has become to evaluate and calibrate the emerging suites of digital instruments. Many tools now promise to analyze large troves of data, but it is difficult to determine what each tool is best at, and whether it does its job well.
One good way to benchmark our digital instruments is to select a standard corpus and spend lots of time researching and studying it until it is fairly well understood. Similar to the role the Brown Corpus played in computational linguistics, data miners need a training ground where we can test, hone, and sharpen our digital implements. If we bring a new tool to bear on a well-understood archive, we can evaluate its performance relative to our prior understanding, as sketched below.
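To make that concrete, here is a minimal sketch of what such calibration might look like in Python, using NLTK's packaging of the Brown Corpus as the well-understood reference. The tool under evaluation, my_keyword_tool, is a hypothetical stand-in, and "prior understanding" is reduced to a simple word-frequency baseline purely for illustration.

# A minimal calibration sketch, assuming NLTK's Brown Corpus as the
# well-understood reference. `my_keyword_tool` is a hypothetical new
# instrument whose output we want to check against what we already
# know about the corpus.
import nltk
from nltk import FreqDist
from nltk.corpus import brown

nltk.download('brown', quiet=True)

# Prior understanding (stand-in): the most frequent content words in the
# 'news' category of the Brown Corpus.
reference = FreqDist(w.lower() for w in brown.words(categories='news')
                     if w.isalpha() and len(w) > 3)
known_top = {w for w, _ in reference.most_common(50)}

# Hypothetical new instrument under evaluation (placeholder implementation).
def my_keyword_tool(words, k=50):
    counts = FreqDist(w.lower() for w in words if w.isalpha() and len(w) > 3)
    return {w for w, _ in counts.most_common(k)}

candidate = my_keyword_tool(brown.words(categories='news'))

# Calibration: how well does the tool's picture of the corpus agree with
# our prior understanding of it?
overlap = len(candidate & known_top) / len(known_top)
print(f"Agreement with prior understanding: {overlap:.0%}")

The particular metric is beside the point; any instrument (a topic modeler, an entity extractor, a visualization) could be swapped in, and any facet of the corpus we genuinely understand could serve as the yardstick.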
Currently Wikipedia serves as the de facto benchmark for many digital tools, though, since it’s a moving target, it is probably not the best choice for calibration. In many respects the selection of this kind of corpus can be arbitrary, though it needs to be adequately sophisticated, and we might as well pick something that is meaningful and interesting.
The WikiLeaks documents are an excellent contender for training the next generation of digital instruments and data miners. The AP is hard at work on new approaches to visualizing the Iraq War logs, and just last week there was a meetup, Data Science & Data Journalism, for hacks and hackers working on the WikiLeaks documents. It is easy to see how Knight-funded projects like DocumentCloud converge on this problem as well. Ultimately, I think these efforts should move in the direction of interactive storytelling, not merely a passive extraction of meaning. We need tools that enable collaborative meaning-making around conceptual space, similar to what Ushahidi has done for geographic space.
Filed by jonah at 1:53 am under earth, epistemology, fourthestate, freeculture