Data as journalism

Data should be a driving force in online journalism, writes Rich Gordon of Northwestern in a post for the Readership Institute. In his view, the Gannett newspapers are leading the way thanks to the company’s restructuring of its newsrooms into converged “information centers.” Why data?

  • Data is “evergreen” content. Its value to users does not end after 24 hours.
  • Data can be personal. What’s more relevant to someone than, say, reported crimes in their neighborhood, or nearby property assessments?
  • Data can best be delivered in a medium without space constraints. The most valuable databases (say, property assessments or state employee salaries) contain too much information to publish in print. And even when print publishing is practical (say, listing real estate transactions in zoned editions), the data will be much more valuable if it is accessible and searchable at the user’s convenience.
  • Data takes advantage of the way people actually use the Web. It’s a medium for active behavior, such as research and interaction, not passive activity like reading or viewing.
  • Data, once gathered, can be excerpted in print. Once you’ve done the work of acquiring, formatting and enabling online access to data, it is easy to pull information from the database for traditional publications.

Gordon is particularly impressed with the Indianapolis Star’s “data central,” where the paper houses searchable databases on everything from property taxes to public safety, from education to sports.

The Star doesn’t confine itself to making data available; it also instructs readers on how to get data themselves: where to look, what they’re likely to find, and even how to file an FOI request. Plus there’s a terrific Q&A with the paper’s database expert, Mark Nichols, who explains how he works with data.

My motto – or mantra – has always been “all data is dirty,” and I think anyone working with raw data should take that to heart. To me, that means a dataset may have not only typos and other entry errors but also built-in limitations. In some cases it is a snapshot in time and may not reflect what’s happening right now. In other cases it covers only a sampling of people or incidents and may not reflect the full picture. In all cases, it’s just one source of information, not a be-all, end-all source. So the best advice I can give is to find out all you can about the dataset you’re working with – its strengths and its weaknesses – so you can draw sound conclusions from it.
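To make “all data is dirty” concrete, here is a minimal first-pass sketch of the checks a reporter might run before trusting a new dataset. The file name and column names below are hypothetical, chosen to echo the crime-data example above; the point is to profile completeness, duplicates, date range and inconsistent spellings before publishing any totals.

```python
import pandas as pd

# Hypothetical file and column names, purely for illustration.
df = pd.read_csv("reported_crimes.csv", parse_dates=["report_date"])

# Completeness: columns with many missing values hint at gaps in how
# the agency collected or released the data.
print(df.isna().mean().sort_values(ascending=False))

# Exact duplicate rows often mean double entry at the source.
print(df.duplicated().sum(), "duplicate rows")

# Date range: a file that looks current may really be a snapshot.
print(df["report_date"].min(), "to", df["report_date"].max())

# Inconsistent spellings ("Near North" vs. "NEAR NORTH") split one
# neighborhood into several; normalize before counting anything.
print(df["neighborhood"].str.strip().str.title().value_counts().head(20))
```

None of this proves the data is clean; it simply surfaces the obvious problems, and the questions for the source agency, early.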

Gordon draws several lessons from the Star’s database experience.

  1. Have a plan. Figure out what data you can find that would be useful to your online users and make a list. Then go get it.
  2. Involve experts in computer-assisted reporting. They know how to check data for entry errors that would otherwise skew the results.
  3. Put together a team. You need CAR expertise, research knowledge, programming skills and design abilities.
  4. Find tools that make it easy. The Star uses Caspio technology, which lets people without advanced skills publish data online. Django, an open-source web framework developed by journalists for journalists, is another option (a minimal sketch follows this list).
  5. Apply news judgment in deciding what and when to publish. Don’t just stick data on your site; put it in context and explain why it matters. Do journalism to it, in other words.
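To give a sense of what the Django route in Gordon’s fourth lesson involves, here is a minimal, hypothetical sketch of a searchable public-records app: one model to hold the records and one view to answer reader searches. The model, field and template names are my own illustration, not the Star’s actual setup, and a real project would also need Django’s usual settings, URL routing and a template.

```python
# models.py and views.py for a minimal, hypothetical "data central" app.
from django.db import models
from django.shortcuts import render


class PropertyAssessment(models.Model):
    owner = models.CharField(max_length=200)
    address = models.CharField(max_length=200)
    neighborhood = models.CharField(max_length=100)
    assessed_value = models.DecimalField(max_digits=12, decimal_places=2)
    assessment_date = models.DateField()


def search(request):
    # Case-insensitive substring match on whatever address the reader typed.
    query = request.GET.get("q", "").strip()
    results = (PropertyAssessment.objects.filter(address__icontains=query)
               if query else PropertyAssessment.objects.none())
    return render(request, "search.html", {"query": query, "results": results})
```

A nice side effect of this setup, per Gordon’s point about excerpting data in print, is that the same table that powers the search page can be queried again later to pull highlights for the paper.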

Gordon’s column is worth reading all the way through, for a thorough analysis of what’s possible today in database publishing, from basic delivery to telling stories through data. Thanks to Lost Remote’s Cory Bergman for pointing out the column.