Liberating Data at Journalism Data Camp NY

I spent last Friday at ScraperWiki’s Journalism Data Camp NY, hosted at the Columbia School of Journalism.

ScraperWiki is a really interesting project. There is a lot of data on the web that isn’t readily machine-readable. Turning this data into something more useful to the public – for example, by putting it on a map, making it more easily searchable, or analyzing it for newsworthy trends – is difficult.

Data like this might be nominally open data, but you can’t really do anything with it.

photo by Kelly Fincham


For example, your local health department might have a website that lists restaurant inspection reports. But it’s just a pile of HTML: it’s not searchable by date, it’s not sortable by the criteria you need, there’s no map, you can’t load it into a spreadsheet, you can’t derive statistics from it …

Typically, if you want to feed public data like that into some other software or website, you have to write your own “scraper” script to download the data, parse out the information you need, and put it into a spreadsheet, database, or whatever format you need. Then you move on to the next project, and six months later the health department redoes their website and your old client calls to ask why your code isn’t working anymore. Worse, if somebody elsewhere wants the same data, they have to do the same work. It’s even harder if the data is in a print-oriented format like PDF. We’ve dealt with these issues a lot while working on OpenBlock.
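To make the “parse it out and put it in a spreadsheet” step concrete, here is a minimal sketch of a scraper in Python, using only the standard library. The markup in SAMPLE_HTML, the restaurant names, and the function names are all made up for illustration; a real inspection page will be messier, and a real scraper would download the page first rather than read it from a string.

```python
import csv
import io
from html.parser import HTMLParser

# Hypothetical sample markup; a real health department page will differ.
SAMPLE_HTML = """
<table>
  <tr><th>Restaurant</th><th>Date</th><th>Score</th></tr>
  <tr><td>Example Diner</td><td>2011-01-15</td><td>92</td></tr>
  <tr><td>Sample Cafe</td><td>2011-02-03</td><td>78</td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def scrape_to_csv(html):
    """Parse table rows out of an HTML string and return them as CSV text."""
    parser = TableParser()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


print(scrape_to_csv(SAMPLE_HTML))
```

The fragility described above is visible here: the parser only knows about tr, td, and th tags, so a redesign that moves the data into some other markup would silently produce an empty file.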

ScraperWiki aims to change all that by making scraper scripts into public, collaborative resources that anyone can run, edit, copy, fix, and learn from. They take care of running your script regularly, provide conversion to several output formats, and can alert you when the script fails. If somebody else has already scraped some data you want, you can just use it.

Friday’s event was designed to get more data into the hands of journalists, by teaming people up into groups working on interesting data sets.  Journalist teams were thinking about what data they needed and what they could do with it if they had it, while programmers were taking their requests and putting scripts up on ScraperWiki.  There was also a training session for people wanting to learn how to scrape in Python or Ruby.

I headed straight for the “Liberate the Data” group, AKA the programmers’ corner, and after looking over the list of data sets people were interested in, I decided to scrape the New York City public school budgets.  Meanwhile, the guy next to me, Mike Caprio, was working on Iowa accident reports.

In a typical example of scraper-writing challenges, the NYC Department of Education website changed its markup every year, requiring lots of little tweaks to the parsing code. Getting a list of school IDs to submit to the DOE site was also puzzling; luckily the kind people at GothamSchools knew where to look. Just before heading home Friday I got it all working, and immediately learned that the result wasn’t very useful: it turned out there were about 2000 schools and 2000 budget line item categories, and even if your spreadsheet software could make a 2000×2000 sheet, how would you ever read it? While I was on the way home, along came ScraperWikian extraordinaire Julian Todd, who started trying to derive “major” categories from the disorganized set of categories and, more importantly, made each row hold one category per year per school, rather than all categories per year per school. I hadn’t even looked for any such help; it just arrived unasked-for.
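The reshaping idea, going from one very wide row per school per year to one row per category per year per school, can be sketched roughly like this in Python. This is not Julian’s actual code; the field names, school IDs, categories, and amounts are invented for illustration.

```python
def wide_to_long(wide_rows, id_fields=("school_id", "year")):
    """Turn one-row-per-school-per-year records into one row per category.

    Each input row is a dict holding the identifying fields plus one key per
    budget category. Categories with no amount (None) are skipped, so the
    output only contains line items a school actually has.
    """
    long_rows = []
    for row in wide_rows:
        ids = {field: row[field] for field in id_fields}
        for category, amount in row.items():
            if category in id_fields or amount is None:
                continue
            long_rows.append({**ids, "category": category, "amount": amount})
    return long_rows


# Hypothetical data: the school IDs, categories, and amounts are made up.
wide = [
    {"school_id": "A001", "year": 2010, "Textbooks": 1200, "Supplies": 300},
    {"school_id": "B002", "year": 2010, "Textbooks": 900, "Supplies": None},
]

for long_row in wide_to_long(wide):
    print(long_row)
```

Since most schools presumably use only a fraction of the categories, the long form avoids a mostly empty 2000×2000 grid and is much easier to filter, sort, and total in a spreadsheet or database.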

Having ScraperWiki staff on hand during the event was really useful, and not just for all the free refreshments they kept offering me.  CEO Francis Irving was easily approachable. It turned out I needed a Python library that could open Excel 2007 XLSX files; a quick search found that one existed, but ScraperWiki didn’t have it in their cloud environment.  I mentioned this to Francis and five minutes later it was available to all ScraperWiki users.

I hope they do more such events in the future and wish I could have stayed longer.

For tweets of the event, check out the #jdcny hashtag.