DataIO

A few months ago, while a solution for the MTA budget shortfall was being debated by the New York State Senate,  The Open Planning Project helped parse MTA budget data into a machine searchable format. The MTA originally published the budget as a PDF. To extract the data I used a utility called pdftohtml to convert it into an XML document. I then used the python library lxml to convert the document into a set of csv files. The results of this labor can be seen on TOPP’s data site.

Soon after I published this data I was told by a number of people that the data would be more useful if presented in another format. At first I just started creating a bunch of command line python scripts that would suck in these csv files and spit them out in different formats. I quickly realized that I could accumulate these scripts and create a quick and dirty web application.

Over a few train rides I created an application called DataIO, and this week I finally got a chance to upload it to Google App Engine.  Specifically I received three requests for data in different formats.  I’ll give examples using the data set containing the MTA’s annual labor expenses.

  • JSON for Flot:
    Flot is a great javascript graphing library, but it’s not that easy to convert a CSV file into a Flot friendly format. After creating a data set in DataIO, you can request the data back in a JSON dictionary that can be plugged directly into Flot. For our Labor Expenses example this means simply constructing a URL such as this one:

    http://www.dataio.org/data/Wfb?format=flot&base_column=0&base_row=0

    The “base_column” query string parameter represents the column in the CSV file that will used for the legend of the graph. The “base_row” represents the row in the CSV file that contains the values for the x-axis of the graph.

    It’s not obvious how that JSON will display, so DataIO allows you to preview the graph by adding a “preview” query string argument:

    http://www.dataio.org/data/Wfb?format=flot&base_column=0&base_row=0&preview=true

  • Google Charts:
    Flot is great, but it’s not always the right solution. For example, if I wanted to add the Flot graph generated above into this blog post, I would have to load three javascript files onto this webpage. Google Chart offers a better solution for this use case: it creates this chart as an image, which can be included into a blog post without having to use Javascript. To construct the Google chart for our Labor expenses example, we can send DataIO the following request:

    http://www.dataio.org/data/Wfb?format=gchart_line&base_column=0&base_row=0

    which returns the URL for the following image:

    http://chart.apis.google.com/chart?cht=lc&chs=800x300&chd=e:yf1K2t4H5P6C,KdKjKWKbKiKq,MDM3NkOVPGP7,H8InI9JUJwKO,O1PRP-QLQSQw,KBKZKfKrK1K6,BgBEA6A-BAA8&chtt=Labor%20Expensed%20by%20Category&chdl=Payroll|Overtime|Health%20%26%20Welfare|OPEB%20Current%20Payment|Pensions|Other-Fringe%20Benefits|Reimbursable%20Overhead&chco=DC143C,912CEE,0000FF,00EE00,FFFF00,FF8C00,000000&chxt=x,y&chxl=0:|2007|2008|2009|2010|2011|2012|1:|-400.0|950.0|2300.0|3650.0|5000.0

  • Data multiplied by a factor:
    The MTA publishes all of their financial data in millions of dollars. Often it is useful to see the data in other units, such as dollars:

    http://www.dataio.org/data/Wfb?format=html&multiplication_factor=1000000&multiplication_start_row=1

    or in millions of Euros:

    http://www.dataio.org/data/Wfb?format=html&multiplication_factor=0.734&multiplication_start_row=1

    The number to multiply by is sent in via the multiplication_factor argument and the multiplication_start_row tells DataIO not to multiply the first row by the factor.

A complete list of query string arguments that can be used to interact with DataIO are located on its front page.  The code for this application is hosted at bitbucket.