Does usefully open data have to mean XML?

I’ve been having some discussions with people at the Chicago Open Government group, talking about data openness. One common complaint all around is about data exported as PDFs. The particular topic we were discussing was TIFs. TIFs (Tax Increment Financing) are something a city can use to try to improve a neighborhood, and fund the improvements with the increased tax revenue from rising property values in those neighborhoods. These are used in many cities, and seem generally surrounded by an air of controversy.

Chicago in particular recently passed a law to open up TIF information. How the data will be opened up isn’t specified very closely; it will probably be via published PDFs, along with some shape files to define the neighborhoods. (Shape files seem relatively easy to obtain, probably because they are already most easily managed electronically.)

There was some vague talk about opening up the data as XML… but what would that even mean? To be fair to the city, the TIFs are actually defined by documents, and a PDF is a relatively accurate representation.

In general this idea of “XML” is confusing. XML is just a syntax for holding structured data, and there’s no particular structure that this data should conform to. There is MathML for talking about mathematical equations. There is KML for geographical information. But there’s no TIFML, no PolicyML, no GovernmentML. Though, somewhat surprisingly to me, there is a government-sponsored StrategyML and what appears to be an aborted attempt at PlanningML. I have reservations about any markup language, which I’ll discuss below, but if people wanted these documents in StrategyML, that would at least mean something. “XML” on its own does not.
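To make that concrete, here is a rough sketch in Python. Both documents are invented for illustration: the element names, the district and the dollar figure are all hypothetical, and neither vocabulary is a real standard. Each one is perfectly good XML, and each carries the same fact, yet code written for one is useless against the other.

```python
import xml.etree.ElementTree as ET

# Two hypothetical encodings of the same fact. Both vocabularies are made up;
# nothing here is a real schema or real TIF data.
doc_a = "<tif><name>Example District</name><budget year='2010'>1000000</budget></tif>"
doc_b = (
    "<district name='Example District'>"
    "<fiscal-year value='2010' increment='1000000'/>"
    "</district>"
)


def budget_from_a(xml_text):
    """Read the budget from the first (hypothetical) vocabulary."""
    root = ET.fromstring(xml_text)
    return int(root.find("budget").text)


def budget_from_b(xml_text):
    """Read the budget from the second (hypothetical) vocabulary."""
    root = ET.fromstring(xml_text)
    return int(root.find("fiscal-year").get("increment"))
```

Each reader works only on its own document. Feeding `doc_b` to `budget_from_a` parses without complaint and then fails, because there is no `budget` element to find. Agreeing on “XML” settles the syntax; it settles nothing about what the data says.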

What is the purpose of opening up TIF data? Maybe:

  1. See how much money is redirected to the TIF
  2. See what that money is used for
  3. See any measurable outcomes of the TIF
  4. See the TIF charter, the document identifying what purpose the TIF is supposed to serve

There is some budgeting data that would be an excellent candidate for a structured presentation. But a substantial portion of the information is not structured. The charter has no structure; it is a narrative document. It is also essential context for understanding anything else. You can’t say that the budget is too big or that any one item is wasteful, except in relation to the purpose of the TIF, and that charter defines the purpose. A TIF zone set up to encourage tourism should be managed very differently from a zone where they are fighting urban blight, or encouraging light industry, or pursuing transit-oriented development.

There is also the simple question of fact. A TIF is a political entity, set up by politicians, and it is a formal agreement. All the people involved work with documents. They do not write markup. The document means what was on paper. Extracting underlying semantics is not true to the process itself. (In this I am quite influenced by the principles of Microformats.)

So, what to do? The answer I see is one of annotation, not structure. The document should be posted in as accessible a form as accuracy allows. HTML, preferably as simple as possible, is an excellent candidate, nearly as representative as PDF but more accessible (though PDF lets you guard against OCR errors by keeping the original scan close at hand). From there portions of the document should be tagged. If there is a commitment from the city, tag it as such. If there is an expected outcome, tag that. Make the document easy to reference in granular pieces, so people can discuss the details.
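As a minimal sketch of what this might look like, assume the charter is posted as plain HTML, each paragraph has an `id`, and tagged passages are wrapped in a `span` with a class naming the kind of annotation. The class names `commitment` and `expected-outcome`, the paragraph ids and the sample text are all my own inventions, not an existing convention. A few lines of Python using the standard library are then enough to pull out every tagged passage along with the paragraph it came from:

```python
from html.parser import HTMLParser

# Invented sample text; the ids and class names are assumptions for illustration.
SAMPLE = """
<p id="p1">The district exists to encourage light industry.</p>
<p id="p2">The city <span class="commitment">will repave two miles of road</span>,
and expects <span class="expected-outcome">property values to rise</span>.</p>
"""


class AnnotationCollector(HTMLParser):
    """Collect (paragraph id, kind, text) for each tagged span.

    Nested spans are not handled; this is a sketch, not a full parser.
    """

    def __init__(self, kinds):
        super().__init__()
        self.kinds = set(kinds)
        self.current_paragraph = None
        self.open_kind = None
        self.buffer = []
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "p":
            self.current_paragraph = attrs.get("id")
        elif tag == "span" and attrs.get("class") in self.kinds:
            self.open_kind = attrs["class"]
            self.buffer = []

    def handle_data(self, data):
        # Only keep text that falls inside a tagged span.
        if self.open_kind is not None:
            self.buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self.open_kind is not None:
            # Collapse line breaks and runs of spaces in the captured text.
            text = " ".join("".join(self.buffer).split())
            self.found.append((self.current_paragraph, self.open_kind, text))
            self.open_kind = None


def collect_annotations(html_text, kinds=("commitment", "expected-outcome")):
    parser = AnnotationCollector(kinds)
    parser.feed(html_text)
    parser.close()
    return parser.found
```

The document stays a document. The tags sit on top of the original wording rather than replacing it, and the paragraph ids give people something granular to point at when they discuss a particular commitment.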

At some point there’s either a story worth telling with the data, or there isn’t. The story may be one of success, or one of corruption, or simply one that puts TIFs in context with a city budget. But there’s no one answer about what you will get out of this information. You can’t dump all this data into a computer and have it tell you how well things are going. Structure involves a rebuilding of the data, but when we don’t know why we want to rebuild the data, when we don’t know what we want to know, I believe the more distributed notion of annotation is a better fit.