The OpenPlans Guide to GTFS Data

Data formatted using the General Transit Feed Specification (GTFS) powers practically every transit app out there. It fuels all kinds of products in our space, from trip planners to transit-oriented apartment searches. The folks at City-Go-Round tell the story of open transit data pretty well – open transit data enables the transit apps that help people get around. But here's the next question: what else can we do with that data? We mentioned a few weeks ago that almost 85 percent of the transit miles traveled in the U.S. are traveled on transit systems with open data. So let's talk about the tools that we could build to help out those transit agencies.

What's a GTFS?

GTFS is just a data format. When data follows the specification's guidelines, it's called a GTFS feed. GTFS feeds give you an incredible amount of information about a transit system's routes, stops, schedules, fares, and transfers. The beauty of it is that it starts at the most disaggregated level (the arrival and departure time at every stop of every bus) and organizes data upwards in a structure resembling a relational database, with fields and rules that connect tables the way primary and foreign keys would. That highly refined data is what a trip planner needs to tell you exactly when to leave your house to catch a bus. Hopefully the above infographic can help you if you're just starting with GTFS data.
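To make the table structure concrete, here is a minimal Python sketch of following those keys from stop_times.txt through trips.txt to routes.txt. The file and column names come from the GTFS specification; the sample rows and the helper names (read_table, arrivals_by_route) are made up for illustration.

```python
import csv
import io

# Tiny, made-up excerpts of three GTFS files.
# A real feed ships these as .txt files inside a zip archive.
ROUTES_TXT = """route_id,route_short_name,route_type
R1,10,3
"""
TRIPS_TXT = """route_id,service_id,trip_id
R1,WKDY,T1
R1,WKDY,T2
"""
STOP_TIMES_TXT = """trip_id,arrival_time,departure_time,stop_id,stop_sequence
T1,08:00:00,08:00:00,S1,1
T1,08:07:00,08:07:00,S2,2
T2,08:15:00,08:15:00,S1,1
T2,08:22:00,08:22:00,S2,2
"""


def read_table(text):
    """Parse one GTFS table (CSV text) into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def arrivals_by_route(routes_txt, trips_txt, stop_times_txt):
    """Walk up from each stop_time to its trip and then to its route."""
    routes = {r["route_id"]: r for r in read_table(routes_txt)}  # "primary key" lookup
    trips = {t["trip_id"]: t for t in read_table(trips_txt)}
    rows = []
    for st in read_table(stop_times_txt):
        trip = trips[st["trip_id"]]        # stop_times.trip_id -> trips.trip_id
        route = routes[trip["route_id"]]   # trips.route_id -> routes.route_id
        rows.append((route["route_short_name"], st["stop_id"], st["arrival_time"]))
    return rows


for row in arrivals_by_route(ROUTES_TXT, TRIPS_TXT, STOP_TIMES_TXT):
    print(row)
```

Each printed row pairs a route name with a stop and an arrival time, which is the raw material for everything that follows.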

How else can we use it?

Civic-minded programmers can build apps for more than just transit riders. As civic-minded programmers here at OpenPlans, we did just that. Over the last two months, we have been working on a project called GTFS Explore, which I started as part of my internship. If you're involved in academia, consult for transit agencies, or just like transit data, you've probably heard of the Transit Capacity and Quality of Service Manual (TCQSM). This hefty volume provides instructions on how to quantitatively review everything from the person-capacity of a rail transit line to the quality of a bus stop. The problem is that the data needed to run these analyses is often hard to get. As it turns out, GTFS feeds are a highly refined, full-coverage, up-to-date source of that data. Here are a few examples of the service coverage analyses you could run on any transit system's data.

System Level Analysis

By bringing GTFS into an actual database, we could use simple queries to find every headway in a transit system. This seems trivial, but how would you do it without GTFS? A headway is the time between successive arrivals of transit vehicles on one route at a specific stop. We averaged the headways throughout an entire day for each route-stop in a bus system, leaving us with 44,800 data points. Imagine counting those off a PDF of the schedule. Looking at the distribution of these, we can see some interesting characteristics. For one thing, there are very few stops where a route averages 10 minutes or less for most of the day. Two common values are 15 to 20 minutes and 60 minutes, which makes sense when you think about how some routes offer hourly service all day and others might have a mix of half-hour and shorter headways during peaks.
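Here is a rough sketch of that kind of query. It is not the code behind GTFS Explore: it uses Python's built-in sqlite3 module as a stand-in for a real database server, and the function names, table layout, and sample rows are assumptions for illustration. It also ignores service_id and direction_id, which a real analysis would need in order to separate weekday from weekend service and inbound from outbound trips.

```python
import sqlite3
from collections import defaultdict


def to_seconds(hms):
    """Convert a GTFS HH:MM:SS time to seconds (hours may exceed 24 for late trips)."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def average_headways(trips, stop_times):
    """Return {(route_id, stop_id): average headway in minutes}.

    trips:      iterable of (trip_id, route_id)
    stop_times: iterable of (trip_id, stop_id, arrival_time as HH:MM:SS)
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE trips (trip_id TEXT PRIMARY KEY, route_id TEXT)")
    db.execute("CREATE TABLE stop_times (trip_id TEXT, stop_id TEXT, arrival_secs INTEGER)")
    db.executemany("INSERT INTO trips VALUES (?, ?)", trips)
    db.executemany(
        "INSERT INTO stop_times VALUES (?, ?, ?)",
        [(trip_id, stop_id, to_seconds(t)) for trip_id, stop_id, t in stop_times],
    )
    # Join each arrival to its route and sort so successive arrivals are adjacent.
    query = """
        SELECT t.route_id, st.stop_id, st.arrival_secs
        FROM stop_times AS st
        JOIN trips AS t ON t.trip_id = st.trip_id
        ORDER BY t.route_id, st.stop_id, st.arrival_secs
    """
    arrivals = defaultdict(list)
    for route_id, stop_id, secs in db.execute(query):
        arrivals[(route_id, stop_id)].append(secs)
    db.close()

    averages = {}
    for key, times in arrivals.items():
        gaps = [(later - earlier) / 60 for earlier, later in zip(times, times[1:])]
        if gaps:  # a route-stop with a single arrival has no headway
            averages[key] = sum(gaps) / len(gaps)
    return averages


sample_trips = [("T1", "R1"), ("T2", "R1"), ("T3", "R1")]
sample_stop_times = [
    ("T1", "S1", "08:00:00"),
    ("T2", "S1", "08:15:00"),
    ("T3", "S1", "08:45:00"),
]
print(average_headways(sample_trips, sample_stop_times))
```

The sample route has gaps of 15 and 30 minutes at stop S1, so its average headway there is 22.5 minutes.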

One thing to keep in mind is the importance of your decisions about aggregation. As I mentioned, there are many ways to look at this data. Consider a route that runs buses every 15 minutes for two hours in the morning peak and two hours in the evening peak, but only hourly for the rest of the day. An average headway would be 40 minutes, which is pretty misleading: it's far worse than the 15-minute peak service and better than the hourly off-peak service, and no bus on the route actually runs every 40 minutes. There's no easy answer when you're talking about boiling it down to one number. What would standard deviations do for us? Is it better to take median values? A warning: think hard about aggregation before you summarize your data. We never used to have data at this refined a level, so we have to be responsible with how we use it.
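A quick way to see the problem is to write out the headways for that hypothetical route and summarize them a few ways. The sketch below assumes the route runs around the clock (24 hours). That is an assumption added here, and it is the case in which the plain average works out to 40 minutes; a shorter service day would give a different number. The function name and its parameters are illustrative only.

```python
from statistics import mean, median, pstdev


def headway_summary(peak_hours=4, peak_headway=15, offpeak_headway=60, service_hours=24):
    """Summarize the headways (in minutes) of a route with two service levels.

    Assumes `service_hours` of service per day, of which `peak_hours` run at
    `peak_headway` and the remainder at `offpeak_headway`.
    """
    peak_count = peak_hours * 60 // peak_headway
    offpeak_count = (service_hours - peak_hours) * 60 // offpeak_headway
    headways = [peak_headway] * peak_count + [offpeak_headway] * offpeak_count
    return {
        "mean": mean(headways),
        "median": median(headways),
        "std_dev": pstdev(headways),  # population standard deviation
    }


summary = headway_summary()
print(summary)
```

Under these assumptions the mean is 40 minutes, the median is 60, and the standard deviation is about 22 minutes. None of the three, on its own, tells you that the route has two very different service levels.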

Multi-Agency Comparisons

I have a hunch that there won't be many agencies that use database queries for an introspective analysis of their own services. For one thing, many agencies have custom tools and custom data to do that on their own. But as a grad student, I am often curious about how agencies compare to one another and about national trends. What are the longest bus routes in the U.S.? What agency has the best service frequencies? How does average stop spacing correlate with land use density? Is route directness a function of street connectivity in suburban neighborhoods? GTFS is one component of these analyses, and it can really speed up the process of gathering data from around the country.

We wanted to show the power of batch processing and of extending our methods with this next run. We computed an agency-average headway (the average of all those 45k data points from above) for each of fifty large agencies and plotted the results to spot trends by mode. Bus systems exhibit some of the longer headways, matched only by long-haul commuter rail systems. Metros and subways all stay at or below 20-minute headways throughout the day, speaking to the service quality that is ensured with such high investment in the mode (you rarely see a city that has built a subway and doesn't run it with headways of 20 minutes or better). There's a LOT of data behind each of these graphics; this is a simple example that really doesn't do justice to the fidelity of the data used here.

The method: a shell script ran each of the feeds through a set of tools (GTFS Transformer, GTFS SQL Importer, and PostgreSQL) to generate output files that we pulled together in R. Doing that fifty times with some of the largest agencies (NY, Chicago, Boston, LA, etc.) took 6 hours on a remote server with 15 GB of RAM.
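If you want to try the batch idea without that toolchain, here is a much smaller stand-in. It is not the shell script or the GTFS Transformer / GTFS SQL Importer / PostgreSQL pipeline described above; it is a Python sketch, using only the standard library, that loops over a set of feed zip files and counts the rows in each stop_times.txt. The function names and the feed labels are hypothetical.

```python
import csv
import io
import zipfile


def count_stop_times(feed):
    """Count the data rows of stop_times.txt in one GTFS zip.

    `feed` can be a file path or a binary file-like object.
    """
    with zipfile.ZipFile(feed) as zf:
        with zf.open("stop_times.txt") as raw:
            # utf-8-sig tolerates the byte-order mark some feeds include.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            return sum(1 for _ in csv.DictReader(text))


def summarize_feeds(feeds):
    """Run the same step over many feeds: {label: zip path or file object}.

    A feed that is not a valid zip, or has no stop_times.txt, is recorded as
    None so that one bad download does not stop the whole batch.
    """
    summary = {}
    for label, feed in feeds.items():
        try:
            summary[label] = count_stop_times(feed)
        except (KeyError, zipfile.BadZipFile):
            summary[label] = None
    return summary
```

You would call it with something like summarize_feeds({"agency_a": "agency_a.zip"}), where the label and file name are placeholders, and swap the row count for whatever per-feed analysis you want to repeat.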

Build this next:

This was just the beginning of the opportunities for using GTFS as an analytic tool. One idea I'd like to see come to life is an auto-generated infographic that resembles the beauty of the faberNovel infographic on New York City transit ridership. It would let you click on any major city and generate the data summaries that you've read about in this post, along with other indicators. Maybe it runs a geographic analysis of transit-accessible areas and shows you a shadow of the service area; maybe it takes inspiration from the race and ethnicity maps by using census data; perhaps it shows the Foursquare places near the most popular stop in the system. There are many fun and helpful analytics that we can build with this data beyond transit trip planning.

What are your ideas? What can you build with this?