20

I'm working with some GTFS data and would like to be able to create a list of all stops associated served by a route. I don't really understand how to do with with GTFS data.

Trips.txt comes in a format like this:

route_id,service_id,trip_id,trip_headsign,direction_id,block_id,shape_id 1,A20120610WKD,A20120610WKD_000800_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_002700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_004700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_006700_1..S03R,SOUTH FERRY,1,,1..S03R 1,A20120610WKD,A20120610WKD_008700_1..S03R,SOUTH FERRY,1,,1..S03R

I tried reading in the matching shape using the shape_id and then looking for stops with matching latitudes and longitudes but that doesn't seem to work reliably. Does anybody know how to do this?

Cam Saul
  • 1,706
  • 1
  • 16
  • 24
  • 1
    That's a question that would fit perfectly with http://area51.stackexchange.com/proposals/49339/open-transportation-technology – gcamp Jan 27 '13 at 01:11

6 Answers6

46

As you've noticed, there isn't a direct relationship between routes and stops in GTFS. Instead, stops are associated with trips, where each trip represents a single "run" of a vehicle along a particular route. This reflects the fact a route does not necessarily serve every one of its stops at all times—on weekends it might skip stops outside a high school, for instance.

So getting a list of every stop served by a route involves combining several models:

  • routes.txt gives you the route ID for the route you're interested in.
  • trips.txt gives you a set of trip IDs for that route.
  • stop_times.txt gives you a set of stop IDs for the stops served on each of these trips.
  • stops.txt gives you information about each of these stops.

Assuming you're using an SQL database to store your GTFS data, you might use a query like this (once you've obtained the route ID):

SELECT stop_id, stop_name FROM stops WHERE stop_id IN (
  SELECT DISTINCT stop_id FROM stop_times WHERE trip_id IN (
    SELECT trip_id FROM trips WHERE route_id = <route_id>));

Remember, though, this will output a record for every stop that is ever served by the route. If you're generating schedule information for a rider you'll probably want to limit the query to only trips running today and only stop times with departures in, say, the next thirty minutes.


Update: I wrote the above SQL query the way I did as I felt it most simply illustrated the relationship between the GTFS models, but btse is correct (in his answer below) that a query like this would never actually be used in production. It's too slow. You would instead use table joins and indices to keep query times reasonable.

Here is an equivalent query, written in a way more suited to being copied and pasted into a real application:

SELECT DISTINCT stops.stop_id, stops.stop_name
  FROM trips
  INNER JOIN stop_times ON stop_times.trip_id = trips.trip_id
  INNER JOIN stops ON stops.stop_id = stop_times.stop_id
  WHERE route_id = <route_id>;

Typically you would also create an index for each column used in a JOIN or WHERE clause, which in this case would mean:

CREATE INDEX stop_times_trip_id_index ON stop_times(trip_id);

CREATE INDEX trips_route_id_index ON trips(route_id);

(Note that RDBMSes normally index each table by its primary key automatically, so there is no need to explicitly create an index on stops.stop_id.)

Many further optimizations are possible, depending on the specific DBMS in use and your willingness to sacrifice disk space for performance. But these commands will yield good performance on virtually any RDBMS without needlessly sacrificing clarity.

  • thank you. I should have been able to figure this out on my own, but I probably had about 100 lines of code to do what this 3 lines of sql does. – Alex Muro Feb 25 '14 at 20:41
  • This answer is partially correct. Yes you will get all the stops in a route (which is what the OP asked) but many routes have different branches and this query will return all branches at the same time. Still figuring out how to split branches myself. – Julian Feb 23 '15 at 21:00
  • 2
    Great JOIN query illustrating the relationships between tables in GTFS. However as @Julian has pointed out it falls short on determining branches. Also need to determine stop_sequence somehow as well. It would be great if you could append to the answer some details on how to best spit branches and determine sequence. Thanks! – AlexVPerl Aug 27 '15 at 22:59
  • This answer is good, but I have one question: where to get `routes.txt`, `trips.txt`, `stop_times.txt`, `stops.txt` files, for the region of consideration (East England in my case)? – mercury0114 Nov 08 '16 at 15:30
  • The limit with this solution is that you may not have the stops in the correct order. For exemple First trip = A : 8h00 | B : 8h10 | C : 8h20 ... Second Trip = B : 9h10 | E : 9h20 ... You know that E is after B, but you don't know if E is before or after C – david CHOLLEZ Nov 20 '22 at 20:23
11

I came across this post in my Google searches and I figured I would update it with a better answer in case anyone else stumbles upon it. The answer that Simon gave is 100% correct, however, the query he provided is quite slow for large GTFS feeds. Here is a query that does the same thing, but performs significantly faster.

Just to give you some anecdotal evidence, for a GTFS feed of about 50mb, Simon's query took anywhere from 10-25 seconds to complete. The statement below takes consistently < 0.2 seconds.

SELECT T3.stop_id, T3.stop_name 
FROM trips AS T1
JOIN
stop_times AS T2
ON T1.trip_id=T2.trip_id AND route_id = <routeid>
JOIN stops AS T3
ON T2.stop_id=T3.stop_id
GROUP BY T3.stop_id, T3.stop_name

UPDATE:

I realized I didn't mention this before, but of course you will want to have indexes where each of the tables are being joined.

btse
  • 7,811
  • 3
  • 25
  • 30
  • Could you please tell me how structure of the tables: stops, stop_times and trips should look like? Both queries take a lot of time to execute (20 sec or so) – dargod Sep 08 '13 at 14:21
  • @dargod The structure of the tables should follow the same structure that Google outlines in their developer guide which can be found here. https://developers.google.com/transit/gtfs/reference. You also want to make sure to have an Index at each column that is used for a join or used to make a selection. – btse Sep 09 '13 at 20:35
  • Thank you very much! So based on your query, I should have indexes at: stop_id, stop_name, trip_id, route_id? – dargod Sep 10 '13 at 20:03
4

If you GROUP BY shape_id when selecting from trips you can make the query even faster.

Using @btse's query to get the unique stops for two routes takes 1.147s.

My equivalent query takes 0.4s.

SELECT unique_stops.route_id, unique_stops.stop_id, stop_name, stop_desc, stop_lat, stop_lon
FROM
  stops,
  (SELECT stop_id, route_id
   FROM
     stop_times,
     (SELECT trip_id, route_id
      FROM trips
      WHERE route_id IN (801, 803)
      GROUP BY shape_id
     ) AS unique_trips
   WHERE stop_times.trip_id = unique_trips.trip_id
   GROUP BY stop_id) AS unique_stops
WHERE stops.stop_id = unique_stops.stop_id
Luqmaan
  • 2,052
  • 27
  • 34
0

If you're working in R you could do this to find routes that stop at your target destination X:

require(dplyr)

routesX <- routes %>%
  left_join(trips %>% select(trip_id, route_id, shape_id)) %>%
  left_join(stop_times %>% select(trip_id, stop_id)) %>%
  semi_join(stops %>% filter(grepl('X', stop_name, ignore.case = T)), by = c('stop_id' = 'stop_code')) %>%
  select(names(routes), shape_id) %>%
  unique 
clancy
  • 182
  • 1
  • 3
  • 10
0

If the direction of the stop is needed, a change in Lukmaan's answer should be done:

SELECT unique_stops.route_id, unique_stops.stop_id, stop_name, stop_desc, stop_lat, stop_lon, unique_stops.direction_id
FROM
  stops,
  (SELECT stop_id, route_id, direction_id
   FROM
     stop_times,
     (SELECT trip_id, route_id, direction_id
      FROM trips
      WHERE route_id IN (801, 803)
      GROUP BY direction_id
     ) AS unique_trips
   WHERE stop_times.trip_id = unique_trips.id
   GROUP BY stop_id, direction_id) AS unique_stops
WHERE stops.stop_id = unique_stops.stop_id

If you add also stop_times.stop_sequence the same way, and order by direction and stop_sequence, the stops will be sorted as they are in the trip.

True Soft
  • 8,675
  • 6
  • 54
  • 83
-1

If you use "onebusaway", there is a quick way to do this without touching GTFS

Lets say you want to know the bus stops for bus route "M1" in Manhattan, NYC

http://bustime.mta.info/api/where/stops-for-route/MTA%20NYCT_M1.json?key=yourapikey&includePolylines=false&version=2

will give you a json feed then you can extract bus stops for both direction on route M1.

hatirlatici
  • 1,598
  • 2
  • 13
  • 24