How to store public-transportation data

Question

I am currently trying to implement my own public transportation path-finder to find connections by tram/bus etc. with given timetables. All the data is generated by me (by simply adding stops coordinates from google map). Thanks to it, I am free to choose my own way of storing data and processing them. The whole transportation network is represented by a weighted graph. So here comes the question: how to store public transportation data in standard SQL database so that it could be easily processed by some chosen algorithms? How to easily convert it later to the form of time-expanded graph so that simple Dijkstra algorithm would suffice?

Related: [Pathfinding (routing, trip planning, …) algorithms on graphs with time restrictions](http://stackoverflow.com/q/7245840/572670), The linked thread discusses how to find shortest path in such data, which seems related (though not a dupe!) to what you are after. — amit, Mar 19 '15 at 11:44
As an aside, [A*](http://en.wikipedia.org/wiki/A*_search_algorithm) *might* be a better route searching algorithm, as this is the basis for most journey planning systems... — Rowland Shaw, Mar 19 '15 at 11:44
Yup, I am aware that Dijkstra is slow. However, I need any simple algorithm as a benchmark for more sophisticated ones. — TheMP, Mar 19 '15 at 12:47

dube · Accepted Answer · 2023-02-14T23:55:19.310

Since I wrote a bachelor thesis about "how far can you get in X minutes using only public transport", I can share some insight on your problem.

The Problem

First and foremost, forget about traditional algorithms. Go for time-aware ones. What works for regular road networks, totally fails on time restricted graphs. Time awareness is one problem, but there are many more which you would never think about as usual passenger

some trains are guaranteed to wait for other trains
you have some minimum downtime when switching trains (from train a to b)
the downtime depends on the station (big or small stations) and the platform (switching from platform 1 to 2 is faster than 1 to 20)
train schedules differ depending on the date, your data table has a lot of entries which do not apply to your selected date.

Solutions

Judging by your nickname I assume you are able to read german. You can read my analysis of algorithms and my database setup in my thesis document. Page 49 shows the database model and page 50 the inmemory model. Also take a look at the references on page 55-57 (they are mostly english).

Even time-aware-dijkstra is really slow. A good algorithm is RAPTOR (scientific description with example can be found here). RAPTOR makes use of the timetable patterns instead of being hindered by it.

How RAPTOR works: Timetables mainly consist of stations (location), rides (one run of a single train), stops (combo of ride,time,location). The trick of raptor is to organize all rides into routes. A route may only contain rides which have the same sequence of stops and do not overtake each other. This way you can take the first ride of a route which matches your time and omit checking all the other rides on the same route (because they are guaranteed to arrive later). The amount of routes is considerably smaller (factor 1000 in my data) than the amount of rides. In addition, the RAPTOR algorithm runs in "train-hopping-cycles", which enables it to check a single station exactly once (dijkstra cannot) and to follow

It goes like this:

reachableStations = (startStation,timeX)
for i=0; i < maxHops; i++ {
     if( reachableStations contains targetStation)
         finished!
     usableRoutes       = collect all routes leaving from reachableStations 
     reachableStations  = follow all usableRoutes to the end 
                          and collect stations for the next cycle.
                          keep station-time combos for each find.
}

.

For my application I used a modified version of RAPTOR, which I named REAS. It's optimized for getting data about the complete network (instead of looking up a single route). You should simply stick to RAPTOR. One of the key learnings about algorithms in public transport is that the problem is a lot harder than it looks.

My app can be seen here. It uses the HAFAS data of the swiss railway company (SBB & ZVV) but currently the data is only from 2014. The calculation itself is fast, what takes a lot of time is generating the graphical overlay.

If you have more questions, feel free to contact me directly (contact info is available on ttm.ti8m.ch)

Thank you a lot for your suggestions. Unfortunately, my German is too poor to cope with Bachelor's Thesis language, although your article about RAPTOR algorithm is really helpful. I have been thinking about implementing TRANSIT algorithm (based on Dijsktra), but yours may work a whole lot better. I will of course contact with you if any more questions will arise in upcoming months and thank you for this possibility. — TheMP, Mar 19 '15 at 15:17
Dube, Have you considered RRaptor I am stuck on how exactly this might be implemented and how similar routes i.e from different companies leaving in the same /exact time range can be considered in the solution? — , Mar 27 '15 at 20:21
I'm not sure if RRaptor is a typo (did you mean Raptor)? I successfully used Raptor. Route is an virtual concept, it does not care about which company a ride belongs to. You need to build a route table in your database and build the routes yourself. I did that with an java program, which loaded all rides and then wrote to the routes to the DB. The rules for membership of a route are: all rides on a route have exactly the same sequence of stops. No ride on the same route may overtake an other ride. If one of those 2 rules is not true, then the ride has to be in an other/new route. — dube, Mar 30 '15 at 08:14
Here is the code for the route builder: http://pastebin.com/xd5RKkuB It's specific to a MySQL Database and almost standalone, all you need to build yourself is the ConnectionManager (which initializes and closes DB connections). The trick is to loop through all rides and look at their first station to find candidates for a route. — dube, Mar 30 '15 at 08:24
Yes sorry for the typo I meant rRaptor which stands for Range Queries: rRAPTOR as indicated in the link you provided. Have you tried to implement this? — , Mar 30 '15 at 08:39
Oh! Yes, thats what I did. I combined it with a modification to the abortion-criteria, since I wanted to find out how far I can get, the "where to" was not important. All variants of Raptor are based on this routing-table, which you need to build. It's the thing that makes Raptor superior to other algorithms. — dube, Mar 30 '15 at 08:58
Lol dude love ur comment on revival doom on the other post... Anyways so bottom line is your going to need a hella of alot more information then what some transport i.e. gtfs etcdata provides.. Stop grouping, inter charges transfer matrix. Etc.. — Superfy, Jun 01 '15 at 11:49
@dube is there any chance for you to opensourcing or showing your RAPTOR code? — Esse, Feb 10 '16 at 15:00
Good question! I cannot give you a yes/no right away. I'm not sure what the legal basis is here, since my university and my employer potentially do have a legal stake here. But if you need it _right now_ you can reverse-engineer it from my bachelor thesis, page 42 (it's a to read this math stuff, I know. Sorry.) — dube, Feb 10 '16 at 15:24

How to store public-transportation data

1 Answers1

The Problem

Solutions

Linked