Dataframe iteration runs forever

Question

I would like to create the link list of the transport routes from a GTFS database snapshot. Essentially, I have to create a list of bus lines, where you can switch lines through a common stop.

I have the stops precleaned in a dataframe:

stops = pd.DataFrame({'stop_id':  ['002133', '003002', '003118', '003209', '004521', '004716', '004903', '006390', '007177', '007289'], 
                      'stop_name':['Örs vezér tere M+H, déli tárolótér', 'Puskás Ferenc Stadion', 'Óbuda, Bogdáni út', 'Batthyány tér', 'Puskás Ferenc Stadion', 'ÉD metró járműtelep,porta', 'Örs vezér tere', 'Cinkota kocsiszín', 'Csepel kocsiszín', 'Vihar utca'],
                      'location': [(47.500366, 19.1357), (47.500368, 19.103406), (47.551471, 19.041971), (47.506776, 19.039318), (47.50017, 19.104773), (47.469651, 19.12909), (47.503585, 19.137192), (47.519345, 19.217072), (47.421498, 19.066247), (47.434399, 19.035664)]})

... and I tried to generate a new list of dictionaries, which compares every row to every row of the dataframe:

data = [{'stop1': row1[1],
        'stop2': row2[1],
        'stop1_id': row1[2],
        'stop2_id': row2[2],
        'stop1_location': row1[0],
        'stop2_location': row2[0],
        } for row1 in zip(stops.location,
                          stops.stop_name,
                          stops.stop_id) \
        for row2 in zip(stops.location,
                        stops.stop_name,
                        stops.stop_id) \
        if (row1[1] != row2[1] and geopy.distance.geodesic(row1[0], row2[0]).meters < 500) or row1[1] == row2[1]]

the output would be similar to this:

stop1, stop2, stop1_id, stop2_id, stop1_location, stop2_location
'Örs vezér tere M+H, déli tárolótér', 'Örs vezér tere', '002133', '004903', (47.500366, 19.1357), (47.503585, 19.137192)

If two stops has different names, and their distance is less than 500 meters OR their names are the same THEN append their properties to a list of dicts.

The problem is that this runs forever, regardless of whether the board has only 6000 rows. How can I optimize the code, so it doesn't run for ever? Is there a better solution for this?

My final goal would be to do a graph visualization of them with networkx.

"I think the goal is described well by the code" I often think this, then realize that's only because I'm actively working on the problem in question. Please take a look at [How to make good pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) and [edit] to include a sample of your input data and your expected output based on that input to make a [mcve] — G. Anderson, Nov 10 '22 at 16:58
This is an example of a time not to use list comprehension IMO. How far does it make it? Also, what is the board? Is it `stops.location`? Because if so, you're iterating over it twice which isn't 6000 then, it's 6k*6k=36m — Chrispresso, Nov 10 '22 at 17:01
Also how long does `geopy.distance.geodesic(row1[0], row2[0])` take on a single entry? — Chrispresso, Nov 10 '22 at 17:04
@Chrispresso yes, it's a nested loop, so its 36 million. The geodesic distance calculation is 120 µs ± 640 ns per loop. — Erik Gebhard, Nov 10 '22 at 17:14
"The problem is that this runs __forever__" is dubious. There's nothing I can see here that would put this into an infinite loop. It's much more likely that Chrispresso is correct, and you'll need a better algorithm. — Adam Hoelscher, Nov 10 '22 at 17:14
@AdamHoelscher sorry for the poor wording, it's not forever, but takes a lot of time (i did not wait for it to end). I understand that I would need a better algorithm, but how? — Erik Gebhard, Nov 10 '22 at 17:17
@ErikGebhard 36E6*120E-6 is still going to take 1.2 hours. Is that forever for this or was it running longer? Also you are currently double counting stops (A, B) since you're not checking if that location has already been found. You're really looking for the upper/lower diagonal of a matrix. So you should only have to do this for ~18m instead of the 36m you're checking. That being said, you can check if the `set([row[1], row[2]])` has already been found and skip it. That should help for starters. — Chrispresso, Nov 10 '22 at 17:18
Actually, an even easier way.... go from `i` from `[0, len(stops.location)-1)` and then `j` from `[i+1, ...]` and that should be just the upper diagonal for you. — Chrispresso, Nov 10 '22 at 17:25
@ErikGebhard: 2 ideas jump to mind. 1) You're worried about geodesic distance, which is harder to deal with than Euclidian distance, but similar techniques may be helpful. 2) There are probably medium difficulty ways to sort/filter pairs of points so that you don't have to calculate the distance for every pair. https://stackoverflow.com/questions/56214082/given-coordinates-of-points-find-all-pairs-of-points-that-exist-within-a-certai may help. — Adam Hoelscher, Nov 10 '22 at 17:27
@ErikGebhard you should also take a look [here](https://stackoverflow.com/questions/35296935/python-calculate-lots-of-distances-quickly). Seems like a couple answers there might solve your problem — Chrispresso, Nov 10 '22 at 17:32

Dataframe iteration runs forever

0 Answers0