Python create all combinations of data points and filter them based on function

Question

I have a table of locations (currently in a DataFrame) and want to calculate all combinations and their distance from each other.

Input:

ID	Lat	Lon
1	6,4355	53,2245
2	5,3434	50,2345
3	4,3434	51,2345

Desired Outcome:

ID1	ID2	distance
1	1	0
1	2	1
1	3	2
2	1	0
2	2	3
2	3	4
3	1	0
3	2	5
3	3	6

from math import radians, sin, cos, atan2, sqrt

def distance(lat1, lon1, lat2, lon2):
    # convert degrees to radians
    lat1 = radians(lat1)
    lon1 = radians(lon1)
    lat2 = radians(lat2)
    lon2 = radians(lon2)
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    R = 6373.0  # approximate Earth radius in km
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    c = 2 * atan2(sqrt(a), sqrt(1 - a))
    return round(R * c)  # haversine distance in whole km

Right now I loop through the DataFrame twice in such an ugly way that I'm not even going to show it, but it works. The problem is that it is terribly slow when the table gets big, and I know there must be a faster way to do this.

I would like to do this in standard Python/pandas/NumPy (as long as it's fast and I don't have to use obscure packages!). Any help would be much appreciated. Oh, and I want to filter on distance < 10 km, forgot to add!

Here is my current code that I want to improve:

import pandas

df_distance = pandas.DataFrame(columns=['ID1', 'ID2', 'distance'])

# first, pair every ID with itself
for index, row in df.iterrows():
    df_new_row = pandas.DataFrame([{'ID1': row['ID'], 'ID2': row['ID'],
                                    'distance': 0, 'lat1': row['Lat'], 'lon1': row['Lon'],
                                    'lat2': row['Lat'], 'lon2': row['Lon']}])
    df_distance = pandas.concat([df_distance, df_new_row])


for index1, row1 in df.iterrows():
    for index2, row2 in df.iterrows():
        if index2 > index1:
            dist = distance(row1['Lat'], row1['Lon'], row2['Lat'], row2['Lon'])
            if dist <= 10:  # keep pairs that are at most 10 km apart
                # add both directions
                df_new_row = pandas.DataFrame([{'ID1': row1['ID'], 'ID2': row2['ID'],
                                                        'distance': dist, 'lat1': row1['Lat'], 'lon1': row1['Lon'],
                                                        'lat2': row2['Lat'], 'lon2': row2['Lon']},
                                               {'ID1': row2['ID'], 'ID2': row1['ID'],
                                                'distance': dist, 'lat1': row2['Lat'], 'lon1': row2['Lon'],
                                                'lat2': row1['Lat'], 'lon2': row1['Lon']}
                                               ])
                df_distance = pandas.concat([df_distance, df_new_row])

Is the table static, or are values being added to it during runtime? — Yuv_c, May 12 '22 at 07:23

Guaz · Answer 1 · 2022-05-12T07:35:10.337

Generally, use `from itertools import combinations`.

An example for this case:
>>> from itertools import combinations
>>> a = [1, 2, 3, 4, 5]
>>> for c1, c2 in combinations(a, 2):
...     print(c1, c2)
...
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5

Of course, you can use 'key', list comprehensions, etc. to get the correct values depending on your input, but programming is still about solving puzzles - you now have everything you need :)
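As a rough sketch of how `combinations` could be applied to the question's data: the snippet below reuses the `distance` function from the question and runs it over each unordered pair once, then stores both directions. The sample coordinates and the names `points`, `rows` and `MAX_KM` are made up for illustration. Collecting plain tuples in a list and building the DataFrame once at the end should also avoid the repeated `pandas.concat` calls from the original loop.

```python
from itertools import combinations
from math import radians, sin, cos, atan2, sqrt

def distance(lat1, lon1, lat2, lon2):
    """Haversine distance in whole km, same formula as in the question."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    return round(6373.0 * 2 * atan2(sqrt(a), sqrt(1 - a)))

# Hypothetical sample points as (ID, lat, lon) in degrees; with a DataFrame
# these could come from df[['ID', 'Lat', 'Lon']].itertuples(index=False).
points = [(1, 53.2245, 6.4355), (2, 53.2300, 6.4400), (3, 51.2345, 4.3434)]

MAX_KM = 10  # hypothetical name for the 10 km filter
rows = []
for (id1, lat1, lon1), (id2, lat2, lon2) in combinations(points, 2):
    dist = distance(lat1, lon1, lat2, lon2)
    if dist <= MAX_KM:
        # store both directions, as the original loop does
        rows.append((id1, id2, dist))
        rows.append((id2, id1, dist))

# rows can then be turned into a DataFrame in one go, e.g.
# pandas.DataFrame(rows, columns=['ID1', 'ID2', 'distance'])
print(rows)
```

This still calls `distance` once per pair in pure Python, so it mainly removes the duplicated work and the per-row DataFrame building.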

A small off-topic warning:

Calculating all orderings of the points (full routes) is O(n!), which basically means that if you have more than about 30 points (depending on your computer), you can forget about finishing it in your lifetime. Calculating each pair is fine, though, depending on how many points you have: it's just O(n^2) :)
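To put numbers on the difference, here is a small arithmetic sketch. It assumes Python 3.8+ for `math.comb`, and `n = 500` is the number of locations mentioned in the comments of this thread.

```python
from math import comb, factorial

n = 500  # number of locations mentioned in the comments

unordered_pairs = comb(n, 2)   # each pair counted once: 124750
ordered_pairs = n * n          # every ID against every ID, including itself: 250000
routes_30 = factorial(30)      # number of orderings of just 30 points

print(unordered_pairs, ordered_pairs)
print(f"{routes_30:.2e}")      # factorial growth dwarfs the pair counts
```

A few hundred thousand distance evaluations are a small workload, so the pair count alone does not explain a half-hour run time.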

@Edit: Generally you won't reduce the O(n^2) complexity, but building a two-dimensional NumPy matrix and calculating the distances across that structure will speed up the process a lot, because NumPy runs the arithmetic in compiled, vectorized loops over contiguous arrays, which uses the processor cache far better than a regular Python loop. Keep in mind that if the data exceeds your RAM it will be slow anyway, so for large inputs keep the data as compact as possible and don't hold on to anything unnecessary.
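A minimal sketch of that matrix approach, assuming the coordinates are available as NumPy arrays in degrees (for a DataFrame, something like `df['Lat'].to_numpy()`). The function name `pairwise_haversine` and the sample values are made up for illustration; the formula is the one from the question, without the rounding.

```python
import numpy as np

def pairwise_haversine(lat_deg, lon_deg, radius_km=6373.0):
    """Return the full n x n matrix of haversine distances in km."""
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    # Broadcasting a column against a row gives every pairwise difference.
    dlat = lat[:, None] - lat[None, :]
    dlon = lon[:, None] - lon[None, :]
    a = (np.sin(dlat / 2) ** 2
         + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)
    a = np.clip(a, 0.0, 1.0)  # guard against tiny floating-point overshoot
    return 2 * radius_km * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

# Hypothetical sample data
ids = np.array([1, 2, 3])
lat = np.array([53.2245, 53.2300, 51.2345])
lon = np.array([6.4355, 6.4400, 4.3434])

dist = pairwise_haversine(lat, lon)
i, j = np.nonzero(dist <= 10)              # index pairs at most 10 km apart
pairs = np.column_stack([ids[i], ids[j]])  # map positions back to IDs
pair_dist = dist[i, j]
print(pairs, pair_dist)
```

The result includes each ID paired with itself at distance 0 and both directions of every close pair, which matches the layout of the desired output in the question. The full matrix needs memory for n x n floats, which is fine for 500 points but grows quickly.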

Another thing you might consider is the usual approach of spreading the work over several threads: split the data, process the parts in parallel, and merge the results at the end.
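One possible way to split and merge, sketched under assumptions: the rows are cut into blocks, each block is compared against all points in a worker thread, and the partial results are concatenated. The helper names `block_pairs` and `close_pairs`, the block size and the worker count are all hypothetical. As far as I know NumPy releases the GIL during many array operations, but how much threads actually help here is something to measure. It also assumes at least one input point.

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

R_KM = 6373.0  # Earth radius used in the question

def block_pairs(lat, lon, start, stop, max_km):
    """Close pairs (i, j) for rows start <= i < stop against all points."""
    lat_b = lat[start:stop, None]
    lon_b = lon[start:stop, None]
    dlat = lat_b - lat[None, :]
    dlon = lon_b - lon[None, :]
    a = np.sin(dlat / 2) ** 2 + np.cos(lat_b) * np.cos(lat[None, :]) * np.sin(dlon / 2) ** 2
    a = np.clip(a, 0.0, 1.0)
    d = 2 * R_KM * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
    i, j = np.nonzero(d <= max_km)
    return i + start, j, d[i, j]  # shift block-local rows to global rows

def close_pairs(lat_deg, lon_deg, max_km=10.0, block=256, workers=4):
    lat = np.radians(np.asarray(lat_deg, dtype=float))
    lon = np.radians(np.asarray(lon_deg, dtype=float))
    bounds = [(s, min(s + block, len(lat))) for s in range(0, len(lat), block)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map keeps the blocks in order, so the merge is a plain concatenate
        parts = list(pool.map(lambda b: block_pairs(lat, lon, b[0], b[1], max_km), bounds))
    i, j, d = (np.concatenate(cols) for cols in zip(*parts))
    return i, j, d

# Hypothetical sample data; a tiny block size just to exercise the splitting
lat = [53.2245, 53.2300, 51.2345]
lon = [6.4355, 6.4400, 4.3434]
i, j, d = close_pairs(lat, lon, block=2, workers=2)
print(i, j, d)
```

Even without a speedup from threads, working block by block keeps peak memory at roughly block x n values instead of n x n.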

You can also find books via Google about optimizing your code in 'low-level' ways :)

Thanks man! Right now I have 500 locations, so it goes up fast; doing the distance calculation took me about half an hour. And I have to filter on a max distance of 10 km (just added). Is looping over this combinations result list and calling distance() much faster than a nested loop? — Jop, May 12 '22 at 07:35
Unfortunately not much. Your bottleneck is the distance calculation, with sqrt being the heaviest part; when you filter the result, you can do it during the calculation. So I would say threading and NumPy are the core of what you need. You can also consider moving your calculation to C: https://towardsdatascience.com/write-your-own-c-extension-to-speed-up-python-x100-626bb9d166e7 Especially since you are rounding the result, you can implement a faster sqrt that only calculates an integer. You can also exit the calculation early once `a` is too big: the distance would be above 10 km anyway, so you can skip the sqrt that way. — Guaz, May 12 '22 at 07:39
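The idea of skipping the sqrt when `a` is too big can be sketched like this. The distance is d = 2 R asin(sqrt(a)), which only grows as `a` grows, so d <= 10 km is the same condition as a <= sin(10 / (2 R))^2. The names `A_MAX` and `MAX_KM` and the sample coordinates are made up, and the filter here works on the unrounded distance. The asin form used below is mathematically equivalent to the atan2 form in the question.

```python
import numpy as np

R_KM = 6373.0
MAX_KM = 10.0
# Threshold on `a` that corresponds to a distance of MAX_KM.
A_MAX = np.sin(MAX_KM / (2 * R_KM)) ** 2

# Hypothetical sample coordinates, converted to radians
lat = np.radians([53.2245, 53.2300, 51.2345])
lon = np.radians([6.4355, 6.4400, 4.3434])

dlat = lat[:, None] - lat[None, :]
dlon = lon[:, None] - lon[None, :]
a = (np.sin(dlat / 2) ** 2
     + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2)

# Filter on `a` directly; no sqrt or asin is needed to decide which pairs survive.
i, j = np.nonzero(a <= A_MAX)

# Only the surviving pairs get the expensive final step.
d = 2 * R_KM * np.arcsin(np.sqrt(a[i, j]))
print(i, j, d)
```

Whether this is a measurable win is worth benchmarking, since the sin and cos calls still run for every pair.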

score 0 · Answer 2 · answered May 15 '22 at 16:30

@jop I was unable to answer on my own, so I formulated your question differently; please check this solution:

Fastest way in numpy to get distance of product of n pairs in array

Filtering out every result above 10 km efficiently can easily be done with this code fragment:

import numpy as np

def filter_where(result, var=11):
    return result[result < var]  # boolean mask keeps only values below var

To load the data from a pandas DataFrame, you can follow these topics:

Selecting multiple columns in a Pandas dataframe

Convert Select Columns in Pandas Dataframe to Numpy Array

I hope it solves your performance issue; please let me know in the comments how solving the puzzle went :)

To solve the puzzle, you only need to know how to keep track of the correct indices so that you don't lose the information about which point is combined with which.
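One way to keep track of the indices, as a sketch: instead of filtering the distance values alone, take the positions of the upper triangle with `np.triu_indices`, filter those, and use them to look up the IDs. The IDs and the distance matrix below are made-up example values, and the threshold of 11 mirrors the default of `filter_where`.

```python
import numpy as np

ids = np.array([10, 20, 30, 40])  # hypothetical IDs

# Hypothetical symmetric distance matrix in km (row/column k belongs to ids[k])
dist = np.array([
    [0, 3, 50, 8],
    [3, 0, 12, 70],
    [50, 12, 0, 9],
    [8, 70, 9, 0],
])

# Positions of every unordered pair exactly once (i < j)
i, j = np.triu_indices(len(ids), k=1)

keep = dist[i, j] < 11  # same condition as filter_where(result, var=11)

# The same mask selects the IDs and the distances, so they stay aligned.
id1 = ids[i][keep]
id2 = ids[j][keep]
d = dist[i, j][keep]
print(id1, id2, d)
```

If both directions are needed, as in the question's output, the swapped pairs can be appended by concatenating `id2, id1, d` onto `id1, id2, d`.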

PS: I believe that should be at least 60 times faster than your current solution.
