
I have two DataFrames (simplified for presentation):

table1:

table1_id  lat  long  table2_id
1          5.5  45.5
2          5.2  50.2
3          8.9  49.7

table2:

table2_id  lat  long
1          5.0  47.2
2          8.5  22.5
3          2.1  33.3

Table1 has >40000 rows. Table2 has 3000 rows.

What I want is to find, for each row in table1, the table2_id of the row in table2 that is closest to its location, using latitude/longitude. For the distance calculation I am using geopy.distance.
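
For reference, this is how a single distance call looks (coordinates taken from the first rows of the sample tables above):

import geopy.distance

location1 = (5.5, 45.5)  # (lat, long) of table1 row 1
location2 = (5.0, 47.2)  # (lat, long) of table2 row 1
print(geopy.distance.geodesic(location1, location2).km)  # distance in km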

The slow way to do this is to iterate over each coordinate in table1 and, for each of those, iterate over all rows of table2 to find the minimum. This is very slow using DataFrame.iterrows() or DataFrame.apply.

It would look somewhat like this:

import geopy.distance

for idx, row in table1_df.iterrows():
    location1 = (row["lat"], row["long"])
    min_table2id = 0
    min_distance = float("inf")
    for idx2, row2 in table2_df.iterrows():
        location2 = (row2["lat"], row2["long"])
        distance = geopy.distance.geodesic(location1, location2).km
        if distance < min_distance:
            min_distance = distance
            min_table2id = row2["table2_id"]
    # write the result back into the DataFrame; assigning to the
    # row copy returned by iterrows() would be lost
    table1_df.at[idx, "table2_id"] = min_table2id

I've only done simple things on smaller datasets where speed was never a problem, but this runs for minutes, which was somewhat expected given the two nested loops over tables this large.

I am not too familiar with vectorization (I have only used it to manipulate single columns in a DataFrame) and was wondering whether there is a good way to vectorize this, or to speed it up in another way. Thanks!

TomK
  • Just sort the values before searching so you know where to stop checking. Just use the data ... Why Pandas??? Sorting will speed the calculations up. Maybe you sort your rows into a grid of squares to speed it up even further. Or use product ... and search for a minimum ... – Claudio Nov 28 '22 at 00:01
  • Your options are probably to try a nested list comprehension or a .apply(). It is almost always a good idea to avoid iterrows since this is a very slow implementation. – tim654321 Nov 28 '22 at 00:46
  • https://www.degruyter.com/document/doi/10.2478/s13537-013-0110-4/html An all-pairs shortest path algorithm for bipartite graphs – Claudio Nov 28 '22 at 01:07
  • https://stackoverflow.com/questions/68910543/minimum-distance-between-two-sets-of-points Minimum distance between two sets of points – Claudio Nov 28 '22 at 01:12
  • Are you comparing a path of an ultrarunner running 10 hours long to a path of a car driving the same path? – Claudio Nov 28 '22 at 01:17

1 Answer


As far as I understand, the geodesic function does not support vectorization, so you need to implement the distance calculation for coordinate vectors yourself. Fortunately, there are many such implementations; a very simple one is the haversine formula, used below. Here is a complete example of a solution to your question, with two dataframes created at the sizes you gave.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'df1_id': range(40_000),
                    'lat': np.random.uniform(-10, 10, size=40_000),
                    'lon': np.random.uniform(-10, 10, size=40_000)})
df1['df2_id'] = np.nan  # will hold the nearest df2_id

df2 = pd.DataFrame({'df2_id': range(3000),
                    'lat': np.random.uniform(-10, 10, size=3000),
                    'lon': np.random.uniform(-10, 10, size=3000)})

def haversine(lon1, lat1, lon2, lat2):
    # convert each argument separately so that scalars and arrays of any
    # mutually broadcastable shape are accepted
    lon1, lat1, lon2, lat2 = map(np.radians, (lon1, lat1, lon2, lat2))
    dlon = lon2 - lon1
    dlat = lat2 - lat1

    haver_formula = np.sin(dlat / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2)**2

    r = 6371  # Earth radius in km (use 3958.756 for distances in miles)
    return 2 * r * np.arcsin(np.sqrt(haver_formula))
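
# Quick sanity check with the sample coordinates from the question; haversine
# assumes a spherical Earth, so expect a small (~0.3%) deviation from
# geopy's geodesic result:
print(haversine(45.5, 5.5, 47.2, 5.0))  # roughly 196 km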


def foo(row, table2_df):
    size = table2_df.shape[0]
    # distances from this row's (lat, lon) to every row of table2
    distances = haversine([row["lon"]] * size, [row["lat"]] * size,
                          table2_df.lon.values, table2_df.lat.values)
    return table2_df.iloc[np.argmin(distances), 0]  # column 0 is df2_id

%%timeit
df1[['lat', 'lon']].apply(foo, axis=1, args=(df2,))
Out:
     14 s ± 251 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
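
Since the haversine math is pure NumPy, you can also drop apply entirely and broadcast chunks of df1 against all of df2 at once, taking the argmin per row. The following is only a sketch of that idea: the chunk size is an arbitrary choice to bound memory, and the final 2*r*arcsin(sqrt(...)) step is skipped because it is monotonically increasing, so the argmin of the haversine term already identifies the nearest point.

# Fully vectorized nearest-neighbour search (sketch). Processing df1 in
# chunks keeps each intermediate (chunk x 3000) matrix small.
lat1 = np.radians(df1.lat.values)[:, None]  # shape (40000, 1)
lon1 = np.radians(df1.lon.values)[:, None]
lat2 = np.radians(df2.lat.values)           # shape (3000,)
lon2 = np.radians(df2.lon.values)

chunk = 2000  # arbitrary; tune to available memory
nearest = np.empty(len(df1), dtype=np.int64)
for start in range(0, len(df1), chunk):
    stop = start + chunk
    dlat = lat2 - lat1[start:stop]  # broadcasts to (chunk, 3000)
    dlon = lon2 - lon1[start:stop]
    h = (np.sin(dlat / 2)**2
         + np.cos(lat1[start:stop]) * np.cos(lat2) * np.sin(dlon / 2)**2)
    # argmin over table2 for each row of the chunk; the arcsin/sqrt step
    # is monotonic, so it does not change which index is smallest
    nearest[start:stop] = np.argmin(h, axis=1)

df1['df2_id'] = df2['df2_id'].values[nearest]

I have not timed this variant, so there are no numbers to compare with the runs above.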

You can also parallelize the apply method using the parallel-pandas library. It's very simple and doesn't require you to rewrite your code.

# pip install parallel-pandas
from parallel_pandas import ParallelPandas
ParallelPandas.initialize(disable_pr_bar=True)

# p_apply is parallel analogue of apply method
%%timeit
df1[['lat', 'lon']].p_apply(foo, axis=1, args=(df2,))
Out:
     1.79 s ± 75.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Thus, the total time comes to under two seconds. I hope this is acceptable to you. Good luck!

padu