I have two DataFrames (simplified for presentation):
| table1_id | lat | long | table2_id |
|---|---|---|---|
| 1 | 5.5 | 45.5 | |
| 2 | 5.2 | 50.2 | |
| 3 | 8.9 | 49.7 | |
| table2_id | lat | long |
|---|---|---|
| 1 | 5.0 | 47.2 |
| 2 | 8.5 | 22.5 |
| 3 | 2.1 | 33.3 |
Table1 has >40000 rows. Table2 has 3000 rows.
For each row in table1, I want to find the table2_id of the table2 row whose location (latitude/longitude) is closest to it; for the distance calculation I am using geopy.distance.
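For reference, a single distance lookup with geopy looks like this (coordinates taken from the first rows of the tables above):

```python
import geopy.distance

# Geodesic distance in km between table1 row 1 and table2 row 1
dist_km = geopy.distance.geodesic((5.5, 45.5), (5.0, 47.2)).km
```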
The slow way to do this is to iterate over each coordinate in table1 and, for each one, iterate over all rows of table2 to find the minimum. That is very slow with DataFrame.iterrows() or DataFrame.apply.
It looks something like this:
```python
import geopy.distance

# For every row in table1, scan all of table2 for the closest point.
for idx, row in table1_df.iterrows():
    location1 = (row["lat"], row["long"])
    min_table2id = 0
    min_distance = float("inf")
    for idx2, row2 in table2_df.iterrows():
        location2 = (row2["lat"], row2["long"])
        distance = geopy.distance.geodesic(location1, location2).km
        if distance < min_distance:
            min_distance = distance
            min_table2id = row2["table2_id"]
    # Assigning to `row` only changes a copy; write back via .loc instead.
    table1_df.loc[idx, "table2_id"] = min_table2id
```
I've only done simple things on smaller datasets where speed was never a problem, but this runs for minutes, which is somewhat expected given the two nested loops over tables this large.
I am not very familiar with vectorization (I've only used it to manipulate single columns in a DataFrame) and was wondering whether there is a good way to vectorize this, or to speed it up some other way; a rough sketch of the kind of thing I mean is below.
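Purely for illustration, this is the direction I was imagining: a haversine distance (a spherical approximation, so not identical to geopy's geodesic) computed over all pairs with NumPy broadcasting. The function name `nearest_table2_id` is just something I made up, and I haven't verified whether the full 40,000 × 3,000 distance matrix is memory-feasible:

```python
import numpy as np

def nearest_table2_id(table1_df, table2_df):
    # Haversine distances between every table1 point and every table2
    # point via broadcasting; treats Earth as a sphere (radius in km).
    R = 6371.0
    lat1 = np.radians(table1_df["lat"].to_numpy())[:, None]   # shape (n1, 1)
    lon1 = np.radians(table1_df["long"].to_numpy())[:, None]
    lat2 = np.radians(table2_df["lat"].to_numpy())[None, :]   # shape (1, n2)
    lon2 = np.radians(table2_df["long"].to_numpy())[None, :]
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    d = 2 * R * np.arcsin(np.sqrt(np.clip(a, 0.0, 1.0)))      # (n1, n2) in km
    # For each table1 row, pick the table2_id of the closest table2 row.
    return table2_df["table2_id"].to_numpy()[d.argmin(axis=1)]

# Intended usage:
# table1_df["table2_id"] = nearest_table2_id(table1_df, table2_df)
```

Is something like this the right approach, or is there a better way? Thanks!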