I have two DataFrames. One contains several power plants with their locations as longitude and latitude, each in its own column. The other contains several substations, also with longitude and latitude. I am trying to assign each power plant to its closest substation.
import pandas as pd
df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881, 8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})
I suppose I need to calculate the distance between all the points and then group the DataFrames, but I am not sure how. I found the numpy.linalg.norm() function, but I could not get it to work for this. Any help is appreciated.
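For reference, numpy.linalg.norm can produce all pairwise distances at once when it is combined with broadcasting. The following is a minimal sketch, assuming x is longitude and y is latitude, and treating the degree values as flat planar coordinates. That is only a rough approximation, because a degree of longitude covers less ground at higher latitudes. The column names ID_ss_nearest and dist_deg are made up for illustration.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})

pp = df1[['x', 'y']].to_numpy()  # shape (n_plants, 2)
ss = df2[['x', 'y']].to_numpy()  # shape (n_substations, 2)

# Broadcasting: (n_plants, 1, 2) - (1, n_substations, 2) -> (n_plants, n_substations, 2)
diff = pp[:, None, :] - ss[None, :, :]

# Euclidean norm over the last axis gives one distance per (plant, substation) pair.
# Assumption: degrees are treated as planar units, so this is NOT a distance in km.
dist = np.linalg.norm(diff, axis=2)

# Column index of the closest substation for every plant.
nearest = dist.argmin(axis=1)

planar = df1.assign(
    ID_ss_nearest=df2['ID_ss'].to_numpy()[nearest],   # hypothetical column name
    dist_deg=dist[np.arange(len(df1)), nearest],      # hypothetical column name
)
print(planar)
```

The result keeps one row per power plant and adds the ID of the substation with the smallest planar distance.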
I found this solution, which is basically exactly what I need:
import pandas as pd
import geopy.distance

for i, row in df1.iterrows():  # A: each power plant
    a = row.y, row.x  # geopy expects (latitude, longitude); x is longitude, y is latitude
    distances = []
    for j, row2 in df2.iterrows():  # B: each substation
        b = row2.y, row2.x
        distances.append(geopy.distance.geodesic(a, b).km)
    # Smallest distance and the position of the substation it belongs to
    min_distance = min(distances)
    min_index = distances.index(min_distance)
    print("A", i, "is closest to B", min_index, min_distance, "km")
It works, but it takes forever because my dataset is quite large. I think an approach using .apply might be quicker. Does anybody have an idea how to adapt this into an .apply-based approach?
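Note that .apply still calls a Python function once per row, so it tends to give only a modest speedup over iterrows. A fully vectorized alternative is sketched below, under some assumptions: it swaps geopy's geodesic for the haversine great-circle formula on a sphere of radius 6371 km, so the results can differ slightly from the geodesic ones. The helper haversine_km and the column names ID_ss_nearest and dist_km are hypothetical names chosen for this example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'ID_pp': ['p1', 'p2', 'p3', 'p4'],
                    'x': [12.644881, 11.563269, 12.644881, 8.153184],
                    'y': [48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss': ['s1', 's2', 's3', 's4'],
                    'x': [9.269, 9.390, 9.317, 10.061],
                    'y': [55.037, 54.940, 54.716, 54.349]})


def haversine_km(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance in km on a sphere (an approximation of the geodesic)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return 2 * radius_km * np.arcsin(np.sqrt(a))


# Column vectors for plants and row vectors for substations broadcast
# to a full (n_plants, n_substations) distance matrix in one call.
dist_km = haversine_km(
    df1['y'].to_numpy()[:, None], df1['x'].to_numpy()[:, None],
    df2['y'].to_numpy()[None, :], df2['x'].to_numpy()[None, :],
)

nearest = dist_km.argmin(axis=1)  # closest substation index per plant

result = df1.assign(
    ID_ss_nearest=df2['ID_ss'].to_numpy()[nearest],
    dist_km=dist_km[np.arange(len(df1)), nearest],
)
print(result)
```

The full distance matrix needs memory proportional to the number of plants times the number of substations. For very large inputs, processing the plants in chunks or using a spatial index such as scikit-learn's BallTree with metric='haversine' may be a better fit.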