group dataframe by distance to datapoints in different dataframe using long and lat

Question

I have two DataFrames. One contains several power plants and their respective location by longitude and latitude, each in one column. Another dataframe contains several substations, also with long and lat. What I am trying to do is to assign the power plants to the closest substations .

df1 = pd.DataFrame{'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881,  8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]}
df2 = pd.DataFrame{'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]}

I suppose I need to calculate the distance between the all the points and then group the dataframes, but I am not sure how. I found the numpy.linalg.norm() function, but it doesnt really work for me. Any help is appreciated.

I found this solution, which is basically exactly what I need:

import pandas as pd
import geopy.distance



for i,row in test.iterrows(): # A
    df1 = row.x, row.y
    distances = []
    for j,row2 in df2.iterrows(): # B
        b = row2.x, row2.y
        distances.append(geopy.distance.geodesic(a, b).km)

    min_distance = min(distances)
    min_index = distances.index(min_distance)


    print("A", i, "is closest to B", min_index, min_distance, "km")

It works, BUT it takes forever and my dataset is quite large. I think an approach using .apply might be quicker. Anybody got an Idea how to adapt this approach into a apply approach?

Take both df like df1["distance"] =df1.apply(functionname, axis=1) and in functionname check return distance... — Faika Majid, Nov 03 '21 at 10:11
Then do like df.groupby["id_pp"].distance.min() to each dfs to get minimum — Faika Majid, Nov 03 '21 at 10:19
But If I use df1["distance"], the code only checks the distance to one datapoint and not to all, right? — Elias, Nov 03 '21 at 10:21
Consider using the `geopandas` library if that is an option. — Anders Källmar, Nov 03 '21 at 10:24
Also, np.linalg.norm is perhaps not the right 'distance' function here. I think using haversine is appropriate, right? https://stackoverflow.com/questions/4913349/haversine-formula-in-python-bearing-and-distance-between-two-gps-points — Thomas Hilger, Nov 03 '21 at 10:54
Im actually okay with the one from my post - not the np.linalg one, but the geopy function. I just cant find a way to use it without iterating over the rows, which takes way too long. Im not an expert in python, do you see an easy way to reformulate the function from the post into an .apply way without iteration? — Elias, Nov 03 '21 at 11:19
No it will check distance for all the rows, because we are giving it axis=1 — Faika Majid, Nov 03 '21 at 11:55
Example how you will write function name is below def functionname(row): return geopy.distance.geodesic(row['x'], row['y']) — Faika Majid, Nov 03 '21 at 11:59

score 1 · Answer 1 · answered Nov 03 '21 at 20:58

Here is a solution using geopandas. I don't know how well this scales with a larger dataset.

import geopandas as gpd
import pandas as pd

df1 = pd.DataFrame({'ID_pp':['p1','p2','p3','p4'],'x':[12.644881,11.563269, 12.644881,  8.153184], 'y':[48.099206, 48.020081, 48.099206, 49.153766]})
df2 = pd.DataFrame({'ID_ss':['s1','s2','s3','s4'],'x':[9.269, 9.390, 9.317, 10.061], 'y':[55.037, 54.940, 54.716, 54.349]})

# create GeoDataFrames from the original dfs
gdf1 = gpd.GeoDataFrame(df1[['ID_pp']], geometry=gpd.points_from_xy(df1['x'], df1['y']), crs='EPSG:4326')
gdf2 = gpd.GeoDataFrame(df2[['ID_ss']], geometry=gpd.points_from_xy(df2['x'], df2['y']), crs='EPSG:4326')

# convert to another coordinate reference system for units in metres, EPSG:5243 suits Germany as far as I know 
gdf1 = gdf1.to_crs('EPSG:5243')
gdf2 = gdf2.to_crs('EPSG:5243')

gdf2 = gdf2.set_index('ID_ss')

def get_closest_ss(point, other):
    s = other.distance(point)
    return (s.idxmin(), s.min())

# find ID of closest substation to all power plants
gdf1[['closest_ss', 'distance']] = gdf1.geometry.apply(get_closest_ss, args=(gdf2,)).to_list()

# merge the dataframe with the power plants (gdf1) with the closest substation (gdf2)
gdf = gdf1.merge(gdf2, left_on='closest_ss', right_index=True, suffixes=('', '_ss'))

print(gdf)

# output

  ID_pp                         geometry closest_ss       distance  \
0    p1   POINT (159807.847 -320153.333)         s4  717896.945731   
1    p2    POINT (79356.344 -330713.037)         s4  711534.096071   
2    p3   POINT (159807.847 -320153.333)         s4  717896.945731   
3    p4  POINT (-171106.060 -202478.708)         s4  592470.679838   

                     geometry_ss  
0  POINT (-28563.516 372589.227)  
1  POINT (-28563.516 372589.227)  
2  POINT (-28563.516 372589.227)  
3  POINT (-28563.516 372589.227)

group dataframe by distance to datapoints in different dataframe using long and lat

1 Answers1