Non-pairwise distance measure while preserving all columns from the original geopandas dataframes

Question

There is an existing solution for computing non-pairwise distances between two geopandas dataframes (gdf). However, the resulting distance matrix only preserves the index of the two gdf, which is not very readable. I added some columns to the gdf as follows and then computed the distance matrix:

import pandas as pd
import geopandas as gpd

gdf_1 = gpd.GeoDataFrame(geometry=gpd.points_from_xy([0, 0, 0], [0, 90, 120]))
gdf_2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy([0, 0], [0, -90]))

home = ['home_1', 'home_2', 'home_3']
shop = ['shop_1', 'shop_2']

gdf_1['home'] = home
gdf_2['shop'] = shop

gdf_1.geometry.apply(lambda g: gdf_2.distance(g))

As the resulting table shows, nothing from the original gdf is preserved in the outcome except the index, which is neither intuitive nor useful. How can I preserve all the original columns from both gdf in the resulting distance matrix, or at least keep the "home", "shop", and "distance" columns?

Please note: "distance" is the distance from home to shop, and the two "geometry" columns may need suffixes to tell them apart.

Matthew Borish · Accepted Answer · 2021-02-23T20:56:17.900

You can use a combination of stack and merge to create your desired output.

import pandas as pd
import geopandas as gpd

gdf_1 = gpd.GeoDataFrame(geometry=gpd.points_from_xy([0, 0, 0], [0, 90, 120]))
gdf_2 = gpd.GeoDataFrame(geometry=gpd.points_from_xy([0, 0], [0, -90]))

home = ['home_1', 'home_2', 'home_3']
shop = ['shop_1', 'shop_2']

gdf_1['home'] = home
gdf_2['shop'] = shop

# set indices so we can have them in gdf_3 
# you could also do this when making gdf_1 and gdf
gdf_1.index = gdf_1['home']
gdf_2.index = gdf_2['shop']


gdf_3 = gdf_1.geometry.apply(lambda g: gdf_2.distance(g))

# reshape our data; stack returns a Series here, but we want a DataFrame
# (stack's default level=-1 is what we need, so no arguments are required)
gdf_4 = pd.DataFrame(gdf_3.stack())
gdf_4.reset_index(inplace=True)

# merge the original columns over
df_merge_1 = pd.merge(gdf_4, gdf_2,
                        left_on='shop',
                        right_on=gdf_2.index,
                        how='outer').fillna('')

df_merge_2 = pd.merge(df_merge_1, gdf_1,
                        left_on='home',
                        right_on=gdf_1.index,
                        how='outer').fillna('')

# get rid of extra cols
df_merge_2 = df_merge_2[['shop', 'home', 0, 'geometry_x', 'geometry_y']]

# rename cols
df_merge_2.columns = ['shop', 'home', 'distance', 'geometry_s', 'geometry_h']

df_merge_2 is a pandas df, but you can create a gdf easily.

df_merge_2_gdf = gpd.GeoDataFrame(df_merge_2, geometry=df_merge_2['geometry_h'])

Hi, this works. I'd like to go one step further: if both gdf_1 and gdf_2 have more columns (for example, zip codes and values for home, and names for shop), how can all of these columns be preserved in the final dataframe using your approach? - Neo, Feb 24 '21 at 03:55
Great! You can skip the code below the "# get rid of extra cols" line and adjust the "# rename cols" code accordingly. Drop another comment if you need a hand. - Matthew Borish, Feb 24 '21 at 04:04
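
For the follow-up about carrying extra columns, one alternative to stack and merge is a cross join, which builds one row per (home, shop) pair and keeps every column from both frames. This is only a sketch and not the approach from the answer above. The `zip` and `name` columns and their values are made up for illustration, and `how="cross"` assumes pandas 1.2 or later.

```python
import pandas as pd
import geopandas as gpd

# Same toy points as in the question, plus one hypothetical extra column each.
gdf_1 = gpd.GeoDataFrame(
    {"home": ["home_1", "home_2", "home_3"], "zip": ["10001", "10002", "10003"]},
    geometry=gpd.points_from_xy([0, 0, 0], [0, 90, 120]),
)
gdf_2 = gpd.GeoDataFrame(
    {"shop": ["shop_1", "shop_2"], "name": ["bakery", "grocer"]},
    geometry=gpd.points_from_xy([0, 0], [0, -90]),
)

# Cross join on plain DataFrames: every home row is paired with every shop row.
# The suffixes separate the two geometry columns (geometry_h, geometry_s).
pairs = pd.DataFrame(gdf_1).merge(
    pd.DataFrame(gdf_2), how="cross", suffixes=("_h", "_s")
)

# Row-wise distance between the paired geometries (both share the same index).
pairs["distance"] = gpd.GeoSeries(pairs["geometry_h"]).distance(
    gpd.GeoSeries(pairs["geometry_s"])
)

# Back to a GeoDataFrame, with the home geometry as the active geometry column.
pairs_gdf = gpd.GeoDataFrame(pairs, geometry="geometry_h")
print(pairs_gdf[["home", "zip", "shop", "name", "distance"]])
```

Because nothing is dropped or renamed by position, any further columns added to gdf_1 or gdf_2 come through without changing the code. As with the original snippet, the distances are planar and in the units of the coordinates.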
