Pandas fuzzy merge by int coordinates

Question

Imagine we have 2 dataframes with coordinates ['X','Y']:

df1 :

 X            Y          House №
2531        2016           175
2219        2196           11
2901        3426           201
6901        4431           46
7891        1126           89

df2 :

 X            Y      Delivery office №
2534        2019            O1
6911        4421            O2
2901        3426            O3
7894.5      1120            O4

My idea is to merge them and get:

df3

 X            Y          House №    Delivery office №
2531        2016           175            01
2219        2196           11             NA
2901        3426           201            03
6901        4431           46             02
7891        1126           89             04

So we wants to realise 'fuzzy' merge by threshold (this param should be given by user). You can see that house number 11 didn't get any delivery office number because it located too much away from all of presented offices in df2.

So I need all rows from df2 'find' it's closest row from df1 and add it's 'Cost' value to it You can see that usual in-box pd.merge do not work there as well as custom packages that realize fuzzy logic relates to string values using levenshtein distance and so on

Does this answer your question? [pandas merging based on a timestamp which do not match exactly](https://stackoverflow.com/questions/34880539/pandas-merging-based-on-a-timestamp-which-do-not-match-exactly) — luigigi, Jan 16 '20 at 08:19
@luigigi thank you, but my problem is quite more easier and not relates to time/date stamps — Андрей Севостьянов, Jan 16 '20 at 08:22

Sandipan Dey · Answer 1 · 2020-01-16T19:36:39.570

I think the following should do the trick, computing the Euclidean distances along with left-outer join, don't need to use any fuzzy logic though:

import pandas as pd
import numpy as np

df1 = pd.DataFrame([[2531,2016,175],[2219,2196,11],[2901,3426,201],[6901,4431,46],[7891,1126,89]], columns = ['X','Y','House'])
df2 = pd.DataFrame([[2534,2019,'O1'],[6911,4421,'O2'],[2901,3426,'03'],[7894.5,1120,'O4']], columns = ['X','Y','Delivery office'])
threshold = 20

df = df1.assign(key=1).merge(df2.assign(key=1), on="key").drop("key", axis=1)
df['dist'] = df.apply(lambda row: np.linalg.norm(row[['X_x', 'Y_x']].values - row[['X_y','Y_y']].values), axis=1)
df = df.sort_values(by=['dist'])
print(df.head())
#     X_x   Y_x  House     X_y   Y_y Delivery office        dist
#10  2901  3426    201  2901.0  3426              03    0.000000
#0   2531  2016    175  2534.0  2019              O1    4.242641
#19  7891  1126     89  7894.5  1120              O4    6.946222
#13  6901  4431     46  6911.0  4421              O2   14.142136
#4   2219  2196     11  2534.0  2019              O1  361.322571

df = pd.merge(df1, df[df.dist < threshold], left_on=['X','Y','House'], right_on=['X_x','Y_x','House'], how='left')[['X','Y','House','Delivery office']]
print(df.head())
#      X     Y  House Delivery office
#0  2531  2016    175              O1
#1  2219  2196     11             NaN
#2  2901  3426    201              03
#3  6901  4431     46              O2
#4  7891  1126     89              O4

EDIT

Posting yet another solution, this time without using join - it will be more expensive in terms of execution time but will have less space complexity.

res = df1.apply(lambda row: df2.apply(lambda row2: np.linalg.norm(row[['X', 'Y']].values - row2[['X', 'Y']].values), axis=1), axis=1)
df1['dist'] = res.min(axis=1)
df1['Delivery office'] = df2.iloc[res.idxmin(axis=1).values,:]['Delivery office'].values
df1.loc[df1.dist >= threshold, 'Delivery office'] = np.nan
print(df1.head())
#       X     Y  House        dist Delivery office
# 0  2531  2016    175    4.242641              O1
# 1  2219  2196     11  361.322571             NaN
# 2  2901  3426    201    0.000000              03
# 3  6901  4431     46   14.142136              O2
# 4  7891  1126     89    6.946222              O4
df1 = df1[['X','Y','House','Delivery office']]

good, thank you. Can't test your solution on my data, because it crashes (memory error), "df" is become to big suddenly. — Андрей Севостьянов, Jan 16 '20 at 12:07
Posted another solution that has less space complexity and more time complexity, you should be able to test this i think @Андрей Севостьянов — Sandipan Dey, Jan 16 '20 at 19:37

Pandas fuzzy merge by int coordinates

1 Answers1

EDIT