Python Pandas: Find closest match for multiple columns between two dataframes

Question

I have two dataframes, both of which contain 3 values df1(X,Y,Z), df2(A,B,C) I wish to create a new dataframe, df3, that contains extracts the closest match for all the components in df2 for each row in df1; i.e. for each row in df1 return the row in df2 with the closest match of A to X, B to Y, C to Z.

Similar to How do I find the closest values in a Pandas series to an input number?, however this question only seeks to return the nearest row based on a single value, whereas I am seeking to return the row in which the nearest match of all three columns is returned and the process is iterated for each row in df1 as opposed to a single value.

Inputs
df1

 Index A    B    C
 1     2    4    6 
 2     0.8  7    9

df2

Index X     Y     Z
1     11    3     4.5
2     2.2   4.4   5.8
3     1     6.8   9.3
4     1.3   10    9.5

Output

df3

 Index X     Y     Z
 2     0.32  0.11  0.4
 3     1     6.8   9.3

How efficient should the solution be / how much data are we talking about? One way could be to create another column with the difference and selecting the argmin with respect to this new column. Would that work for you? — André, Jul 05 '21 at 15:21
@André Thats good point that I should have mentioned, some of the df2 dataframes in my example will be as large as 3 million rows. I believe what you are suggesting is similiar to Michael O's answer below - I'm unsure how efficient this would be but will give it a try. Thanks. — SasCom, Jul 05 '21 at 15:58

Michael · Accepted Answer · 2021-07-06T09:01:59.123

2

Find a minimum of their difference's product:

df3 = pd.DataFrame()
for k, v in df1.iterrows(): 
    i = ((df2['X']-v['A']) * \
       (df2['Y']-v['B']) * \
       (df2['Z']-v['C'])).abs().idxmin() 
    df3 = df3.append(df2.loc[i])

print(df3)


     X    Y    Z
2  2.2  4.4  5.8
3  1.0  6.8  9.3

edited Jul 06 '21 at 09:01

answered Jul 05 '21 at 15:36

Michael

5,095
2
13
35

2

Note that sorting takes `O(nlogn)` computation steps, whereas the minimum can be found in `O(n)`. Using `idxmin` instead of `sort_values().index[0]` will boost performance for larger datasets. – André Jul 06 '21 at 08:46
@André Thank you for this advice. – Michael Jul 06 '21 at 09:02

Python Pandas: Find closest match for multiple columns between two dataframes

1 Answers1