0

I have two dataframes, both of which contain 3 values df1(X,Y,Z), df2(A,B,C) I wish to create a new dataframe, df3, that contains extracts the closest match for all the components in df2 for each row in df1; i.e. for each row in df1 return the row in df2 with the closest match of A to X, B to Y, C to Z.

Similar to How do I find the closest values in a Pandas series to an input number?, however this question only seeks to return the nearest row based on a single value, whereas I am seeking to return the row in which the nearest match of all three columns is returned and the process is iterated for each row in df1 as opposed to a single value.

Inputs
df1

 Index A    B    C
 1     2    4    6 
 2     0.8  7    9
  

df2

Index X     Y     Z
1     11    3     4.5
2     2.2   4.4   5.8
3     1     6.8   9.3
4     1.3   10    9.5

Output

df3

 Index X     Y     Z
 2     0.32  0.11  0.4
 3     1     6.8   9.3
SasCom
  • 47
  • 1
  • 7
  • How efficient should the solution be / how much data are we talking about? One way could be to create another column with the difference and selecting the argmin with respect to this new column. Would that work for you? – André Jul 05 '21 at 15:21
  • @André Thats good point that I should have mentioned, some of the df2 dataframes in my example will be as large as 3 million rows. I believe what you are suggesting is similiar to Michael O's answer below - I'm unsure how efficient this would be but will give it a try. Thanks. – SasCom Jul 05 '21 at 15:58
  • 1
    I've edited my answer to get a desired result. – Michael Jul 05 '21 at 19:55

1 Answers1

2

Find a minimum of their difference's product:

df3 = pd.DataFrame()
for k, v in df1.iterrows(): 
    i = ((df2['X']-v['A']) * \
       (df2['Y']-v['B']) * \
       (df2['Z']-v['C'])).abs().idxmin() 
    df3 = df3.append(df2.loc[i])

print(df3)


     X    Y    Z
2  2.2  4.4  5.8
3  1.0  6.8  9.3
Michael
  • 5,095
  • 2
  • 13
  • 35
  • 2
    Note that sorting takes `O(nlogn)` computation steps, whereas the minimum can be found in `O(n)`. Using `idxmin` instead of `sort_values().index[0]` will boost performance for larger datasets. – André Jul 06 '21 at 08:46
  • @André Thank you for this advice. – Michael Jul 06 '21 at 09:02