2

I have two pandas dataframes:

df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

Now I want to compare the values in col1 of both dataframes. Consider 5.3221 from df1, I want to check if this value exists in df2['col1'] with an error of 0.005 (in this very example 5.326 from df2['col1'] should be considered equal to 5.3221) and make a third dataframe to hold both columns from df1 and df2 where the above said condition is true.

The expected output is:

    col1    col2    col1.1  col2.2
0   5.3221  b       5.236   y
1   4.3215  c       4.326   w

I have defined a function which is able to take care of the error condition:

def close(a, b, e=0.005):
    return round(abs(a - b), 3) <= e

But I don't know how to apply this on the data without using a for loop. I also know that I can use numpy.intersect1d but I can not figure out how.

Any help would be appreciated :)

EDIT: The suggested duplicate answer doesn't address my problem. That question just works on combining two dataframes based on similar looking indices. Also difflib is used to find word matches and not integer. My scenario is completely different.

  • round(abs(a - b), 3) == e should be round(abs(a - b), 3) <= e , and I don't think you need to round the difference. – nicomp Jul 28 '21 at 11:34
  • Does this answer your question? [is it possible to do fuzzy match merge with python pandas?](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) – Grzegorz Skibinski Jul 28 '21 at 11:55

2 Answers2

2

I have added code which words

First calculate the distance between each point as cross, then filter. Get those rows and merge

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'col1': [1.2574, 5.3221, 4.3215, 9.8841], 'col2': ['a', 'b', 'c', 'd']})
df2 = pd.DataFrame({'col1': [4.326, 9.89, 5.326, 1.2654], 'col2': ['w', 'x', 'y', 'z']})

# Get the target columns
c11 = df1['col1'].to_numpy()
c21 = df2['col1'].to_numpy()

# calculate cross errors by broadcast and filter columns
# these will be indices of rows to be inserted in new df
c = np.argwhere(np.abs(c11[:, np.newaxis] - c21) < 0.005)


x = pd.DataFrame()
# Insert by removing index otherwise it will try to match the indexs are change row orders
x[['col1', 'col2']] = df1.iloc[c[:, 0]][['col1', 'col2']].reset_index(drop=True)
x[['col1.1', 'col2.2']] = df2.iloc[c[:, 1]][['col1', 'col2']].reset_index(drop=True)

print(x)

eroot163pi
  • 1,791
  • 1
  • 11
  • 23
1

NumPy's broadcasting can be used for cross comparison and getting the indices in each frame where the difference falls into error margin. Then we index into the frames and concatenate the results:

# find where two frames are close
eps = 0.005
diff = np.abs(df1.col1.to_numpy()[:, np.newaxis] - df2.col1.to_numpy())
inds_1, inds_2 = np.where(diff <= eps)

# filter the frames with these indices
first = df1.iloc[inds_1].reset_index(drop=True)
second = df2.iloc[inds_2].reset_index(drop=True)

# adjust column names of the second one, e.g., "col2.2"
second.columns = [col + f".{j}" for j, col in enumerate(second.columns, start=1)]

# put together
result = pd.concat([first, second], axis=1)

to get

>>> result

     col1 col2  col1.1 col2.2
0  5.3221    b   5.326      y
1  4.3215    c   4.326      w

Intermediate result diff is:

>>> diff

array([[3.0686e+00, 8.6326e+00, 4.0686e+00, 8.0000e-03],
       [9.9610e-01, 4.5679e+00, 3.9000e-03, 4.0567e+00],
       [4.5000e-03, 5.5685e+00, 1.0045e+00, 3.0561e+00],
       [5.5581e+00, 5.9000e-03, 4.5581e+00, 8.6187e+00]])

of shape (len(df1), len(df2)), where ij'th entry is df1.col1[i] - df2.col1[j].

Mustafa Aydın
  • 17,645
  • 4
  • 15
  • 38
  • 1
    Thanks, this works just as I expected. One thing I'd like to point is please edit `np.where(diff < eps)` to `np.where(diff <= eps)` :) – theProcrastinator Jul 28 '21 at 12:41