8

I reviewed the following posts beforehand. Is there a way to use DataFrame.isin() with an approximation factor or a tolerance value? Or is there another method that could?

Filter dataframe rows if value in column is in a set list of values

use a list of values to select rows from a pandas dataframe

EX)

df = DataFrame({'A' : [5,6,3.3,4], 'B' : [1,2,3.2, 5]})

In : df
Out:
   A    B
0  5    1
1  6    2
2  3.3  3.2
3  4    5  

df[df['A'].isin([3, 6], tol=.5)]

In : df
Out:
   A    B
1  6    2
2  3.3  3.2
Community
  • 1
  • 1
webmaker
  • 456
  • 1
  • 5
  • 15
  • In this exact case, you can create copies of A & B that are rounded to the nearest integer then use those to identify valid index values in the original columns. In other words, you can implement the tolerances on the data side rather than the function side. – Sohier Dane Sep 20 '16 at 19:12

1 Answers1

15

You can do a similar thing with numpy's isclose:

df[np.isclose(df['A'].values[:, None], [3, 6], atol=.5).any(axis=1)]
Out: 
     A    B
1  6.0  2.0
2  3.3  3.2

np.isclose returns this:

np.isclose(df['A'].values[:, None], [3, 6], atol=.5)
Out: 
array([[False, False],
       [False,  True],
       [ True, False],
       [False, False]], dtype=bool)

It is a pairwise comparison of df['A']'s elements and [3, 6] (that's why we needed df['A'].values[: None] - for broadcasting). Since you are looking for whether it is close to any one of them in the list, we call .any(axis=1) at the end.


For multiple columns, change the slice a little bit:

mask = np.isclose(df[['A', 'B']].values[:, :, None], [3, 6], atol=0.5).any(axis=(1, 2))
mask
Out: array([False,  True,  True, False], dtype=bool)

You can use this mask to slice the DataFrame (i.e. df[mask])


If you want to compare df['A'] and df['B'] (and possible other columns) with different vectors, you can create two different masks:

mask1 = np.isclose(df['A'].values[:, None], [1, 2, 3], atol=.5).any(axis=1)
mask2 = np.isclose(df['B'].values[:, None], [4, 5], atol=.5).any(axis=1)
mask3 = ...

Then slice:

df[mask1 & mask2]  # or df[mask1 & mask2 & mask3 & ...]
ayhan
  • 70,170
  • 20
  • 182
  • 203
  • @webmaker When I try it returns an empty DataFrame because the values in D column are less than 0.15 (you passed 0.1 as tolerance and the smallest value in the list is 0.25 so it doesn't return True for any of them. Should it?) – ayhan Sep 20 '16 at 19:52
  • 1
    it was something else, I managed to fix the issue. now im trying to figure out how to do multiple columns at a time. perhaps join/concatenate df2 and df3? df2=df[np.isclose(df['B'].values[:, None], [0.939,0.874,1.0, ], atol=.05).any(axis=1)] df3=df[np.isclose(df['A'].values[:, None], [-0.12,0.0,0.12], atol=.05).any(axis=1)] – webmaker Sep 20 '16 at 20:39
  • 1
    @webmaker I've edited the post to include a solution for multiple columns – ayhan Sep 20 '16 at 21:06
  • Sorry I might have misunderstood it. You have different vectors to compare for column A and B? If so I believe you need to save them as two masks (one for A and one for B) and use `df[mask1 & mask2]` – ayhan Sep 20 '16 at 21:30
  • something like this mask = np.isclose(df[['A', 'B', 'C', 'D']].values[:, :, :,:,None], [a_val, b_val, c_val, d_val], atol=0.5).any(axis=(1, 2,3,4)) where a_val correspoonds to df['A'], b_val to df['B'] and so on – webmaker Sep 20 '16 at 21:47
  • @webmaker In that case you need to create multiple masks. The slice `[:, :, None]` is from 2D to 3D, so the number of columns is not relevant here. If you have 4, you'd still use `[:, :, None]`. The part about multiple columns in the answer would have worked for `df[['A', 'B', 'C', 'D']]` too, as long as you compare it against a single list/array. I've added a part about creating different masks. – ayhan Sep 20 '16 at 22:01