
Suppose I have these two dataframes with the same number of columns, but a possibly different number of rows:

tmp = np.arange(0,12).reshape((4,3))
df = pd.DataFrame(data=tmp) 

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

print(df)
   0   1   2
0  0   1   2
1  3   4   5
2  6   7   8
3  9  10  11

print(df2)
     a    b    c
0    3    4    5
1  100    4  100
2  101  100    3

I want to verify whether the rows of df2 match any of the rows of df, i.e. I want to obtain a series (or an array) of boolean values that gives this result:

0     True
1    False
2    False
dtype: bool

I think something like the isin method should work, but it gives me this result, which is a dataframe and is wrong:

print(df2.isin(df))
       a      b      c
0  False  False  False
1  False  False  False
2  False  False  False

As a constraint, I would rather not use the merge method, since what I am doing is in fact a check on the data before applying merge itself. Thank you for your help!

ixaixim

5 Answers


You can use numpy.isin, which checks each element of df2 against all the values in df and returns an element-wise boolean array of the same shape.

Calling all() on each row then gives the desired output, since all() returns True only if every element in the row is True:

>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

0     True
1    False
2    False
dtype: bool

Breakdown of what is happening:

# np.isin
>>> np.isin(df2.values,df.values)

Out[139]: 
array([[ True,  True,  True],
       [False,  True, False],
       [False, False,  True]])

# all()
>>> [m.all() for m in np.isin(df2.values,df.values)]

Out[140]: [True, False, False]

# pd.Series()
>>> pd.Series([m.all() for m in np.isin(df2.values,df.values)])

Out[141]: 
0     True
1    False
2    False
dtype: bool
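The Python-level list comprehension can be avoided by giving all() an axis; this is a minor vectorized rewrite of the snippet above, not a different method, and it shares the same caveat that an element matching anywhere in df counts:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# np.isin returns a boolean array with the same shape as df2;
# .all(axis=1) collapses each row to a single flag without a Python loop.
mask = pd.Series(np.isin(df2.values, df.values).all(axis=1))
print(mask.tolist())  # [True, False, False]
```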
sophocles
    This sounds inefficient. It is O(n^2) where n is the number of elements in the data frames. Since OP just wants matching rows, it should be doable in O(n). – user2315856 Dec 23 '21 at 12:15
  • According to the answer below it's not the most inefficient approach; it's actually much, much better than using `apply`. I tested the other answers, and the ones that work for me are Corralien's and sammy's, which are all inefficient apparently. The other ones didn't work for me. – sophocles Dec 23 '21 at 12:24

There may be more efficient solutions, but you could append the two dataframes and call duplicated, e.g.:

df.append(df2).duplicated().iloc[df.shape[0]:]

This assumes that all rows in each DataFrame are distinct. Here are some benchmarks:

tmp1 = np.arange(0,12).reshape((4,3))
df1 = pd.DataFrame(data=tmp1,  columns=["a", "b", "c"]) 

tmp2 = {'a':[3,100,101], 'b':[4,4,100], 'c':[5,100,3]}
df2 = pd.DataFrame(data=tmp2)

df1 = pd.concat([df1] * 10_000).reset_index(drop=True)
df2 = pd.concat([df2] * 10_000).reset_index(drop=True)

%timeit df1.append(df2).duplicated().iloc[df1.shape[0]:]
# 100 loops, best of 5: 4.16 ms per loop
%timeit pd.Series([m.all() for m in np.isin(df2.values,df1.values)])
# 10 loops, best of 5: 74.9 ms per loop
%timeit df2.apply(frozenset, axis=1).isin(df1.apply(frozenset, axis=1))
# 1 loop, best of 5: 443 ms per loop
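Two notes on this approach (my additions, not part of the original answer): DataFrame.append was removed in pandas 2.0, and the column labels of the two frames must line up for rows to be compared, whereas the question's df uses 0/1/2 and df2 uses a/b/c. A pd.concat-based sketch handling both points:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# Relabel df2's columns positionally so concat stacks the frames row-wise,
# then flag repeated rows and keep only the flags belonging to df2.
stacked = pd.concat([df, df2.set_axis(df.columns, axis=1)], ignore_index=True)
mask = stacked.duplicated().iloc[len(df):].reset_index(drop=True)
print(mask.tolist())  # [True, False, False]
```

As in the original answer, this assumes all rows within each DataFrame are distinct.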
hilberts_drinking_problem

Use np.in1d:

>>> df2.apply(lambda x: all(np.in1d(x, df)), axis=1)
0     True
1    False
2    False
dtype: bool

Another way, use frozenset:

>>> df2.apply(frozenset, axis=1).isin(df.apply(frozenset, axis=1))
0     True
1    False
2    False
dtype: bool
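One caveat worth noting (my addition, not part of the original answer): frozenset discards element order and multiplicity, so a permuted row such as (5, 4, 3) would also count as a match. Using tuple instead keeps exact-row semantics:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# tuple preserves element order and repeats, so only exact row matches count;
# frozenset would also match rearranged rows like (5, 4, 3).
mask = df2.apply(tuple, axis=1).isin(df.apply(tuple, axis=1))
print(mask.tolist())  # [True, False, False]
```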
Corralien

You can use a MultiIndex (expensive IMO):

pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df))
Out[32]: array([ True, False, False])

Another option is to create a dictionary, and run isin:

df2.isin({key : array.array for key, (_, array) in zip(df2, df.items())}).all(1)
Out[45]: 
0     True
1    False
2    False
dtype: bool
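A small repackaging sketch (my addition): MultiIndex.isin returns a plain numpy array, so wrapping it in pd.Series reproduces the exact output format shown in the question:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(12).reshape(4, 3))
df2 = pd.DataFrame({'a': [3, 100, 101], 'b': [4, 4, 100], 'c': [5, 100, 3]})

# MultiIndex.isin compares whole row tuples at once; index names are
# irrelevant, only the tuple values are matched.
mask = pd.Series(pd.MultiIndex.from_frame(df2).isin(pd.MultiIndex.from_frame(df)))
print(mask.tolist())  # [True, False, False]
```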
sammywemmy

Try:

df2.apply(tuple, 1).isin(df.apply(tuple, 1))


Girolamo