0

I want to check if the value of a row in one dataframe is not contained in a specific column of another dataframe throug an "if" conditional

If my dataframes are:

df1:
   col1 col2
0   a   e
1   b   f
2   c   g
3   d   h

df2:

   col1 col2
0   a   y
1   v   u
2   x   z
3   w   t

I want to iterate through every row in col1 in df1 and check if that value is not contained in col1 of df2

My current code is:

 for row, i in df1.iterrows():
    for row, j in df2.iterrows():
       if i.col1 not in j.col1:
          print("blu")

Right now the code will enter the if condition even though the value in col1 of df1 IS contained in col1 of df2

Any help will be appreciated.

Yusef Jacobs
  • 107
  • 1
  • 1
  • 4

2 Answers2

0

Use isin

df1.col1.isin(df2.col1)

0     True
1    False
2    False
3    False
Name: col1, dtype: bool
piRSquared
  • 285,575
  • 57
  • 475
  • 624
0

In pandas is best avoid loops with iterrows, because slow. So better is use very fast vectorized pandas or numpy functions.

If need check if not present in column - use isin with ~ for invert boolean mask:

mask = ~df1.col1.isin(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

alternative solution is use numpy.in1d:

mask = ~np.in1d(df1.col1,df2.col1)
print (mask)
[False  True  True  True]

If need check per rows use != or ne:

mask = df1.col1 != df2.col1
#same as 
#mask = df1.col1.ne(df2.col1)
print (mask)

0    False
1     True
2     True
3     True
Name: col1, dtype: bool

Or:

mask = df1.col1.values != df2.col1.values
print (mask)
[False  True  True  True]

And if new column by mask is possible use numpy.where:

df1['new'] = np.where(mask, 'a', 'b')
print (df1)
  col1 col2 new
0    a    e   b
1    b    f   a
2    c    g   a
3    d    h   a

Difference is better seen in a bit different DataFrames:

print (df1)
  col1 col2
0    a    e
1    b    f
2    c    g
3    d    h

print (df2)
  col1 col2
0    a    y
1    v    u
2    d    z <- change value to d
3    w    t


mask = df1.col1 != df2.col1
print (mask)
0    False
1     True
2     True
3     True
Name: col1, dtype: bool

mask = ~df1.col1.isin(df2.col1)
print (mask)
0    False
1     True
2     True
3    False
Name: col1, dtype: bool

Numpy solutions are obviously faster:

In [23]: %timeit (~df1.col1.isin(df2.col1))
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 198 µs per loop

In [24]: %timeit (~np.in1d(df1.col1,df2.col1))
The slowest run took 9.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 42.5 µs per loop
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Thank you, great answer. All of the solutions work to determine if a value is in a column but when I try to use this with a conditional like "if" i get the following error: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all(). – Yusef Jacobs May 10 '17 at 19:20
  • You need add any for chech if at least one value is true by `if mask.any(): print 'at least one True'` – jezrael May 10 '17 at 19:28
  • Also better explain it [this](http://stackoverflow.com/questions/42071684/if-condition-for-comparison-between-two-dataframes-in-python) answer. – jezrael May 10 '17 at 19:32