In pandas it is best to avoid loops with iterrows, because they are slow. It is better to use the much faster vectorized pandas or numpy functions.
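As a minimal sketch of that advice, the snippet below builds the same membership mask twice: once with an iterrows loop and once with a single vectorized call. The sample frames are assumptions modeled on the ones printed further down in this answer, and the names loop_mask and vec_mask are only illustrative.

```python
import pandas as pd

# Sample data (assumed, modeled on the frames printed later in this answer)
df1 = pd.DataFrame({'col1': list('abcd'), 'col2': list('efgh')})
df2 = pd.DataFrame({'col1': list('avdw'), 'col2': list('yuzt')})

# Slow: a Python-level loop that visits one row at a time
lookup = set(df2['col1'])
loop_mask = []
for idx, row in df1.iterrows():
    loop_mask.append(row['col1'] not in lookup)

# Fast: one vectorized call over the whole column
vec_mask = ~df1['col1'].isin(df2['col1'])

print(loop_mask)
print(vec_mask.tolist())
```

Both approaches give the same mask; the vectorized version avoids building a Series object for every row.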
To check whether values are not present in another column, use isin with ~ to invert the boolean mask:
mask = ~df1.col1.isin(df2.col1)
print (mask)
0 False
1 True
2 True
3 True
Name: col1, dtype: bool
An alternative solution is numpy.in1d (newer NumPy versions recommend numpy.isin instead):
mask = ~np.in1d(df1.col1,df2.col1)
print (mask)
[False True True True]
To compare the columns row by row instead, use != or ne:
mask = df1.col1 != df2.col1
#same as
#mask = df1.col1.ne(df2.col1)
print (mask)
0 False
1 True
2 True
3 True
Name: col1, dtype: bool
Or compare the underlying numpy arrays:
mask = df1.col1.values != df2.col1.values
print (mask)
[False True True True]
To create a new column from the mask, you can use numpy.where:
df1['new'] = np.where(mask, 'a', 'b')
print (df1)
col1 col2 new
0 a e b
1 b f a
2 c g a
3 d h a
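For reference, a self-contained sketch of the numpy.where step. The df1 values are taken from the printed output above; the mask is written out by hand here (an assumption made only so the snippet runs on its own).

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'col1': list('abcd'), 'col2': list('efgh')})

# Mask written out by hand so the snippet is standalone
mask = np.array([False, True, True, True])

# Where mask is True take 'a', otherwise take 'b'
df1['new'] = np.where(mask, 'a', 'b')
print(df1)
```

numpy.where picks the second argument where the mask is True and the third where it is False, so the first row gets 'b' and the rest get 'a'.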
The difference between the two checks is easier to see with slightly different DataFrames:
print (df1)
col1 col2
0 a e
1 b f
2 c g
3 d h
print (df2)
col1 col2
0 a y
1 v u
2 d z <- change value to d
3 w t
mask = df1.col1 != df2.col1
print (mask)
0 False
1 True
2 True
3 True
Name: col1, dtype: bool
mask = ~df1.col1.isin(df2.col1)
print (mask)
0 False
1 True
2 True
3 False
Name: col1, dtype: bool
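The comparison above can be reproduced with the sketch below. The frames are rebuilt from the printed output; the variable names rowwise and membership are illustrative and not part of the original code.

```python
import pandas as pd

df1 = pd.DataFrame({'col1': ['a', 'b', 'c', 'd'],
                    'col2': ['e', 'f', 'g', 'h']})
df2 = pd.DataFrame({'col1': ['a', 'v', 'd', 'w'],
                    'col2': ['y', 'u', 'z', 't']})

# Row-by-row: is the value different from the value in the same row of df2?
rowwise = df1['col1'].ne(df2['col1'])

# Membership: is the value missing from df2.col1 anywhere?
membership = ~df1['col1'].isin(df2['col1'])

# Rows where the two checks disagree
print(df1[rowwise != membership])
```

Only the last row disagrees: d differs from w in the same row, but d does occur elsewhere in df2.col1.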
In this timing on the small sample, the numpy solution is faster:
In [23]: %timeit (~df1.col1.isin(df2.col1))
The slowest run took 7.98 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 198 µs per loop
In [24]: %timeit (~np.in1d(df1.col1,df2.col1))
The slowest run took 9.25 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 42.5 µs per loop
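Timings on four rows are dominated by fixed overhead, so it may be worth repeating the measurement on data closer to your own. The following is only a sketch under assumptions: the size n, the random integer data and the helper names with_pandas and with_numpy are all hypothetical, and it uses numpy.isin in place of numpy.in1d. No results are shown because they depend on the machine and library versions.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # hypothetical size, adjust to your data

# Random integer columns (assumed; the original columns hold strings)
s1 = pd.Series(rng.integers(0, 1000, size=n))
s2 = pd.Series(rng.integers(0, 500, size=n))

def with_pandas():
    return ~s1.isin(s2)

def with_numpy():
    return ~np.isin(s1.to_numpy(), s2.to_numpy())

# Both variants should produce the same mask
assert np.array_equal(with_pandas().to_numpy(), with_numpy())

for fn in (with_pandas, with_numpy):
    seconds = timeit.timeit(fn, number=10)
    print(f"{fn.__name__}: {seconds / 10 * 1e3:.2f} ms per call")
```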