
I am aware there are many questions on the topic of chained logical operators using np.where.

I have 2 dataframes:

df1
   A  B  C  D  E  F Postset
0  1  2  3  4  5  6     yes
1  1  2  3  4  5  6      no
2  1  2  3  4  5  6     yes

df2
   A  B  C  D  E  F Preset
0  1  2  3  4  5  6    yes
1  1  2  3  4  5  6    yes
2  1  2  3  4  5  6    yes

I want to compare the uniqueness of the rows in each dataframe. To do this, I need to check that all values are equal for a number of selected columns.

From this question, if I am checking columns A, B, C, D, E, F, I can do:

np.where((df1.A != df2.A) | (df1.B != df2.B) | (df1.C != df2.C) | (df1.D != df2.D) | (df1.E != df2.E) | (df1.F != df2.F))

Which correctly gives:

(array([], dtype=int64),)

i.e. the values in all columns are independently equal for both dataframes.

This is fine for a small dataframe, but my real dataframe has a high number of columns that I must check. The np.where condition is too long to write out with accuracy.

Instead, I would like to put my columns into a list:

columns_check_list = ['A','B','C','D','E','F'] 

And use my np.where statement to perform my check over all columns automatically.

This obviously doesn't work, but it's the type of form I am looking for. Something like:

check = np.where([df1[column] != df2[column] | for column in columns_check_list])

How can I achieve this?

Considerations:

  • I have many columns
  • The format of my data is fixed.
  • The values within columns may contain either strings or floats.
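One way to get the form described above is to OR the per-column inequality Series together with `functools.reduce` and `operator.or_`, which builds exactly the hand-written chain from a list. A minimal sketch, using small stand-in dataframes in place of the real data:

```python
import operator
from functools import reduce

import numpy as np
import pandas as pd

# Stand-in data: two identical frames, as in the question
df1 = pd.DataFrame({c: [1, 2, 3] for c in 'ABCDEF'})
df2 = pd.DataFrame({c: [1, 2, 3] for c in 'ABCDEF'})
columns_check_list = ['A', 'B', 'C', 'D', 'E', 'F']

# OR together (df1.A != df2.A) | (df1.B != df2.B) | ... programmatically
condition = reduce(operator.or_, (df1[c] != df2[c] for c in columns_check_list))
check = np.where(condition)
print(check)  # (array([], dtype=int64),) -- all selected columns match
```

This keeps the original `np.where` semantics while letting the column list drive the comparison.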
Chuck

2 Answers


You can use np.logical_or's reduce method on the values of the comparison:

>>> import numpy as np
>>> np.logical_or.reduce((df1 != df2).values, axis=1)  # along rows
array([False, False, False], dtype=bool)               # each value represents a row

You might need to restrict the comparison to the relevant columns, either before doing the comparison:

(df1[include_columns_list] != df2[include_columns_list]).values

or after:

(df1 != df2)[include_columns_list].values

Besides np.logical_or there's also np.bitwise_or, but if you're dealing with booleans (and the comparison returns an array of booleans) these are equivalent.
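To see the row-wise reduction in action, here is a small sketch with made-up dataframes where one row differs (the data and column names are illustrative, not from the question):

```python
import numpy as np
import pandas as pd

# Mixed dtypes (int, float, string) are fine for element-wise != comparison
df1 = pd.DataFrame([[1, 2.0, 'x'], [1, 2.0, 'x']], columns=['A', 'B', 'C'])
df2 = pd.DataFrame([[1, 2.0, 'x'], [1, 9.0, 'x']], columns=['A', 'B', 'C'])

include_columns_list = ['A', 'B', 'C']

# OR the per-element inequalities along each row: True => row differs somewhere
mask = np.logical_or.reduce(
    (df1[include_columns_list] != df2[include_columns_list]).values, axis=1)
print(mask)  # [False  True] -- the second row differs in column B
```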

MSeifert

It seems you need all to check if all values are True per row, or any to check if at least one value is True per row:

mask = ~(df1[columns_check_list] == df2[columns_check_list]).all(axis=1).values
print (mask)
[False False False]

Or more readable, thanks IanS:

mask = (df1[columns_check_list] != df2[columns_check_list]).any(axis=1).values
print (mask)
[False False False]

It is also possible to compare the underlying NumPy arrays:

mask = (df1[columns_check_list].values != df2[columns_check_list].values).any(axis=1)
print (mask)
[False False False]
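Since the question notes the columns may hold either strings or floats, here is a sketch with mixed-dtype stand-in data (the values are illustrative), showing that the `any(axis=1)` approach handles both:

```python
import pandas as pd

# Column A holds floats, column B holds strings
df1 = pd.DataFrame({'A': [1.5, 2.5], 'B': ['yes', 'no']})
df2 = pd.DataFrame({'A': [1.5, 2.5], 'B': ['yes', 'maybe']})
columns_check_list = ['A', 'B']

# True where any selected column differs in that row
mask = (df1[columns_check_list] != df2[columns_check_list]).any(axis=1).values
print(mask)  # [False  True] -- second row differs in the string column
```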
jezrael
  • Or `(df1[columns_check_list] != df2[columns_check_list]).any(axis=1)` which would be more human-readable? – IanS Apr 28 '17 at 10:12
  • Why do I always forget `all` and `any`? These are probably better suited in this case because they can short-circuit the operation (they only process the row as long as there wasn't a difference yet). – MSeifert Apr 28 '17 at 10:25
  • Thank you @jezrael. I completely forgot about `np.any` and `np.all`. It works great. Oh, and congratulations on 100k too! :) – Chuck Apr 28 '17 at 13:07