If I use the standard Python boolean operators and/or/not, one nice feature is that they treat None the way I would logically expect. That is, not only

(True and True) == True
(True and False) == False

but also

(True and None) == None
(False and None) == False
(True or None) == True
(False or None) == None

This follows the logic that, for instance, if A is False and B is unknown, (A and B) must still be False, while (A or B) is unknown.
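As a quick check of the table above, here is a minimal sketch in plain Python. It also includes two cases with None on the left, which are not in the list above: since and/or simply return one of their operands, None and False gives None rather than False, so the match with three-valued logic depends on operand order.

```python
# Python's and/or return one of their operands, so None can pass through.
results = {
    "True and None": True and None,    # None  (unknown)
    "False and None": False and None,  # False (short-circuits on False)
    "True or None": True or None,      # True  (short-circuits on True)
    "False or None": False or None,    # None  (unknown)
    "None and False": None and False,  # None, although logically this is False
    "None or True": None or True,      # True
}

for expr, value in results.items():
    print(f"{expr:<16} -> {value!r}")
```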

I need to perform boolean operations on pandas DataFrames with missing data, and was hoping to be able to use the same logic. For element-wise boolean logic on numpy arrays and pandas Series, we have to use the bitwise operators &, | and ~. Pandas behaves partly like and/or/not and partly differently: in short, it seems to return False wherever the value should logically be unknown.

For example:

a = pd.Series([True,False,True,False])
b = pd.Series([True,True,None,None])

Then we get

> a & b
0     True
1    False
2    False
3    False
dtype: bool

and

> a | b
0     True
1     True
2     True
3    False

I would expect a & b to give the Series [True,False,None,False] and a | b to give the Series [True,True,True,None]. The actual results match that expectation, except that they contain False wherever I would expect a missing value.

Finally, ~b just gives a TypeError:

TypeError: bad operand type for unary ~: 'NoneType'

which seems odd since & and | at least partially work.
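One way to see where the error comes from, as a sketch rather than a full explanation of the pandas internals: because b contains None, it is stored with dtype object instead of bool, so ~ ends up being applied to the Python objects one by one and fails when it reaches None.

```python
import pandas as pd

b = pd.Series([True, True, None, None])

# None cannot live in a plain bool column, so pandas falls back to object.
print(b.dtype)

try:
    inverted = ~b
except TypeError as exc:
    # Inverting the None entries is what raises the error.
    inverted = None
    print(f"TypeError: {exc}")
```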

Is there a better way to carry out boolean logic in this situation? Is this a bug in Pandas?

Analogous tests with numpy arrays just give type errors, so I assume Pandas is handling the logic itself here.

Sociopath
user2428107

1 Answer

You might need something like this:

c = pd.Series([x and y for x,y in zip(a,b)])

print(c)

Output:

0     True
1    False
2     None
3    False

And correspondingly, for the second expression:

d = pd.Series([x or y for x,y in zip(a,b)])

print(d)

Output:

0    True
1    True
2    True
3    None

Also look here for an explanation of the difference between the and and & operations.


If you want to apply and to two columns a and b of a DataFrame df, one way is to define a function and apply it to df row by row:

df = pd.DataFrame({'a':[True,False,True,False], 'b':[True,True,None,None]})
def and_(row):
    return row['a'] and row['b']
df.loc[:, 'a_and_b'] = df.apply(and_, axis=1)
print(df)

Output:

       a     b a_and_b
0   True  True    True
1  False  True   False
2   True  None    None
3  False  None   False
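A vectorised alternative, sketched under the assumption that pandas 1.0 or later is available: the nullable "boolean" extension dtype stores missing values as pd.NA and, per the pandas documentation, applies three-valued (Kleene) logic to &, | and ~. This is not something the approach above relies on, and the variable names are only illustrative.

```python
import pandas as pd

# The nullable "boolean" dtype keeps missing entries as pd.NA.
a = pd.Series([True, False, True, False], dtype="boolean")
b = pd.Series([True, True, None, None], dtype="boolean")

a_and_b = a & b  # unknown only where the result really depends on the NA
a_or_b = a | b
not_b = ~b       # no TypeError: missing entries stay missing

print(a_and_b)
print(a_or_b)
print(not_b)
```

An existing object column can be converted with astype("boolean") before combining, for example df['a'].astype("boolean") & df['b'].astype("boolean").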
Ala Tarighati
  • I know I can work with lists, but that loses the advantages of vectorisation, and the code will be clunkier for DataFrames. I was hoping for a way to use pandas- or numpy-native functionality. – user2428107 Nov 06 '18 at 01:54
  • I think you cannot do an element-wise `and` with just `a and b` when `a` and `b` are Series (whereas you can do an element-wise `&`). If you want to `and` two columns of a DataFrame, look at the revised answer. – Ala Tarighati Nov 06 '18 at 13:55