The implicit index-matching that pandas performs for operations between different DataFrame/Series objects is great, and most of the time it just works.

However, I've stumbled on an example that does not work as expected:

import pandas as pd # 0.21.0
import numpy as np # 1.13.3
x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

# logical AND: this works, symmetric as it should be
pd.concat([x, y, x & y, y & x], keys = ['x', 'y', 'x&y', 'y&x'], axis = 1)
#        x      y    x&y    y&x
# 0   True    NaN  False  False
# 1  False    NaN  False  False
# 2   True  False  False  False
# 3   True   True   True   True
# 4    NaN   True  False  False
# 5    NaN  False  False  False

# but logical OR is not symmetric anymore (same for XOR: x^y vs. y^x)
pd.concat([x, y, x | y, y | x], keys = ['x', 'y', 'x|y', 'y|x'], axis = 1)
#        x      y    x|y    y|x
# 0   True    NaN   True  False <-- INCONSISTENT!
# 1  False    NaN  False  False
# 2   True  False   True   True
# 3   True   True   True   True
# 4    NaN   True  False   True <-- INCONSISTENT!
# 5    NaN  False  False  False

Researching a bit, I found two points that seem relevant:

But ultimately, the kicker seems to be that pandas does casting from nan to False at some point. Looking at the above, it appears that this happens after calling np.bitwise_or, while I think this should happen before?

In particular, using np.logical_or does not help because it misses the index alignment that pandas does, and also, I don't want np.nan or False to equal True. (In other words, the answer https://stackoverflow.com/a/37132854/2965879 does not help.)
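To illustrate the last point, here is a minimal sketch (plain NumPy, no pandas involved; the variable name is just for illustration) of why np.logical_or gives the unwanted result: NaN is a non-zero float, so it counts as truthy.

```python
import numpy as np

# NaN is a non-zero float, so NumPy's logical operations treat it as truthy.
nan_or_false = np.logical_or(np.nan, False)

print(bool(np.nan))        # truthiness of NaN in plain Python
print(bool(nan_or_false))  # result of NaN OR False in NumPy
```

Both lines print True, which is exactly the nan | False == True behaviour I want to avoid.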

I think that if this wonderful syntactic sugar is provided, it should be as consistent as possible*, and so | should be symmetric. It's really hard to debug (as happened to me) when something that's always symmetric suddenly isn't anymore.

So finally, the question: Is there any feasible workaround (e.g. overloading something) to salvage x|y == y|x, and ideally in such a way that (loosely speaking) nan | True == True == True | nan and nan | False == False == False | nan?

*even if De Morgan's law falls apart regardless: ~(x&y) cannot fully match ~y|~x because the NaNs only come in at the index alignment (and so are not affected by a previous negation).

Tadhg McDonald-Jensen
Axel
  • this issue seems more like a bug than something to work around, and your post is structured better as a bug report than a question. +1 but you may just want to inform the developers about this. – Tadhg McDonald-Jensen Dec 05 '17 at 18:37
  • There was an [issue](https://github.com/pandas-dev/pandas/issues/6528) about this. I couldn't find them at the moment, but I remember some SO questions on the same subject, too. – ayhan Dec 05 '17 at 18:48
  • @TadhgMcDonald-Jensen, I agree of course. I was just worried that anything regarding NaN-handling will just get pushed off to pandas 2.0 anyway (where Wes McKinney plans unified NaN-handling). I would prefer a solution before that. ;-) – Axel Dec 05 '17 at 18:59
  • @ayhan, the referenced issue leads to another one (https://github.com/pandas-dev/pandas/issues/13896), where it is marked as a milestone for "1.0" since Oct. 2016 (so missed 0.20 and 0.21 already). Seems like it has low priority... – Axel Dec 05 '17 at 19:00

1 Answer

After doing some exploring in pandas, I discovered that there is a function called pandas.core.ops._bool_method_SERIES which is one of several factory functions that wrap the boolean operators for Series objects.

>>> f = pandas.Series.__or__
>>> f #the actual function you call when you do x|y
<function _bool_method_SERIES.<locals>.wrapper at 0x107436bf8>
>>> f.__closure__[0].cell_contents
    #it holds a reference to the other function defined in this factory na_op
<function _bool_method_SERIES.<locals>.na_op at 0x107436b70>
>>> f.__closure__[0].cell_contents.__closure__[0].cell_contents
    #and na_op has a reference to the built-in function or_
<built-in function or_>

This means we could, in theory, define our own method that performs a logical OR with the correct logic. First, let's see what the wrapper actually does with it (remember that an operator function is expected to raise a TypeError if the operation can't be performed):

def test_logical_or(a,b):
    print("**** calling logical_or with ****")
    print(type(a), a)
    print(type(b), b)
    print("******")
    raise TypeError("test_logical_or isn't implemented")

#make the wrapper method
wrapper = pd.core.ops._bool_method_SERIES(test_logical_or, None,None)
pd.Series.logical_or = wrapper #insert method


x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

z = x.logical_or(y) #lets try it out!

print(x,y,z, sep="\n")

When this gets run (at least with pandas version 0.19.1), it prints:

**** calling logical_or with ****
<class 'numpy.ndarray'> [True False True True nan nan]
<class 'numpy.ndarray'> [False False False  True  True False]
******
**** calling logical_or with ****
<class 'bool'> True
<class 'bool'> False
******
Traceback (most recent call last):
   ...

So it looks like pandas first calls our method with two numpy arrays, where for whatever reason the second one already has its NaN values replaced with False but the first one does not, which is likely why the symmetry breaks. When that call fails, it tries again, I'd assume element-wise.

So as a bare minimum to get this working, you can explicitly check that both arguments are numpy arrays, convert any NaN entries to False, and then return np.logical_or(a,b). I'm going to assume that in any other case we just raise an error.

def my_logical_or(a,b):
    if isinstance(a, np.ndarray) and isinstance(b, np.ndarray):
        a[np.isnan(a.astype(float))] = False
        b[np.isnan(b.astype(float))] = False
        return np.logical_or(a,b)
    else:
        raise TypeError("custom logical or is only implemented for numpy arrays")

wrapper = pd.core.ops._bool_method_SERIES(my_logical_or, None,None)
pd.Series.logical_or = wrapper


x = pd.Series([True, False, True, True], index = range(4))
y = pd.Series([False, True, True, False], index = [2,4,3,5])

z = pd.concat([x, y, x.logical_or(y), y.logical_or(x)], keys = ['x', 'y', 'x|y', 'y|x'], axis = 1)
print(z)
#        x      y    x|y    y|x
# 0   True    NaN   True   True <-- same!
# 1  False    NaN  False  False
# 2   True  False   True   True
# 3   True   True   True   True
# 4    NaN   True   True   True <-- same!
# 5    NaN  False  False  False

So that could be your workaround. I would not recommend modifying Series.__or__ itself, since we don't know who else is using it and we don't want to break any code that expects the default behaviour.


Alternatively, we can modify the source code at pandas.core.ops line 943 to fill NaN values with False (or 0) for self in the same way it does with other, so we'd change the line:

    return filler(self._constructor(na_op(self.values, other.values),
                                    index=self.index, name=name))

to use filler(self).values instead of self.values:

    return filler(self._constructor(na_op(filler(self).values, other.values),
                                    index=self.index, name=name))

This also fixes the asymmetry of or and xor. However, I would not recommend it, since it may break other code; I personally don't have nearly enough experience with pandas to determine what it would change in different circumstances.
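For comparison, here is a minimal sketch that relies only on public pandas API instead of private internals or source changes. It assumes that treating labels missing from either Series as False is the desired semantics (so nan | True == True and nan | False == False); the helper name symmetric_or is hypothetical and not part of pandas.

```python
import pandas as pd

def symmetric_or(a, b):
    """Element-wise OR of two boolean Series, treating missing labels as False."""
    # Build the common index first so both operands are filled the same way.
    idx = a.index.union(b.index)
    a_full = a.reindex(idx, fill_value=False)
    b_full = b.reindex(idx, fill_value=False)
    return a_full | b_full

x = pd.Series([True, False, True, True], index=range(4))
y = pd.Series([False, True, True, False], index=[2, 4, 3, 5])

print(pd.concat([x, y, symmetric_or(x, y), symmetric_or(y, x)],
                keys=['x', 'y', 'x|y', 'y|x'], axis=1))
```

Because both operands are reindexed and filled before the operator is applied, neither side ever contains NaN when | runs, so the order of the arguments no longer matters. The same pattern should carry over to ^ by swapping the operator.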

Tadhg McDonald-Jensen