Introduction:
Given a dataframe, I thought that the following equivalence always held:
df[(condition_1) | (condition_2)] <=> df[(condition_2) | (condition_1)]
as in
df[(df.col1==1) | (df.col1==2)] <=> df[(df.col1==2) | (df.col1==1)]
Problem:
But it turns out that it fails in the following situation, which involves NaN
(probably the reason why it fails):
df = pd.DataFrame([[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5,6], [8,9,10]], columns=["A", "B", "C"])
df
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
3 8 9 10
The following works as expected:
df[(df.A.isnull()) | (df.A.str.startswith("a"))]
A B C
0 NaN abc 2
1 abc 2 3
2 NaN 5 6
But if I swap the two conditions, I get a different result:
df[(df.A.str.startswith("a")) | (df.A.isnull())]
A B C
1 abc 2 3
I think that the problem comes from this condition:
df.A.str.startswith("a")
0 NaN
1 True
2 NaN
3 NaN
Name: A, dtype: object
where I get NaN instead of False.
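To see exactly which rows are affected, the mask can be inspected directly. This is a minimal sketch that rebuilds the df from above; the names mask and undecided are introduced here only for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

mask = df.A.str.startswith("a")

# Rows where the .str method returned a missing value instead of True/False:
# the NaN inputs (rows 0 and 2) and the non-string value 8 (row 3).
undecided = mask.isna()

print(mask.dtype)          # object, not bool
print(undecided.tolist())  # [True, False, True, True]
```

So the mask is not a real boolean mask: three of its four entries are neither True nor False.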
Questions:
- Is this behavior expected, or is it a bug? It can lead to silent loss of data if one is not expecting it.
- Why does it behave like this (in a non-commutative way)?
More details:
More precisely, let C1 = (df.A.str.startswith("a")) and C2 = (df.A.isnull()), with:
C1 C2
NaN True
True False
NaN True
NaN False
We have:
C1 | C2
0 False
1 True
2 False
3 False
Name: A, dtype: bool
Here C2 seems to be ignored wherever C1 is NaN, and NaN becomes False.
And here:
C2 | C1
0 True
1 True
2 True
3 False
Name: A, dtype: bool
Here NaN is treated as False (with & the result is all False), but both conditions are evaluated.
Clearly: C1 | C2 != C2 | C1
I wouldn't mind NaN producing weird results as long as commutativity is preserved, but here one of the conditions is not evaluated.
Actually, the NaN in the input isn't the problem, because the same thing happens with column B:
(df.B.str.startswith("a")) | (df.B==2) != (df.B==2) | (df.B.str.startswith("a"))
This is because applying a str method to non-string objects returns NaN*, which, if evaluated first, prevents the second condition from being evaluated. So the main problem remains.
*(this fill value can be chosen with str.startswith("a", na=False), as @ayhan noticed)
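For reference, here is a small sketch of the na=False option mentioned in the footnote, applied to both columns A and B. It assumes the same df as above; c1, c2, b1 and b2 are just illustrative names. With na=False the .str result contains only True/False, so the order of the operands should no longer matter.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[np.nan, "abc", 2], ["abc", 2, 3], [np.nan, 5, 6], [8, 9, 10]],
    columns=["A", "B", "C"],
)

# Column A: na=False replaces the missing results with False.
c1 = df.A.str.startswith("a", na=False)
c2 = df.A.isnull()

print((c1 | c2).equals(c2 | c1))   # True
print(df[c1 | c2].index.tolist())  # [0, 1, 2]

# Column B: same idea, with non-string values mapped to False.
b1 = df.B.str.startswith("a", na=False)
b2 = df.B == 2

print((b1 | b2).equals(b2 | b1))   # True
print(df[b1 | b2].index.tolist())  # [0, 1]
```

This only works around the symptom by making sure both operands are real boolean masks; it does not explain why the NaN case is asymmetric.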