Expressions with "== True" and "is True" give different results

Question

I have the following MCVE:

#!/usr/bin/env python3

import pandas as pd

df = pd.DataFrame([True, False, True])

print("Whole DataFrame:")
print(df)

print("\nFiltered DataFrame:")
print(df[df[0] == True])

The output is the following, which I expected:

Whole DataFrame:
       0
0   True
1  False
2   True

Filtered DataFrame:
      0
0  True
2  True

Okay, but the PEP8 style seems to be wrong, it says: E712 comparison to True should be if cond is True or if cond. So I changed it to is True instead of == True, but now it fails. The output is:

Whole DataFrame:
       0
0   True
1  False
2   True

Filtered DataFrame:
0     True
1    False
2     True
Name: 0, dtype: bool

What is going on?

"Okay, but the PEP8 style seems to be wrong, it says: E712 comparison to True should be if cond is True or if cond.". WTF? PEP8 actually says "Yes: `if greeting`, No: `if greeting == True`, Worse `if greeting is True`". — Matteo Italia, Apr 24 '16 at 16:54
@IanS It may be (not as readable IMO), but the question is not about that :^) "What's the difference between a clever man and a wise man? - A clever man gets out of any trouble with flying colors, a wise man doesn't get _into_ it." — ivan_pozdeev, Apr 26 '16 at 19:15

ivan_pozdeev · Accepted Answer · 2020-06-27T14:10:01.870

The catch here is that in df[df[0] == True], you are not comparing objects to True.

As the other answers say, == is overloaded in pandas to produce a Series instead of a bool as it normally does. [] is overloaded, too, to interpret the Series and give the filtered result. The code is essentially equivalent to:

series = df[0].__eq__(True)
df.__getitem__(series)

So, you're not violating PEP8 by leaving == here.
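
A minimal sketch of the point above, assuming pandas is installed and using the DataFrame from the question. It shows that == yields a boolean Series usable as a mask, while is yields a single plain bool. The names mask, identity, filtered and same are only illustrative.

```python
import pandas as pd

df = pd.DataFrame([True, False, True])

# == is dispatched to Series.__eq__, so the result is an element-wise mask.
mask = df[0] == True  # noqa: E712
print(type(mask).__name__)      # Series
print(mask.tolist())            # [True, False, True]

# `is` cannot be overloaded: it only checks object identity.
identity = df[0] is True
print(identity)                 # False

# Indexing with the mask keeps the rows where it is True.
filtered = df[mask]
print(filtered.index.tolist())  # [0, 2]

# The column is already boolean, so it can serve as the mask itself.
same = df[df[0]]
print(same.equals(filtered))    # True
```

Since the column here is already boolean and has no missing values, df[df[0]] selects the same rows and gives the linter nothing to complain about.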

Essentially, pandas gives familiar syntax unusual semantics - that is what caused the confusion.

According to Stroustrup (sec. 3.3.3), operator overloading has caused trouble for this reason ever since its invention (and he had to think hard about whether to include it in C++). Seeing even more abuse of it in C++, Gosling went to the other extreme in Java and banned it completely, which proved to be exactly that: an extreme.

As a result, modern languages and code tend to support operator overloading but take care not to overuse it and to keep its semantics consistent.

I do not think he recommended to use that. Using `==` is perfectly fine. — MaxNoe, Apr 24 '16 at 17:09
@ClaudiuCreanga PyCharm is smart, but not THAT smart :-) This is exactly why every such checker worth its salt has an option to tell it: "Shut up, I know what I am doing." — ivan_pozdeev, Feb 27 '18 at 14:22

score 9 · Answer 2 · answered Apr 24 '16 at 16:49

In Python, the is operator tests whether two operands are the same object. pandas.Series defines == to act element-wise, but is cannot be overloaded.

Because of that, df[0] is True checks whether df[0] and True are the same object. The result is False, which in turn is equal to 0, so df[df[0] is True] gives you column 0.
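
To illustrate why the two operators behave so differently, here is a toy sketch in plain Python. MiniSeries is a hypothetical class invented for this example, not part of pandas; it only mimics the idea of an element-wise __eq__.

```python
class MiniSeries:
    """Hypothetical stand-in for a Series; not a pandas class."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, other):
        # == calls this method, so it can return anything, here a list of bools.
        return [v == other for v in self.values]


s = MiniSeries([True, False, True])

elementwise = s == True  # noqa: E712
print(elementwise)   # [True, False, True]

# `is` has no hook method; it always compares object identity.
same_object = s is True
print(same_object)   # False
```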

MaxNoe

So, it's basically a bug due to a lack of an overload? – ivan_pozdeev Apr 24 '16 at 16:53
You cannot overload `is`. – MaxNoe Apr 24 '16 at 16:53

afrendeiro · Answer 3 · 2019-03-03T14:23:01.673

One workaround for not having complaints from linters but still reasonable syntax for sub-setting could be:

import pandas as pd

s = pd.Series([True] * 10 + [False])

s.loc[s == True]  # flagged by linters (E712), although it works
s.loc[s.isin([True])]  # not flagged, and not as ugly as s.__eq__(True)

Both also take the same time.

In addition, for dataframes one can use query:

df = pd.DataFrame([
        [True] * 10 + [False],
        list(range(11))],
    index=['T', 'N']).T
df.query("T == True")  # also okay

score 2 · Answer 4 · edited May 23 '17 at 12:31

I think that in pandas, element-wise comparison only works with ==, and the result is a boolean Series. With is, the result is a plain False. More info about is.

print(df[0] == True)
0     True
1    False
2     True
Name: 0, dtype: bool

print(df[df[0]])
      0
0  True
2  True

print(df[df[0] == True])
      0
0  True
2  True

print(df[0] is True)
False

print(df[df[0] is True])
0     True
1    False
2     True
Name: 0, dtype: bool

answered Apr 24 '16 at 16:42

jezrael

And it's done like that, because `==` may be redefined for custom classes and `is` may not. It's the same case like `SQLAlchemy` where clauses - OP has to either ignore warning or disable it with `noqa` comment. – Łukasz Rogalski Apr 24 '16 at 16:46

score 2 · Answer 5 · answered Apr 24 '16 at 20:11

This is an elaboration on MaxNoe's answer, since it was too lengthy to include in the comments.

As he indicated, df[0] is True evaluates to False, which is then coerced to 0, which corresponds to a column name. What is interesting is that df[False] raises a KeyError on a fresh DataFrame but succeeds once df[0] has been accessed:

>>> df = pd.DataFrame([True, False, True])
>>> df[False]
KeyError                                  Traceback (most recent call last)
<ipython-input-21-62b48754461f> in <module>()
----> 1 df[False]

>>> df[0]
0     True
1    False
2     True
Name: 0, dtype: bool
>>> df[False]
0     True
1    False
2     True
Name: 0, dtype: bool

This seems a bit perplexing at first (to me at least) but has to do with how pandas makes use of caching. If you look at how df[False] is resolved, it looks like

  /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1975)__getitem__()
-> return self._getitem_column(key)
  /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/frame.py(1999)_getitem_column()
-> return self._get_item_cache(key)
> /home/matthew/anaconda/lib/python2.7/site-packages/pandas/core/generic.py(1343)_get_item_cache()
-> res = cache.get(item)

Since cache is just a regular Python dict, after running df[0] the cache looks like

>>> cache
{0: 0     True
1    False
2     True
Name: 0, dtype: bool}

so that when we look up False, Python matches it to the key 0, because False == 0 and the two hash to the same value. If we have not already primed the cache using df[0], then res is None, which triggers a KeyError on line 1345 of generic.py

def _get_item_cache(self, item):
1341            """Return the cached item, item represents a label indexer."""
1342            cache = self._item_cache
1343 ->         res = cache.get(item)
1344            if res is None:
1345                values = self._data.get(item)
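
The dict behaviour described above can be checked in isolation. This sketch uses a plain dict with a placeholder string in place of the cached Series; the variable names are illustrative only and are not pandas internals.

```python
# bool is a subclass of int, and False compares and hashes equal to 0.
print(False == 0, hash(False) == hash(0))  # True True

# A dict keyed by 0 therefore also answers lookups for False.
primed_cache = {0: "column 0"}
hit = primed_cache.get(False)
print(hit)    # column 0

# An empty cache returns None, the case the answer describes as
# falling through to self._data.get(item).
empty_cache = {}
miss = empty_cache.get(False)
print(miss)   # None
```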

this is not really related to the original question any more but very interesting... — Tadhg McDonald-Jensen, May 14 '16 at 03:17
