2

I used to do the following when altering a dataframe column based on a condition (in this case, every woman gets a wage of 200).

import pandas as pd
df = pd.DataFrame([[False,100],[True,100],[True,100]],columns=['female','wage'])
df.loc[df['female'] == True,'wage'] = 200

The PEP 8 style checker (in Spyder) reports for line 3:

comparison to True should be 'if cond is True:' or 'if cond:'

Changing the last line to

df.loc[df['female'] is True,'wage'] = 200

yields

KeyError: 'cannot use a single bool to index into setitem'

because now the statement is evaluated to a single boolean value and not to a Series.

Is this a case where one has to deviate from styling conventions?

Georgy
E. Sommer
  • Insofar as it doesn't work otherwise, Yes. – ifly6 Jul 12 '18 at 20:43
  • [This question](https://stackoverflow.com/questions/4050335/strange-pep8-recommandation-on-comparing-boolean-values-to-true-or-false) may help you. – harvpan Jul 12 '18 at 20:46
  • What happens if you make that `df.loc[(df['female'] is True),'wage']`? I don't know offhand the associativity rules for `is`, so it's possible that it's being interpreted as `df.loc[df['female'] is (True,'wage')]` which I don't think is what you mean. – BowlingHawk95 Jul 12 '18 at 20:51
  • This yields the same KeyError. – E. Sommer Jul 12 '18 at 20:57

2 Answers

4

You should use df['female'] with no comparison, rather than comparing to True with any operator. df['female'] is already the mask you need.

Comparison to True with == is almost always a bad idea, even in NumPy or Pandas.
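
As a minimal sketch of the advice above, the snippet below uses the Boolean column directly as a mask and negates it with ~. The wage of 150 and the mixed-type Series named mixed are hypothetical, added only to illustrate how == True can match more than the object True itself.

```python
import pandas as pd

df = pd.DataFrame([[False, 100], [True, 100], [True, 100]],
                  columns=['female', 'wage'])

# The Boolean column is already a mask, so no comparison is needed.
df.loc[df['female'], 'wage'] = 200

# Negate the mask with ~ to select the rows where the column is False.
# (The value 150 is an illustrative assumption, not from the question.)
df.loc[~df['female'], 'wage'] = 150

# Hypothetical mixed-type column: == True also matches 1, because 1 == True.
mixed = pd.Series([True, 1, 'yes', False], dtype=object)
loose = (mixed == True).tolist()
# An identity check per element only accepts the object True itself.
strict = mixed.apply(lambda x: x is True).tolist()
```

On a column that really is of dtype bool, the plain mask and its ~ negation should be all you need; the per-element identity check is only relevant for unusual object columns.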

user2357112
  • 1
    Don't want to add a new answer, but I think this should include a mention of `1 == True # True` vs `1 is True # False` – MoxieBall Jul 12 '18 at 20:45
  • OK, I also do that occasionally. But what if I need to check if a condition is false? `~df['female']`? – E. Sommer Jul 12 '18 at 20:47
  • 1
    @E.Sommer: Yes, use `~`. – user2357112 Jul 12 '18 at 20:47
  • What if `df['female']` also has non-`bool` values, such as strings? Not checking `== True` will select those rows as well. – Michael Litvin Apr 14 '19 at 13:03
  • @MichaelLitvin: In that case, first, you've got a really weird column there, and second, `== True` will also select any rows with value `1`, `1.0`, or anything else that compares equal to `True`. In that case, you might need something like `df['female'].apply(lambda x: x is True)`, or something even weirder if you want to allow instances of `numpy.bool_` or other bool-like types. – user2357112 Apr 14 '19 at 20:00
2

Just do

df.loc[df['female'], 'wage'] = 200 

In fact, df['female'] is already a Boolean Series, and it has exactly the same values as the Boolean Series returned by evaluating df['female'] == True. (Series is the Pandas term for a single column of a dataframe.)
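
A quick way to check this equivalence yourself, sketched with the question's dataframe (the variable names mask_plain and mask_compared are illustrative):

```python
import pandas as pd

df = pd.DataFrame([[False, 100], [True, 100], [True, 100]],
                  columns=['female', 'wage'])

mask_plain = df['female']             # the column itself
mask_compared = df['female'] == True  # the comparison the checker flags

# Series.equals checks that both Series hold the same values and dtype.
same_values = mask_plain.equals(mask_compared)
both_bool = mask_plain.dtype == bool and mask_compared.dtype == bool
```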

By the way, the last statement is precisely why df['female'] is True should never work. In Python, the is operator is reserved for object identity, not for comparing values for equality. df['female'] will always be a Series (if df is a Pandas dataframe), and a Series will never be the same object as the single object True.

To understand this better, think of the difference, in English, between 'equal' and 'same'. In German, this is the difference between 'selbe' (identity) and 'gleiche' (equality). In other languages, this distinction is not as explicit.

Thus, in Python, you can compare a (reference to an) object to the special object None with if obj is None: ..., or check that two variables ('names' in Python terminology) point to the exact same object with if a is b: .... But identity is a much stronger assertion than equality, a == b. In fact, evaluating a == b can return anything, not just a single Boolean value; it all depends on the class (type) of a. In your context, a == b yields a Boolean Series when a and b are both Pandas Series.
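
The difference between identity and equality can be sketched as follows; the lists a, b, c and the Series s are illustrative names, not part of the question's code.

```python
import pandas as pd

a = [1, 2, 3]
b = [1, 2, 3]  # equal contents, but a separate object
c = a          # a second name for the very same object

equal_not_same = (a == b) and (a is not b)
same_object = c is a

s = pd.Series([False, True, True])

# Identity: a Series is never the object True, so this is one plain bool.
identity_result = s is True

# Equality: Series defines ==, so this is an element-wise Boolean Series.
equality_result = s == True
```

This is why indexing with the is expression fails: .loc receives a single bool instead of a mask with one entry per row.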

By the way, if you want to check that all values agree between two Series a and b, evaluate (a == b).all(), which reduces the whole Series to a single Boolean value that is True if and only if a[i] == b[i] for every index i.
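
A small sketch of that reduction, using three made-up Series that share the default integer index:

```python
import pandas as pd

a = pd.Series([1, 2, 3])
b = pd.Series([1, 2, 3])
c = pd.Series([1, 2, 4])

# (a == b) is an element-wise Boolean Series; .all() collapses it to one value.
# bool(...) converts the NumPy Boolean that .all() returns into a plain bool.
all_equal_ab = bool((a == b).all())
all_equal_ac = bool((a == c).all())
```

Note that == between two Series expects them to be identically labeled; comparing Series whose index labels differ raises a ValueError rather than returning a mask.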

Mateo