I am trying to filter a DataFrame using several Boolean columns that are part of the DataFrame, but have been unable to do so.

Sample data:

A | B | C | D
John Doe | 45 | True | False
Jane Smith | 32 | False | False
Alan Holmes | 55 | False | True
Eric Lamar | 29 | True | True

The dtype of columns C and D is bool. I want to create a new DataFrame (df1) with only the rows where either C or D is True. It should look like this:

A | B | C | D
John Doe | 45 | True | False
Alan Holmes | 55 | False | True
Eric Lamar | 29 | True | True

I've tried something like this, which runs into issues because it can't handle the Boolean type:

df1 = df[(df['C']=='True') or (df['D']=='True')]

Any ideas?

MaxU - stand with Ukraine
Maya Harary

5 Answers

In [82]: d
Out[82]:
             A   B      C      D
0     John Doe  45   True  False
1   Jane Smith  32  False  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 1:

In [83]: d.loc[d.C | d.D]
Out[83]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 2:

In [94]: d[d[['C','D']].any(axis=1)]
Out[94]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Solution 3:

In [95]: d.query("C or D")
Out[95]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

PS If you change your solution to:

df[(df['C']==True) | (df['D']==True)]

it'll work too

Pandas docs - boolean indexing
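
To see concretely why the original `or`-based attempt fails, here's a small sketch (the frame mirrors the sample data above): Python's `or` asks each operand for a single truth value, which a Series refuses to provide.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["John Doe", "Jane Smith", "Alan Holmes", "Eric Lamar"],
    "B": [45, 32, 55, 29],
    "C": [True, False, False, True],
    "D": [False, False, True, True],
})

# Python's `or` needs one truth value per operand; a Series has many,
# so pandas raises "The truth value of a Series is ambiguous":
try:
    df[(df["C"] == True) or (df["D"] == True)]
except ValueError as err:
    print(err)

# The element-wise `|` operator evaluates row by row instead:
print(df[df["C"] | df["D"]])
```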


Why we should NOT use the "PEP-compliant" `df["col_name"] is True` instead of `df["col_name"] == True`:

In [11]: df = pd.DataFrame({"col":[True, True, True]})

In [12]: df
Out[12]:
    col
0  True
1  True
2  True

In [13]: df["col"] is True
Out[13]: False               # <----- oops, that's not exactly what we wanted
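
If a linter flags `== True` (E712), note that for a genuine bool column the comparison is redundant anyway: the column itself is already a valid mask. A minimal sketch:

```python
import pandas as pd

df = pd.DataFrame({"col": [True, False, True]})

# A bool column is itself a mask; no comparison needed, and no E712:
print(df[df["col"]])

# An equivalent method-call spelling, if you prefer to be explicit:
print(df[df["col"].eq(True)])
```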
Steven D.
MaxU - stand with Ukraine
  • Man, I think most of questioners were spoiled by us ...:) Nice solution – BENY Sep 13 '17 at 22:12
  • Wow so much simpler than I had expected, thank you! I ended up using solution 3 because I actually had 4 boolean variables in my actual dataset and that one was the neatest - worked like a charm! I didn't realize that bools worked like that, i.e. that I didn't need to define the content of the bool (1/0, True/False) and that it automatically assumes True. Thanks again! – Maya Harary Sep 14 '17 at 00:16
  • Wow, you've covered all the basics. I had a hard time coming up with answers :p – cs95 Sep 14 '17 at 00:42
  • if it has to be exactly 1, you could use sum `df[df[].sum(axis=1) == 1]` – Maarten Fabré Sep 14 '17 at 10:19
  • @MaartenFabré, yes, we can still do `df[df[].sum(axis=1) > 0]`, but `df[df[].any(axis=1)]` is more idiomatic IMO... – MaxU - stand with Ukraine Sep 14 '17 at 10:20
  • that's why I would only use it if exactly 1 `True` value is asked. The OP used `either`, which may indicate one or the other, but not both – Maarten Fabré Sep 14 '17 at 10:21
  • `df[(df.var == True)].count()` fires E712 (use is True instead of ==) for me – kev Jul 31 '19 at 07:19
  • @kev, thank you for your advice, but `(df['C'] is True)` will __always__ return `False`, independently of the data. I don't think this will improve my answer ;) – MaxU - stand with Ukraine Jul 31 '19 at 07:37
  • @kev, I wouldn't intentionally break the code (`df["col_name"] is True` - produces a __wrong result__) just to be PEP compliant. I have added an example in the answer. BTW there are three options in my answer and all of them are PEP compliant ;) – MaxU - stand with Ukraine Jul 31 '19 at 08:15

Hooray! More options!

np.where

df[np.where(df.C | df.D, True, False)]

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True  

pd.Series.where on df.index

df.loc[df.index.where(df.C | df.D).dropna()]

               A   B      C      D
0.0     John Doe  45   True  False
2.0  Alan Holmes  55  False   True
3.0   Eric Lamar  29   True   True

df.select_dtypes

df[df.select_dtypes([bool]).any(axis=1)]

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True

Abusing np.select

df.iloc[np.select([df.C | df.D], [df.index])].drop_duplicates()

             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True
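
The `select_dtypes` variant above is handy when there are many flag columns (the OP mentioned having four), since it never has to name them. A sketch with hypothetical flag columns f1–f4:

```python
import pandas as pd

# Hypothetical frame with four flag columns:
df = pd.DataFrame({
    "name": ["a", "b", "c"],
    "f1": [True, False, False],
    "f2": [False, False, False],
    "f3": [False, True, False],
    "f4": [False, False, False],
})

# Keep rows where any bool column is True, without listing them:
print(df[df.select_dtypes(bool).any(axis=1)])
```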
cs95

Or

d[d.eval('C or D')]

Out[1065]:
             A   B      C      D
0     John Doe  45   True  False
2  Alan Holmes  55  False   True
3   Eric Lamar  29   True   True
BENY

So, the easiest way to do this:

import numpy as np
import pandas as pd

students = [ ('jack1', 'Apples1' , 341) ,
             ('Riti1', 'Mangos1' , 311) ,
             ('Aadi1', 'Grapes1' , 301) ,
             ('Sonia1', 'Apples1', 321) ,
             ('Lucy1', 'Mangos1' , 331) ,
             ('Mike1', 'Apples1' , 351) ,
             ('Mik', 'Apples1' , np.nan)
             ]
# Create a DataFrame object
df = pd.DataFrame(students, columns=['Name1', 'Product1', 'Sale1'])
print(df)


    Name1 Product1  Sale1
0   jack1  Apples1  341.0
1   Riti1  Mangos1  311.0
2   Aadi1  Grapes1  301.0
3  Sonia1  Apples1  321.0
4   Lucy1  Mangos1  331.0
5   Mike1  Apples1  351.0
6     Mik  Apples1    NaN

# Select rows for which the 'Product1' column contains the value 'Apples1'
subset = df[df['Product1'] == 'Apples1']
print(subset)

    Name1 Product1  Sale1
0   jack1  Apples1  341.0
3  Sonia1  Apples1  321.0
5   Mike1  Apples1  351.0
6     Mik  Apples1    NaN

# Select rows for which 'Product1' contains 'Apples1' AND 'Sale1' is not null

subsetx= df[(df['Product1'] == "Apples1")  & (df['Sale1'].notnull())]
print(subsetx)
    Name1 Product1  Sale1
0   jack1  Apples1  341.0
3  Sonia1  Apples1  321.0
5   Mike1  Apples1  351.0

# Select rows for which 'Product1' contains 'Apples1' AND 'Sale1' equals 351

subsetx= df[(df['Product1'] == "Apples1")  & (df['Sale1'] == 351)]
print(subsetx)

   Name1 Product1  Sale1
5  Mike1  Apples1  351.0

# Another example
subsetData = df[df['Product1'].isin(['Mangos1', 'Grapes1']) ]
print(subsetData)

   Name1 Product1  Sale1
1  Riti1  Mangos1  311.0
2  Aadi1  Grapes1  301.0
4  Lucy1  Mangos1  331.0

Here is the source of this code: https://thispointer.com/python-pandas-select-rows-in-dataframe-by-conditions-on-multiple-columns/
I added minor changes to it.
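
One more variation in the same spirit (not from the linked article): a condition can be negated with `~`, which composes with `&` and `|` in the same way.

```python
import pandas as pd

students = [('jack1', 'Apples1', 341), ('Riti1', 'Mangos1', 311),
            ('Aadi1', 'Grapes1', 301)]
df = pd.DataFrame(students, columns=['Name1', 'Product1', 'Sale1'])

# ~ inverts a boolean mask: everything that is NOT an apple
print(df[~(df['Product1'] == 'Apples1')])
```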

rra

You could try this:

df1 = df[(df['C']=='True') | (df['D']=='True')]

Note:

  1. The or logical operator needs to be replaced by the bitwise | operator.
  2. Ensure that () are used to enclose each of the operands.
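
Note, though, that with a genuine bool dtype (as in the question) the string comparison `== 'True'` matches nothing; the comparison has to use the literal True, or be dropped entirely. A quick sketch:

```python
import pandas as pd

df = pd.DataFrame({"C": [True, False], "D": [False, True]})

# Comparing a bool column to the *string* 'True' selects no rows:
print(df[(df["C"] == "True") | (df["D"] == "True")])

# With the boolean literal (or no comparison at all) it works:
print(df[(df["C"] == True) | (df["D"] == True)])
print(df[df["C"] | df["D"]])
```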
samuelv
    This is similar to answers already given half a year before and doesn't work. Please try out on the example data the next time. – Nico Albers Feb 24 '18 at 21:05