Logical OR on a subset of columns in a DataFrame

Question

I want to get all rows where (at least) one of the columns in df[mylist] contains True.

I'm currently doing:

df = df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ]

where mylist is a list of strings relating to columns of df. But I would like to do this for any length of mylist.

The only way I can think of to do this is to loop over mylist and create an new dataframe for each element of it and merge/concat or whatever them afterwards. But that doesn't look very smart for me.

Is there a better way?

sorry are you saying that `mylist` is a string of col names like `['col1', 'col2', col3']`? If so why don't you just do `df[myList]`? — EdChum, Jul 24 '15 at 08:59

unutbu · Accepted Answer · 2015-07-24T13:04:01.113

Building on LondonRob's answer, you could use

df.loc[df[mylist].any(axis=1)]

Calling the DataFrame's any method will perform better than using apply to call Python's builtin any function once per row.

Or you could use np.logical_or.reduce:

df.loc[np.logical_or.reduce(df[mylist], axis=1)]

For large DataFrames, using np.logical_or may be quicker:

In [30]: df = pd.DataFrame(np.random.binomial(1, 0.1, size=(100,300)).astype(bool))

In [31]: %timeit df.loc[np.logical_or.reduce(df, axis=1)]
1000 loops, best of 3: 261 µs per loop

In [32]: %timeit df.loc[df.any(axis=1)]
1000 loops, best of 3: 636 µs per loop

In [33]: %timeit df[df.apply(any, axis=1)]
100 loops, best of 3: 2.13 ms per loop

Note that df.any has extra features, such as the ability to skip NaNs. In this case, if the columns are boolean-valued, then there can not be any NaNs (since NaNs are float values). So np.logical_or.reduce is quicker.

import numpy as np
import pandas as pd
np.random.seed(2014)
df = pd.DataFrame(np.random.binomial(1, 0.1, size=(10,3)).astype(bool), 
                  columns=list('ABC'))
print(df)
#        A      B      C
# 0  False  False  False
# 1   True  False  False
# 2  False  False  False
# 3   True  False  False
# 4  False  False  False
# 5  False  False  False
# 6  False   True  False
# 7  False  False  False
# 8  False  False  False
# 9  False  False  False

mylist = list('ABC')
print(df[ df[mylist[0]] | df[mylist[1]] | df[mylist[2]] ])
print(df.loc[df[mylist].any(axis=1)])
print(df.loc[np.logical_or.reduce(df[mylist], axis=1)])

yields the rows where at least one of the columns is True:

       A      B      C
1   True  False  False
3   True  False  False
6  False   True  False

score 4 · Answer 2 · answered Jul 24 '15 at 11:39

There's a much simpler way to do this using python's built in any function:

In []: mylist
Out[]: ['A', 'B']

In []: df
Out[]: 
       A      B      C
0  False  False  False
1   True  False  False
2  False  False  False
3   True  False  False
4  False  False  False
5  False  False  False
6  False   True  False
7  False  False  False
8  False  False  False
9  False  False  False

You can apply the function any along the rows of df by using axis=1. In this case I'll only apply any to a subset of the columns:

In []: df[mylist].apply(any, axis=1)
Out[]: 
0    False
1     True
2    False
3     True
4    False
5    False
6     True
7    False
8    False
9    False
dtype: bool

This gives us the perfect way to select our rows:

In []: df[df[mylist].apply(any, axis=1)]
Out[]: 
       A      B      C
1   True  False  False
3   True  False  False
6  False   True  False

After re-reading the now clearer question I realise my answer had nothing to do with what the OP had wanted albeit the OP had worded their question in a highly obfuscated way — EdChum, Jul 24 '15 at 12:26

Logical OR on a subset of columns in a DataFrame

2 Answers2

Linked

Related