Efficient chaining of boolean indexers in pandas DataFrames

Question

I am trying to very efficiently chain a variable amount of boolean pandas Series, to be used as a filter on a DataFrame through boolean indexing.

Normally when dealing with multiple boolean conditions, one chains them like this

condition_1 = (df.A > some_value)
condition_2 = (df.B <= other_value)
condition_3 = (df.C == another_value)
full_indexer = condition_1 & condition_2 & condition_3

but this becomes a problem with a variable amount of conditions.

bool_indexers = [
    condition_1,
    condition_2,
    ...,
    condition_N,
    ]

I have tried out some possible solutions, but I am convinced it can be done more efficiently.

Option 1
Loop over the indexers and apply consecutively.

full_indexer = bool_indexers[0]
for indexer in bool_indexers[1:]:
    full_indexer &= indexer

Option 2
Put into a DataFrame and calculate the row product.

full_indexer = pd.DataFrame(bool_indexers).product(axis=0)

Option 3
Use numpy.product (like in this answer) and create a new Series out of the result.

full_indexer = pd.Series(np.prod(np.vstack(bool_indexers), axis=0))

All three solutions are somewhat inefficient because they rely on looping or force you to create a new object (which can be slow if repeated many times).

Can it be done more efficiently or is this it?

score 4 · Accepted Answer · answered Feb 15 '22 at 21:29

4

Use np.logical_and:

import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [0, 1, 2], 'B': [0, 1, 2], 'C': [0, 1, 2]})
m1 = df.A > 0
m2 = df.B <= 1
m3 = df.C == 1

m = np.logical_and.reduce([m1, m2, m3])
# OR m = np.all([m1, m2, m3], axis=0)

out = df[np.logical_and.reduce([m1, m2, m3])]

Output:

>>> pd.concat([m1, m2, m3], axis=1)
       A      B      C
0  False   True  False
1   True   True   True
2   True  False  False

>>> m
array([False,  True, False])

>>> out
   A  B  C
1  1  1  1

answered Feb 15 '22 at 21:29

Corralien

109,409
8
28
52

1

they could also use `reduce(pd.core.arraylike.operator.and_, [m1, m2, m3])` but using `numpy` feels more natural and is (much) faster. – Pierre D Feb 16 '22 at 17:32
`reduce` from `functools`? – Corralien Feb 16 '22 at 17:35
indeed, `functools.reduce`. Believe it or not, I still use it from time to time. – Pierre D Feb 16 '22 at 17:36
I used too :) [here](https://stackoverflow.com/a/70558011/15239951) or [here](https://stackoverflow.com/a/68369856/15239951) and in many other [answers](https://stackoverflow.com/search?tab=relevance&q=user%3a15239951%20reduce%20functools) – Corralien Feb 16 '22 at 17:40
@Corralien, I timed the different one-liner notations (in order of decreasing performance): `pd.Series(np.prod(np.vstack(indexers), axis=0))` -> 332 µs ± 113 µs `np.all(indexers, axis=0)` -> 8.15 ms ± 427 µs `pd.DataFrame(indexers).product(axis=0)` -> 154 ms ± 4.18 ms I am actually quite surprised that the `np.prod` + `pd.Series` approach is so efficient – Wouter Feb 17 '22 at 12:04

Efficient chaining of boolean indexers in pandas DataFrames

1 Answers1