Sample data to illustrate:
import pandas as pd
animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
'legs': [2, 2, 4],
'flight': [False, True, False],
'beak': [True, True, True],
'feathers': [True, True, False]})
name | legs | flight | beak | feathers |
---|---|---|---|---|
ostrich | 2 | ✔ | ✔ | |
parrot | 2 | ✔ | ✔ | ✔ |
platypus | 4 | ✔ |
What already works
Pandas makes it easy to check an entire column (which is a series) against a condition, and the result (a series of booleans) can be used to filter the dataframe with boolean indexing:
bipeds = (animals.legs == 2)
print(animals[bipeds])
name legs flight beak feathers
0 ostrich 2 False True True
1 parrot 2 True True True
In my use case, each such condition is being parsed from a term in a text search string, so I need to construct them programmatically. (I'm aware of Pandas' query, but I need different functionality.) Writing a function to do this is pretty straightforward:
def comp_search(df, column_name, comp, value):
return getattr(df[column_name], f'__{comp}__')(value)
bipeds = comp_search(animals, 'legs', 'eq', 2)
Checking any given boolean column is as simple as, for instance, animals[animals.feathers]
.
What I'd like to do
I want to perform set comparisons against the collection of boolean columns: finding, for instance, all animals that have at least a certain set of features, or less than a set, etc. Extrapolating from earlier, I can picture such a condition looking like this:
set(df[features]) <= set(values)
And such a condition could hypothetically be built like so:
def set_comp_search(df, column_names, comp, values):
return getattr(set(df[column_names]), f'__{comp}__')(set(values))
Of course neither of these work, as set()
of a dataframe creates an ordinary set of its column names.
What works, but is highly inefficient
The above can be achieved by using apply
to convert each row of booleans into a set, then comparing against the resulting series of sets:
def row_to_set(row):
return set(label for label, value
in zip(row.index, row)
if value)
def set_comp_search(df, column_names, comp, values):
series_of_sets = df[column_names].apply(row_to_set, axis=1)
return getattr(series_of_sets, f'__{comp}__')(set(values))
Nice and concise! Unfortunately, iterating with apply
becomes exceedingly slow when the source dataframe grows to thousands of rows long.
What works, but seems like a reimplementation
If I hardcode an equivalent boolean expression for each individual set comparison like this, the resulting comparisons are vectorized (performed on entire columns, rather than iterated at the Python level).
def set_comp_search(df, column_names, comp, values):
other_column_names = set(column_names) - set(values)
value_columns = df[values]
other_columns = df[other_column_names]
if comp == 'gt':
# All the searched features, and at least one other
return value_columns.all(axis=1) & other_columns.any(axis=1)
if comp == 'ge':
# All the searched features
return value_columns.all(axis=1)
if comp == 'eq':
# All the searched features, and none other
return value_columns.all(axis=1) & ~other_columns.any(axis=1)
if comp == 'le':
# No other features
return ~other_columns.any(axis=1)
if comp == 'lt':
# Not all of the searched features, and none other
return ~value_columns.all(axis=1) & ~other_columns.any(axis=1)
So if I want a condition to represent set(animals[features]) > {'beak'}
:
more_than_beak = set_comp_search(animals, {'flight', 'beak', 'feathers'},
'gt', {'beak'})
# Converts to: (animals.beak) & (animals.flight | animals.feathers)
print(animals[more_than_beak])
name legs flight beak feathers
0 ostrich 2 False True True
1 parrot 2 True True True
# Correctly omits the platypus
Clunkiness aside, this runs sufficiently quickly. But I feel like I must be reinventing a wheel. This seems like a roughly similar use-case to what the Series.str
methods do, though it would need to operate with a dataframe, a sequence of series, or a numpy array, rather than a single series. (Sadly there is no DataFrame.set
module.)
So my question is: Does Pandas provide a vectorized method for set-like comparison against collections of boolean columns?
(I've also looked at this question, since it sounds similar, but it's not applicable to set-like behavior.)