
Sample data to illustrate:

import pandas as pd

animals = pd.DataFrame({'name': ['ostrich', 'parrot', 'platypus'],
                        'legs': [2, 2, 4],
                        'flight': [False, True, False],
                        'beak': [True, True, True],
                        'feathers': [True, True, False]})
       name  legs  flight  beak  feathers
0   ostrich     2   False  True      True
1    parrot     2    True  True      True
2  platypus     4   False  True     False

What already works

Pandas makes it easy to check an entire column (which is a series) against a condition, and the result (a series of booleans) can be used to filter the dataframe with boolean indexing:

bipeds = (animals.legs == 2)
print(animals[bipeds])

          name  legs  flight  beak  feathers
0      ostrich     2   False  True      True
1       parrot     2    True  True      True

In my use case, each such condition is being parsed from a term in a text search string, so I need to construct them programmatically. (I'm aware of Pandas' query, but I need different functionality.) Writing a function to do this is pretty straightforward:

def comp_search(df, column_name, comp, value):
    # e.g. comp='eq' becomes df[column_name].__eq__(value),
    # which is the same as df[column_name] == value
    return getattr(df[column_name], f'__{comp}__')(value)

bipeds = comp_search(animals, 'legs', 'eq', 2)

Checking any given boolean column is as simple as, for instance, animals[animals.feathers].
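For example:

print(animals[animals.feathers])

          name  legs  flight  beak  feathers
0      ostrich     2   False  True      True
1       parrot     2    True  True      True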

What I'd like to do

I want to perform set comparisons against the collection of boolean columns: finding, for instance, all animals that have at least a certain set of features, or less than a set, etc. Extrapolating from earlier, I can picture such a condition looking like this:

set(df[features]) <= set(values)

And such a condition could hypothetically be built like so:

def set_comp_search(df, column_names, comp, values):
    return getattr(set(df[column_names]), f'__{comp}__')(set(values))

Of course neither of these works, since set() of a dataframe creates an ordinary set of its column names.
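For instance:

print(set(animals[['flight', 'beak', 'feathers']]))
# {'flight', 'beak', 'feathers'} -- just the column labels, one set for the whole frame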

What works, but is highly inefficient

The above can be achieved by using apply to convert each row of booleans into a set, then comparing against the resulting series of sets:

def row_to_set(row):
    return set(label for label, value
               in zip(row.index, row)
               if value)

def set_comp_search(df, column_names, comp, values):
    # list() so a set of column names is also accepted as an indexer
    series_of_sets = df[list(column_names)].apply(row_to_set, axis=1)
    return getattr(series_of_sets, f'__{comp}__')(set(values))

Nice and concise! Unfortunately, iterating with apply becomes exceedingly slow once the source dataframe grows to thousands of rows.
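To put a rough number on it, here is one way to measure the slowdown (an illustrative sketch; the inflated test frame and the name big are my own, and actual timings will vary by machine):

import timeit

# Blow the sample data up to tens of thousands of rows.
big = pd.concat([animals] * 10_000, ignore_index=True)

# Time a single superset query; apply calls row_to_set once per row,
# so the Python-level overhead grows linearly with the row count.
seconds = timeit.timeit(
    lambda: set_comp_search(big, ['flight', 'beak', 'feathers'], 'ge', {'beak'}),
    number=1)
print(f'{len(big):,} rows took {seconds:.2f}s')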

What works, but seems like a reimplementation

If I hardcode an equivalent boolean expression for each individual set comparison like this, the resulting comparisons are vectorized (performed on entire columns, rather than iterated at the Python level).

def set_comp_search(df, column_names, comp, values):
    other_column_names = set(column_names) - set(values)
    # Index with lists rather than sets, which newer Pandas versions reject
    value_columns = df[list(values)]
    other_columns = df[list(other_column_names)]

    if comp == 'gt':
        # All the searched features, and at least one other
        return value_columns.all(axis=1) & other_columns.any(axis=1)

    if comp == 'ge':
        # All the searched features
        return value_columns.all(axis=1)

    if comp == 'eq':
        # All the searched features, and none other
        return value_columns.all(axis=1) & ~other_columns.any(axis=1)

    if comp == 'le':
        # No other features
        return ~other_columns.any(axis=1)

    if comp == 'lt':
        # Not all of the searched features, and none other
        return ~value_columns.all(axis=1) & ~other_columns.any(axis=1)

    raise ValueError(f'Unsupported comparison: {comp}')

So if I want a condition to represent set(animals[features]) > {'beak'}:

more_than_beak = set_comp_search(animals, {'flight', 'beak', 'feathers'},
                                 'gt', {'beak'})
# Converts to: (animals.beak) & (animals.flight | animals.feathers)
print(animals[more_than_beak])

          name  legs  flight  beak  feathers
0      ostrich     2   False  True      True
1       parrot     2    True  True      True

# Correctly omits the platypus

Clunkiness aside, this runs sufficiently quickly. But I feel like I must be reinventing a wheel. This seems like a roughly similar use case to what the Series.str methods handle, though it would need to operate on a dataframe, a sequence of series, or a numpy array, rather than a single series. (Sadly, there is no DataFrame.set module.)
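For what it's worth, the five hardcoded branches can be collapsed into generic elementwise comparisons on a boolean mask. This is a sketch of the same idea in plain numpy, not a Pandas built-in, and the name set_comp_search_np is my own:

import numpy as np

def set_comp_search_np(df, column_names, comp, values):
    cols = list(column_names)                       # fix a column order
    mask = df[cols].to_numpy(dtype=bool)            # one boolean row per animal
    target = np.array([c in values for c in cols])  # the set to compare against

    # Elementwise >= and <= on booleans encode superset and subset tests.
    ge = (mask >= target).all(axis=1)  # True where the row's set is a superset of values
    le = (mask <= target).all(axis=1)  # True where the row's set is a subset of values
    eq = ge & le
    results = {'ge': ge, 'le': le, 'eq': eq, 'gt': ge & ~eq, 'lt': le & ~eq}
    return pd.Series(results[comp], index=df.index)

This stays vectorized at the C level, but it is still a reimplementation of set semantics rather than something Pandas provides directly.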

So my question is: Does Pandas provide a vectorized method for set-like comparison against collections of boolean columns?

(I've also looked at this question, since it sounds similar, but it's not applicable to set-like behavior.)

CrazyChucky
  • I’m voting to close this question because this type of open-ended, opinion-based question is off-topic for Stack Overflow. [Code Review](https://codereview.stackexchange.com/) is probably a better venue for this [type](https://codereview.stackexchange.com/help/on-topic) of question. Please review [help/on-topic] for Stack Overflow questions. – Trenton McKinney Dec 23 '20 at 20:06
  • @TrentonMcKinney That page you linked mentions specifically that Code Review questions have to be about real code from a project. This code runs and works, but it's clearly a simplification of what I'm actually doing. I'm not sure if it would be applicable there. – CrazyChucky Dec 23 '20 at 20:12
  • @TrentonMcKinney I've also edited to clarify that while I can code my *own* implementation, I'm really trying to see if *Pandas itself* provides some sort of method/syntax/technique for this type of comparison. Thank you for helping me notice my vagueness. – CrazyChucky Dec 23 '20 at 21:34
  • @CrazyChucky I think you can for sure simplify the `set_comp_search` function. No need to use `reduce` and `operator`. – Shubham Sharma Dec 26 '20 at 16:22
  • @ShubhamSharma Good call! I've edited it to use Pandas' `any`/`all` methods instead. Probably more efficient that way. – CrazyChucky Dec 26 '20 at 16:37

1 Answer


It seems to me like you might benefit from a function that is vectorized with numpy. Here is an example of such a function, the np.vectorize call that wraps it, and its application:

import numpy as np

def analyze_birds(name: str, legs: int, feathers: bool):
    if feathers and legs == 2:
        return name + "-Feathered Biped"
    if legs > 2:
        return name + "-Quadruped"
    return name + "-Other"  # fallback so no row comes back as None

vector_analyze_birds = np.vectorize(analyze_birds)

animals['Analysis'] = vector_analyze_birds(animals['name'], animals['legs'], animals['feathers'])

Output: each row gains an Analysis label, e.g. "ostrich-Feathered Biped" for the ostrich and "platypus-Quadruped" for the platypus.

git_rekt
  • Thank you for the suggestion—I was not aware of Numpy's `vectorize` method. However, its [documentation](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html) says it is "provided primarily for convenience, not for performance. The implementation is essentially a for loop." It still has to iterate at the Python level, calling the function repeatedly—much like my inefficient `apply` attempt—rather than iterating at the C level, like Numpy's and Pandas' own vectorized methods. – CrazyChucky Mar 19 '21 at 16:15
  • That is all true! I think it is worth mentioning that in my own practice, transforming GBs and sometimes TBs of data files mostly using numpy vectorize, my longest function takes about 60-90 seconds. I would be interested to learn why that is too slow for your application. In some cases I am iterating over 3-4 columns and multiple dataframes too, cheers! – git_rekt Mar 19 '21 at 17:00
  • 60-90 seconds would be a long time to wait for a search engine query—ideally the results are essentially instant. This set comparison cares about five columns, and the data I'm searching is about 40,000 rows. (A given text query could include *multiple* set conditions, too: not these columns, at least those columns, etc.) – CrazyChucky Mar 19 '21 at 17:17