
I am dealing with pandas DataFrames that follow a uniform labelling schema for their columns, but have an arbitrary number of columns.

For example, consider the following df with columns col_names, a special subset of those columns, and filter criteria corresponding to the special columns. They're linked by index, or with a dictionary; take your pick.

col_names = ['col1','col2','col3','xyz.1','xyz.2','xyz.3']
relevant_col_names = ['xyz.1', 'xyz.2', 'xyz.3']
filter_criteria = [something1, something2, something3]

I want to get the subset of the dataframe where 'xyz.1' == something1 & 'xyz.2' == something2 & 'xyz.3' == something3

I would normally do this by:

whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3)]

The issue is I need to be able to write that expression for an arbitrary number of 'xyz' columns without having to manually code the above expression.

For example, if the df happens to have 5 relevant columns like this:

['col1','col2','col3','xyz.1','xyz.2','xyz.3','xyz.4','xyz.5']

Then I would need to automatically write something like this:

whatIwant = df.loc[(df['xyz.1'] == something1) & (df['xyz.2'] == something2) & (df['xyz.3'] == something3) & (df['xyz.4'] == something4) & (df['xyz.5'] == something5)]

Is there a way to write a boolean expression of arbitrary length based on a list or a dictionary (or something else), where I'm ANDing everything inside of it?

I don't know how to phrase this question for Google. It seems related to list and dictionary comprehensions, expression synthesis, or something else. Even knowing the right way to phrase this question, how to tag it for Stack Overflow, and/or what I should google would be very helpful. Thanks!

  • Main problem seems to me the "something*"s. Are these single variables (bad) or items in a list (good) or something else? – Michael Butscher Sep 09 '21 at 20:59
  • You want all the filters to apply? Did you try looping to apply filters, applying each one to the filtered result from the previous step? – Karl Knechtel Sep 09 '21 at 21:16
  • @MichaelButscher yes, all the "somethingK"s are in a list indexed in the same way as the corresponding "xyz.K" component – cdrootsudormstardashr Sep 09 '21 at 21:40
  • @KarlKnechtel yeah I want all the filters to apply at once, hence the one long boolean expression. That's a clever idea, that makes sense. Just make a progressively smaller dataframe as I apply each filter (see the sketch after these comments). Though the big-O efficiency seems like it could be painful. But very practical. Thanks – cdrootsudormstardashr Sep 09 '21 at 21:45
  • *because* the dataframe gets progressively smaller, you don't have a big-O complexity problem. You're effectively doing the same short-circuiting. – Karl Knechtel Sep 09 '21 at 21:48
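
A minimal sketch of the looping approach described in the comments above, assuming the criteria live in a dict mapping each xyz column to its target value (the dataframe, dict, and values here are hypothetical):

import pandas as pd

# Hypothetical data and criteria, mirroring the example in the question.
df = pd.DataFrame({
    "col1": [10, 20, 30],
    "xyz.1": [4, 2, 7],
    "xyz.2": [8, 5, 5],
})
criteria = {"xyz.1": 2, "xyz.2": 5}

# Start from the full dataframe and shrink it with each filter.
filtered = df
for col, value in criteria.items():
    filtered = filtered[filtered[col] == value]

print(filtered)  # only the rows matching every criterion remain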

1 Answer


Assuming you have all the "somethings" in a list, you can do this:

import functools
import operator
import pandas as pd

df = pd.DataFrame.from_dict({
    "abc": (1, 2, 3),
    "xyz.1": (4, 2, 7),
    "xyz.2": (8, 5, 5)
})
targets = [2, 5]

# One boolean mask per target: df["xyz.1"] == 2, df["xyz.2"] == 5, ...
filtered_dfs = (df[f"xyz.{index + 1}"] == target for index, target in enumerate(targets))
# AND all the masks together, then select the matching rows.
filtered_df = df.loc[functools.reduce(operator.and_, filtered_dfs)]

print(filtered_df)

We construct each filter on the df in filtered_dfs (a generator of boolean masks, one per target) by doing df[f"xyz.{index + 1}"] == target. For the first iteration through targets, this will be df["xyz.1"] == 2. For the second iteration, it will be df["xyz.2"] == 5.

We then combine all these filters with functools.reduce(operator.and_, filtered_dfs), which is like doing filtered_df_1 & filtered_df_2 & ....
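
As a minimal illustration of what that reduce computes, with two hypothetical boolean masks:

import functools
import operator
import pandas as pd

# Two hypothetical boolean masks over a three-row dataframe.
mask_1 = pd.Series([True, True, False])
mask_2 = pd.Series([True, False, False])

# functools.reduce(operator.and_, [mask_1, mask_2]) is the same as mask_1 & mask_2.
combined = functools.reduce(operator.and_, [mask_1, mask_2])
print(combined.tolist())  # [True, False, False]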

We finally apply the filter to the dataframe through df.loc, which here gives the rows that have a 2 in xyz.1 and a 5 in xyz.2. Output is:

   abc  xyz.1  xyz.2
1    2      2      5
Mario Ishac
  • elegant! thanks. I wonder what the big O performance difference would be between your answer and Karl Knechtel's answer. I'm never sure how to think about performance when dealing with pandas operations. – cdrootsudormstardashr Sep 10 '21 at 15:55
  • @IsaacHowenstine Both Karl's solution and mine are `O(nk)` time complexity, because we iterate through the `k` rows of the original dataframe when applying `n` filters. These can be applied one at a time, as Karl suggested, or combined into a single filter before applying it (my solution). Both solutions are `O(k)` space complexity, since only 2 (a constant number) of dataframes / series are ever stored in memory at once, both having `k` rows. Karl would apply the filters one-by-one, whereas I used a generator followed by `functools.reduce`. – Mario Ishac Sep 10 '21 at 19:13
  • A good read for you is here: https://stackoverflow.com/questions/47789/generator-expressions-vs-list-comprehensions. Using a generator expression for `filtered_dfs` in my solution causes it to be `O(k)` space complexity, instead of `O(nk)` had I used a list comprehension. So even though I combine all filters into one before applying it, it doesn't mean all filters are stored in memory at once. Overall however, I wouldn't worry about any performance difference here unless you've profiled the program and made sure it is there. Just think of the above comparison as an interesting exercise. – Mario Ishac Sep 10 '21 at 19:17
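
To make the generator-versus-list point from that last comment concrete, here is a minimal sketch (the dataframe and targets are hypothetical, echoing the answer above):

import functools
import operator
import pandas as pd

df = pd.DataFrame({"xyz.1": [4, 2, 7], "xyz.2": [8, 5, 5]})
targets = [2, 5]

# List comprehension: all n masks are materialized at once (O(n*k) extra space).
masks_eager = [df[f"xyz.{i + 1}"] == t for i, t in enumerate(targets)]

# Generator expression: each mask is produced lazily, only as
# functools.reduce consumes it (O(k) extra space).
masks_lazy = (df[f"xyz.{i + 1}"] == t for i, t in enumerate(targets))

filtered_df = df.loc[functools.reduce(operator.and_, masks_lazy)]
print(filtered_df)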