1

Let's say I have a dataframe df with an arbitrary number of columns. As an example, say we have

   a    b    c
0  5    foo  2
1  5    bar  3
2  4    foo  2
3  5    test 1
4  4    bar  7

Suppose I want a filter like

df[(df['a'] == 5) & (~df['b'].isin(['foo','bar'])) & (df['c'].isin(range(5)))]

or maybe something like df[(df['a'] == 5) & (~df['b'].isin(['test','bar'])) | (df['c'].isin(range(5)))]

but I want something that can easily be plugged in as an input, something like:

def filter_df(filter_kwargs, df):
    # do the filtering here

I have an idea what to do with the == operator, but I'm perplexed how to do more complicated ones, like .isin and |. What's the best way to do this?

sbha
  • 9,802
  • 2
  • 74
  • 62
irene
  • 2,085
  • 1
  • 22
  • 36
  • 1
    I think you need to be more precise with an example. – B. Hel Oct 24 '18 at 05:08
  • @B.Hel I added an example above. Hopefully that helps. – irene Oct 24 '18 at 05:16
  • If I understand your problem your want to filter only by isin filters (==5 is equal to isin ([5]) ? – B. Hel Oct 24 '18 at 05:21
  • I am really sorry but I don't understand why do you want to do that. The best is to filter before your function and the filtering function is not really long. – B. Hel Oct 24 '18 at 05:24
  • @B.Hel I have a number of dataframes with different columns that have different filters (several combinations). It would be very cumbersome if I have to manually filter each one of them, so I was thinking of using a helper function that will take in a dictionary instead as an input for filtering. – irene Oct 24 '18 at 05:26
  • Instead of using an helper function, have you try to make your filtering by step? Like that you can easily control which one do you use. – B. Hel Oct 24 '18 at 05:44
  • Yeah I have. That's pretty easy. It's just that if there were a helper function for this, it would make the code a lot cleaner. – irene Oct 24 '18 at 06:12
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/182396/discussion-between-b-hel-and-irene). – B. Hel Oct 24 '18 at 06:51

3 Answers3

2

I have here three solutions. In my opinion the most elegant ones are the first two. The third one feels more like a 'hack' but could be used as an inspiration for something else.

import pandas as pd
df = pd.DataFrame({'a': [5,5,4,5,4], 'b': ['foo','bar','foo','test','bar'],'c': [2,3,2,1,7]})

Example 1 - same thing you provided but splitted and thus more readable:

mask_1 = (df['a'] == 5) & \
        (~df['b'].isin(['foo','bar'])) & \
        (df['c'].isin(range(5)))

print(df.loc[mask_1])

Example 2 - using lambda functions so it is standard to use because the conditions look like somewhere else (==, not in, in, >, <, etc...):

mask_2 = (df['a'].apply(lambda x: x == 5)) & \
        (df['b'].apply(lambda x: x not in ['foo', 'bar'])) & \
        (df['c'].apply(lambda x: x in range(5)))

print(df.loc[mask_2])

Example 3 - inspired on B. Hel's answer and more general

def filter_df(filter_kwargs, df):
    l = len(filter_kwargs)

    for i, cond in enumerate(filter_kwargs):
        eval_cond = df[cond[0]].apply(lambda x: eval("x " + cond[1]))

        if i == 0:
            mask = eval_cond

        elif i+1 == l:
            break

        else:
            mask = eval('mask' + filter_kwargs[i-1][2] + 'eval_cond')

    return df.loc[mask]

# Format for each condition [[column_name, condition, AND_OR],...]
filter_kwargs = [['a', '==5', '&'],['b', 'not in ["foo", "bar"]','&'], ['c', 'in range(5)','|']]    
print(filter_df(filter_kwargs,df))
Carlos Azevedo
  • 660
  • 3
  • 13
  • Interesting approach! I wonder though how the `apply` function would work for larger dataframes. – irene Oct 24 '18 at 08:37
1

This is an idea of solution

import pandas as pd

df = pd.DataFrame({'a': [5,5,4,5,4], 'b': ['foo','bar','foo','test','bar'],'c': [2,3,2,1,7]})


def helper_function(df, *argv):
    x = True
    y = "and"
    for (i,arg) in enumerate(argv):
        if (i % 2 == 1):
            y = arg
        else:
            if (y == "and"):
                x = x & df[arg[0]].isin(arg[1])
            else:
                x = x | df[arg[0]].isin(arg[1])
    return df[x]

print(helper_function(df, ['a',[5]],"and",['b',['test','bar']],"and",['c',[0,1,2]]))
B. Hel
  • 170
  • 10
  • Thanks. This should work for my case, although it is limited to `isin` and so you need to know which values you want to filter out in advance. – irene Oct 24 '18 at 08:31
  • If you want, we can make more generalized. – B. Hel Oct 24 '18 at 08:36
  • Thanks. My main concern is how to do I use the `isin` function in a dictionary. Carlos above has a suggestion that uses an `apply` method instead of using pandas built-in methods, although I'm not sure if that would be as fast as using the built-in ones like `isin`. – irene Oct 24 '18 at 08:38
  • can you explain how you want to use isin function in a dictionary ? – B. Hel Oct 24 '18 at 08:47
  • Maybe something like this: {'b': [some_generic_fxn_that_could_be_isin, list_or_value]} – irene Oct 24 '18 at 09:26
  • the best is to use `getattr`. For exemple you can use `getattr(df[x],di[x][0])(di[x][1])` where `di = {'b': [some_generic_fxn_that_could_be_isin, list_or_value]}`. Do you want that I write your complete code? – B. Hel Oct 24 '18 at 10:51
  • Thank you @B. Hel! That would be great. – irene Oct 24 '18 at 11:58
1

Assume that you have this preamble

import pandas as pd
df = pd.DataFrame({'a': [5,5,4,5,4], 'b': ['foo','bar','foo','test','bar'],'c': [2,3,2,1,7]})

and this function

def helper_function(df,d):
    x = True
    for (i,k) in enumerate(d):
        y = getattr(df[k['key']],k['function'])(k['values'])
        if k['isnot']:
            y = getattr(getattr(y,'__ne__'),'__self__')
        if i == 0:
            x = y
        else:
            x = getattr(x,k['left_connector'])(y)
    return x

Now you can create a list of dictionnaries

di = [
    {
        'key': 'a',
        'function': 'isin',
        'isnot': False,
        'values': [5],
        'left_connector': "__and__"
    },
    {
        'key': 'b',
        'function': 'isin',
        'isnot': True,
        'values': ['test','bar'],
        'left_connector': "__and__"
    },
    {
        'key': 'c',
        'function': 'isin',
        'isnot': False,
        'values': [0,1,2,3],
        'left_connector': "__or__"
    },
]

and use this code to filter

df[helper_function(df,di)]

As you use only functions of pandas, you keep the performance of pandas.

B. Hel
  • 170
  • 10