40

I am trying to filter a pandas data frame using thresholds for three columns

import pandas as pd
df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
df = df.loc[(df.A > 0) & (df.B > 2) & (df.C > -1)].reset_index(drop = True)

df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

However, I want to do this inside a function where the names of the columns and their thresholds are given to me in a dictionary. Here's my first try that works ok. Essentially I am putting the filter inside cond variable and just run it:

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = "df = df.loc["
for key in limits_dic.keys():
    cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
cond = cond[:-2] + "].reset_index(drop = True)"
exec(cond)
df
    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

Now, finally I put everything inside a function and it stops working (perhaps exec function does not like to be used inside a function!):

df = pd.DataFrame({"A" : [6, 2, 10, -5, 3],
                   "B" : [2, 5, 3, 2, 6],
                   "C" : [-5, 2, 1, 8, 2]})
limits_dic = {"A" : 0, "B" : 2, "C" : -1}
def filtering(df, limits_dic):
    cond = "df = df.loc["
    for key in limits_dic.keys():
        cond += "(df." + key + " > " + str(limits_dic[key])+ ") & "
    cond = cond[:-2] + "].reset_index(drop = True)"
    exec(cond)
    return(df)

df = filtering(df, limits_dic)
df
    A  B  C
0   6  2 -5
1   2  5  2
2  10  3  1
3  -5  2  8
4   3  6  2

I know that exec function acts differently when used inside a function but was not sure how to address the problem. Also, I am wondering there must be a more elegant way to define a function to do the filtering given two input: 1)df and 2)limits_dic = {"A" : 0, "B" : 2, "C" : -1}. I would appreciate any thoughts on this.

cs95
  • 379,657
  • 97
  • 704
  • 746
ahoosh
  • 1,340
  • 3
  • 17
  • 31
  • if you change the name of the result (`cond = "df2 = df.loc["` and `return(locals()['df2'])`) it works. i tried to add dicts to `exec` to no avail – bobrobbob May 21 '18 at 02:10
  • For more information on the `pd.eval()` family of functions, their features and use cases, please visit [Dynamic Expression Evaluation in pandas using pd.eval()](https://stackoverflow.com/questions/53779986/dynamic-expression-evaluation-in-pandas-using-pd-eval). – cs95 Dec 16 '18 at 04:55

4 Answers4

77

If you're trying to build a dynamic query, there are easier ways. Here's one using a list comprehension and str.join:

query = ' & '.join(['{}>{}'.format(k, v) for k, v in limits_dic.items()])

Or, using f-strings with python-3.6+,

query = ' & '.join([f'{k}>{v}' for k, v in limits_dic.items()])

print(query)

'A>0 & C>-1 & B>2'

Pass the query string to df.query, it's meant for this very purpose:

out = df.query(query)
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2

What if my column names have whitespace, or other weird characters?

From pandas 0.25, you can wrap your column name in backticks so this works:

query = ' & '.join([f'`{k}`>{v}' for k, v in limits_dic.items()])

See this Stack Overflow post for more.


You could also use df.eval if you want to obtain a boolean mask for your query, and then indexing becomes straightforward after that:

mask = df.eval(query)
print(mask)

0    False
1     True
2     True
3    False
4     True
dtype: bool

out = df[mask]
print(out)

    A  B  C
1   2  5  2
2  10  3  1
4   3  6  2

String Data

If you need to query columns that use string data, the code above will need a slight modification.

Consider (data from this answer):

df = pd.DataFrame({'gender':list('MMMFFF'),
                   'height':[4,5,4,5,5,4],
                   'age':[70,80,90,40,2,3]})

print (df)
  gender  height  age
0      M       4   70
1      M       5   80
2      M       4   90
3      F       5   40
4      F       5    2
5      F       4    3

And a list of columns, operators, and values:

column = ['height', 'age', 'gender']
equal = ['>', '>', '==']
condition = [1.68, 20, 'F']

The appropriate modification here is:

query = ' & '.join(f'{i} {j} {repr(k)}' for i, j, k in zip(column, equal, condition))
df.query(query)

   age gender  height
3   40      F       5

For information on the pd.eval() family of functions, their features and use cases, please visit Dynamic Expression Evaluation in pandas using pd.eval().

cs95
  • 379,657
  • 97
  • 704
  • 746
  • 1
    In f-strings you can use the shorthand {k!r} above, rather than {repr(k)}...helps in long expressions like above. – The Aelfinn Oct 02 '18 at 17:11
  • how to handle if there is multiple value for the same column – Abhis Nov 25 '19 at 05:39
  • @Abhis What is that supposed to look like? – cs95 Nov 25 '19 at 08:43
  • @cs95 What if my column name itself has some operators such as `C > D` and I want to compare two such columns. Should I be adding double quotes around each column name, and entire query in single quote? – MasayoMusic Jun 19 '20 at 19:44
  • @MasayoMusic I think you can add backticks to the column name like `\`C > D\`` from pandas 0.25 – cs95 Jun 19 '20 at 20:16
  • OK great, I've updated the answer anyway, I assume more folks would have the same question. Appreciate the comment. – cs95 Jun 19 '20 at 20:21
  • 1
    That part about using a 'mask' answered a question I've had about how to combine 'query' with the selection of a subset of columns, when using 'loc' to avoid chained indexing. Thank you! – etotheipi Jan 19 '21 at 04:44
6

An alternative to @coldspeed 's version:

conditions = None
for key, val in limit_dic.items():
    cond = df[key] > val
    if conditions is None:
        conditions = cond
    else:
        conditions = conditions & cond
print(df[conditions])
Victor Yan
  • 3,339
  • 2
  • 28
  • 28
  • Thanks for this. I couldn't find a way to make the accepted answer work joining together `isin` conditions that referenced a python list defined in my code. – kev8484 Aug 23 '19 at 15:17
3

An alternative to both posted, that may or may not be more pythonic:

import pandas as pd
import operator
from functools import reduce

df = pd.DataFrame({"A": [6, 2, 10, -5, 3],
                   "B": [2, 5, 3, 2, 6],
                   "C": [-5, 2, 1, 8, 2]})

limits_dic = {"A": 0, "B": 2, "C": -1}

# equiv to [df['A'] > 0, df['B'] > 2 ...]
loc_elements = [df[key] > val for key, val in limits_dic.items()]

df = df.loc[reduce(operator.and_, loc_elements)]
zr0gravity7
  • 2,917
  • 1
  • 12
  • 33
yvesva
  • 711
  • 7
  • 10
1

How I do this without creating a string and df.query:

limits_dic = {"A" : 0, "B" : 2, "C" : -1}
cond = None

# Build the conjunction one clause at a time 
for key, val in limits_dic.items():
    if cond is None:
        cond = df[key] > val
    else:
        cond = cond & (df[key] > val)

df.loc[cond]

    A  B  C
0   2  5  2
1  10  3  1
2   3  6  2

Note the hard coded (>, &) operators (since I wanted to follow your example exactly).

rodrigo-silveira
  • 12,607
  • 11
  • 69
  • 123