11

I'm trying to select rows of a DataFrame based on a list of conditions that needs to be all satisfied. Those conditions are stored in a dictionary and are of the form {column: max-value}.

This is an example: dict = {'name': 4.0, 'sex': 0.0, 'city': 2, 'age': 3.0}

I need to select all DataFrame rows where the corresponding attribute is less than or equal to the corresponding value in the dictionary.

I know that for selecting rows based on two or more conditions I can write:

rows = df[(df[column1] <= dict[column1]) & (df[column2] <= dict[column2])]

My question is, how can I select rows that matches the conditions present in a dictionary in a Pythonic way? I tried this way,

keys = dict.keys() 
rows = df[(df[kk] <= dict[kk]) for kk in keys]

but it gives me an error = "[ expected" that doesn't disappear even putting the [ symbol.

Juan Carlos
  • 367
  • 2
  • 8
  • 16
  • 4
    Don't name your variable `dict` because `dict` is a builtin python dictionary constructor. – Abdou Aug 09 '17 at 12:48

3 Answers3

8

we can use DataFrame.query() method like this:

In [109]: dct = {'name': 4.0, 'sex': 0.0, 'city': 2, 'age': 3.0}

In [110]: qry = ' and '.join(['{} <= {}'.format(k,v) for k,v in dct.items()])

In [111]: qry
Out[111]: 'name <= 4.0 and sex <= 0.0 and city <= 2 and age <= 3.0'

In [112]: df.query(qry)
...
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
3

You could take advantage of Pandas' automatic axis alignment. Given a DataFrame with columns ['age', 'city', 'name', 'sex'] and a Series with the same index, you can compare every entry in the DataFrame against the corresponding value in the Series using

In [29]: df < pd.Series(dct)
Out[29]: 
      age   city   name    sex
0   False  False  False  False
1   False  False  False  False
2    True  False  False  False
3   False   True  False  False
4    True   True   True  False
...

Then you can find the rows which are all True using

mask = (df <= pd.Series(dct)).all(axis=1)

and select those rows with df.loc[mask, :]. For example,

import numpy as np
import pandas as pd
np.random.seed(2017)
N = 300
df = pd.DataFrame({'name':np.random.randint(10, size=N),
                   'sex':np.random.randint(2, size=N),
                   'city':np.random.randint(10, size=N),
                   'age':np.random.randint(10, size=N)})
dct = {'name': 4.0, 'sex': 0.0, 'city': 2, 'age': 3.0}

mask = (df <= pd.Series(dct)).all(axis=1)
print(df.loc[mask, :])

yields

     age  city  name  sex
7      3     2     0    0
10     1     2     4    0
150    1     2     4    0
188    2     2     2    0
198    3     2     3    0
229    1     2     0    0
254    1     2     2    0
275    3     2     1    0
276    0     1     4    0
299    3     1     2    0
unutbu
  • 842,883
  • 184
  • 1,785
  • 1,677
  • Does anyone know how @MaxU and @ unutbu 's answers perform compared to each other? – srcerer Mar 17 '19 at 13:47
  • 2
    @srcerer: On my machine, especially for larger DataFrames, [MaxU's answer](https://stackoverflow.com/a/45590819/190597) (using `query`) is faster. Since it is empowering to know how to test this for yourself, you may want to take a look at [the timeit module](https://stackoverflow.com/q/8220801/190597) or for more convenience, [IPython's %timeit magic function](https://stackoverflow.com/q/29280470/190597). – unutbu Mar 17 '19 at 14:45
  • Thanks, I was just too lazy to test for myself. I like that your method doesn't require generating a string. – srcerer Mar 17 '19 at 15:07
0

You can also try:

import pandas as pd
import numpy as np


N = 300

df = pd.DataFrame({'name':np.random.randint(10, size=N),
                   'sex':np.random.randint(2, size=N),
                   'city':np.random.randint(10, size=N),
                   'age':np.random.randint(10, size=N)})

dct = {'name': 4.0, 'sex': 0.0, 'city': 2, 'age': 3.0}

df.loc[np.prod([df[k] <= v for k,v in dct.items()],axis=0).astype(bool),:]

#      age  city  name  sex
# 7      3     2     0    0
# 10     1     2     4    0
# 150    1     2     4    0
# 188    2     2     2    0
# 198    3     2     3    0
# 229    1     2     0    0
# 254    1     2     2    0
# 275    3     2     1    0
# 276    0     1     4    0
# 299    3     1     2    0
Abdou
  • 12,931
  • 4
  • 39
  • 42