Flexibly select pandas dataframe rows using dictionary

Question

Suppose I have the following dataframe:

df = pd.DataFrame({'color':['red', 'green', 'blue'], 'brand':['Ford','fiat', 'opel'], 'year':[2016,2016,2017]})

        brand   color   year
0       Ford    red     2016
1       fiat    green   2016
2       opel    blue    2017

I know that to select using multiple columns I can do something like:

new_df = df[(df['color']=='red')&(df['year']==2016)]

Now what I would like to do is find a way to use a dictionary to select the rows I want where the keys of the dictionary represent the columns mapping to the allowed values. For example applying the following dictionary {'color':'red', 'year':2016} on df would yield the same result as new_df.

I can already do it with a for loop, but I'd like to know if there are any faster and/or more 'pythonic' ways of doing it!

Please include time taken of method.

Re: edit, OP I recommend you try our solutions on your data and check which is faster. It depends on the data, really. Btw the query solution will not work if your column names cannot be turned into valid identifier names in python. Edit: OK, well, I added an alternative that will. — cs95, Dec 23 '18 at 14:08
@coldspeed Ok, thanks for the suggestion. I'll give it a try later today and choose an answer accordingly. — Ludo, Dec 23 '18 at 14:12

cs95 · Accepted Answer · 2018-12-23T14:10:38.250

Yes, there is! You can build a query string using a simple list comprehension, and pass the string to query for dynamic evaluation.

query = ' and '.join([f'{k} == {repr(v)}' for k, v in m.items()]) 
# query = ' and '.join(['{} == {}'.format(k, repr(v)) for k, v in m.items()]) 
new_df = df.query(query)

print(query)
# "color == 'red' and year == 2016"

print(new_df)
  color brand  year
0   red  Ford  2016

For more on query (and eval), see my post here: Dynamic Expression Evaluation in pandas using pd.eval()

For better performance, AND handling column names with spaces, etc, use logical_and.reduce:

df[np.logical_and.reduce([df[k] == v for k,v in m.items()])] 

  color brand  year
0   red  Ford  2016

RomanPerekhrest · Answer 2 · 2018-12-23T14:15:23.720

2

With single expression:

In [728]: df = pd.DataFrame({'color':['red', 'green', 'blue'], 'brand':['Ford','fiat', 'opel'], 'year':[2016,2016,2017]})

In [729]: d = {'color':'red', 'year':2016}

In [730]: df.loc[np.all(df[list(d)] == pd.Series(d), axis=1)]
Out[730]: 
  brand color  year
0  Ford   red  2016

edited Dec 23 '18 at 14:15

answered Dec 23 '18 at 14:03

RomanPerekhrest

88,541
4
65
105

hmmm, more flexible because working with columns with spaces in columns names – jezrael Dec 23 '18 at 14:05
Or use `merge` like in dupe. – jezrael Dec 23 '18 at 14:06
1

I can't imagine this being very performant, if at all. – cs95 Dec 23 '18 at 14:07
ya, for increase performance use `df.loc[ np.all(df[list(d)] == pd.Series(d)), axis=1)]` – jezrael Dec 23 '18 at 14:08
@jezrael, thanks, switched to `np.all` – RomanPerekhrest Dec 23 '18 at 14:15
@coldspeed - Yes, it is really data dependent what solution is faster, to much variables like number of rows, number of values of dict, number of matched rows. So I agree best test in real data – jezrael Dec 23 '18 at 14:17

Flexibly select pandas dataframe rows using dictionary

2 Answers2

Linked