4

Suppose I have the following dataframe:

df = pd.DataFrame({'color':['red', 'green', 'blue'], 'brand':['Ford','fiat', 'opel'], 'year':[2016,2016,2017]})

        brand   color   year
0       Ford    red     2016
1       fiat    green   2016
2       opel    blue    2017

I know that to select using multiple columns I can do something like:

new_df = df[(df['color']=='red')&(df['year']==2016)]

Now what I would like to do is find a way to use a dictionary to select the rows I want where the keys of the dictionary represent the columns mapping to the allowed values. For example applying the following dictionary {'color':'red', 'year':2016} on df would yield the same result as new_df.

I can already do it with a for loop, but I'd like to know if there are any faster and/or more 'pythonic' ways of doing it!

Please include time taken of method.

Ludo
  • 2,307
  • 2
  • 27
  • 58
  • Re: edit, OP I recommend you try our solutions on your data and check which is faster. It depends on the data, really. Btw the query solution will not work if your column names cannot be turned into valid identifier names in python. Edit: OK, well, I added an alternative that will. – cs95 Dec 23 '18 at 14:08
  • @coldspeed Ok, thanks for the suggestion. I'll give it a try later today and choose an answer accordingly. – Ludo Dec 23 '18 at 14:12

2 Answers2

2

Yes, there is! You can build a query string using a simple list comprehension, and pass the string to query for dynamic evaluation.

query = ' and '.join([f'{k} == {repr(v)}' for k, v in m.items()]) 
# query = ' and '.join(['{} == {}'.format(k, repr(v)) for k, v in m.items()]) 
new_df = df.query(query)

print(query)
# "color == 'red' and year == 2016"

print(new_df)
  color brand  year
0   red  Ford  2016

For more on query (and eval), see my post here: Dynamic Expression Evaluation in pandas using pd.eval()


For better performance, AND handling column names with spaces, etc, use logical_and.reduce:

df[np.logical_and.reduce([df[k] == v for k,v in m.items()])] 

  color brand  year
0   red  Ford  2016
cs95
  • 379,657
  • 97
  • 704
  • 746
2

With single expression:

In [728]: df = pd.DataFrame({'color':['red', 'green', 'blue'], 'brand':['Ford','fiat', 'opel'], 'year':[2016,2016,2017]})

In [729]: d = {'color':'red', 'year':2016}

In [730]: df.loc[np.all(df[list(d)] == pd.Series(d), axis=1)]
Out[730]: 
  brand color  year
0  Ford   red  2016
RomanPerekhrest
  • 88,541
  • 4
  • 65
  • 105