0

I have a pd.DataFrame

import pandas as pd
country = ['US', 'US', 'US', 'UK', 'UK', 'UK']
year = ['1990', '1991', '2020', '1990', '1991', '2020']
people = [20, 34, 456, 5, 7, 300]

df = pd.DataFrame(zip(country, year, people), columns = ['country', 'year', 'people'])
country year    people
0   US  1990    20
1   US  1991    34
2   US  2020    456
3   UK  1990    5
4   UK  1991    7
5   UK  2020    300

I wish to locate year '2020' and '1990'. I understand that this can be achieved with:

df.loc[(df.year == '2020') | (df.year == '1990')]

or

df.query('year == [\'2020\', \'1990\']')

to get the output:

country year    people
0   US  1990    20
2   US  2020    456
3   UK  1990    5
5   UK  2020    300

However, I'd like to perform this 'query' with the in operator. Trying:

df.loc[df['year'] in ['2020', '1990']]

which raises the error:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

I expect to use the in operator in my pandas.DataFrame subsetting, as it requires the least amount of typing.

What is the best way to mitigate this error raised by the in operator?

Carl C
  • 135
  • 8

2 Answers2

2
df[df['year'].isin(['1990','2020'])]

or

df.loc[df['year'].isin(['1990','2020'])]
wwnde
  • 26,119
  • 6
  • 18
  • 32
1

Use the .isin() function.

df.loc[df['year'].isin(['1990','2020'])]

You input a list of options that you want 'year' to be, and pandas will return a series of boolean values. That series will in turn be interpreted by the .loc() to return only the rows where df['year'] == the desired values.

Jhirschibar
  • 215
  • 1
  • 3
  • 16