2

I want to select rows/groups that with year 2000 and 2001 and need to have both year 2000 and 2001 in a data set like below:

ID,year,age
810006862,2000,49
810006862,2001,
810006862,2002,
810006862,2003,52
810023112,2000,27
810023112,2004,28
810023112,2005,29
810023112,2006,30
810033622,2000,24
810033622,2001,25

I've tried the following codes, but both came back with empty data frame.

df1411 = df.groupby('ID').filter(lambda x: set(x['year']) == {'2000', '2001'})

df[df.groupby('ID')['year'].transform(lambda x: set(x.values.tolist()) == {'2000','2001'})]

These two codes below pick up one extra group that has only year 2000. I'd like to have group with both year 2000 and 2001.

df[df['year'].isin({2000, 2001})]
df.loc[df.year.isin(['2000', '2001'])]

The result I want is somewhat like this:

ID,year,age
810006862,2000,49
810006862,2001,
810033622,2000,24
810033622,2001,25
ohlav
  • 45
  • 6
  • Possible duplicate of [Select rows from a DataFrame based on values in a column in pandas](https://stackoverflow.com/questions/17071871/select-rows-from-a-dataframe-based-on-values-in-a-column-in-pandas) – Erfan Oct 03 '19 at 12:10
  • Thanks for the link. I'll read through the link and hopefully find my solution. – ohlav Oct 03 '19 at 12:27

2 Answers2

1

Have you tried -

df.loc[df.year.isin([2000, 2001])

The addition in your comment is a different question - but you can do this:

#Create a pivoted dataframe
dfp = df.pivot(index="ID", columns="year", values="year")
#Create a column in this pivoted dataframe to indicate whether both years are present
dfp["has_both_years"] = dfp.apply(lambda x: x["2000"] and x["2001"], axis=1)
#Select where True
dfp = dfp.loc[dfp.has_both_years.notnull()]
#Subset your original DataFrame
dfs = df.loc[df.ID.isin(dfp.index)]

#          ID  year age
#0  810006862  2000  49
#1  810006862  2001    
#2  810006862  2002    
#3  810006862  2003  52
#8  810033622  2000  24
#9  810033622  2001  25
Mortz
  • 4,654
  • 1
  • 19
  • 35
  • Thank you. Not yet, let me have a try! – ohlav Oct 03 '19 at 12:01
  • It worked perfectly! Thanks a lot. Meanwhile, do you know what my codes are wrong? – ohlav Oct 03 '19 at 12:05
  • Can I ask you a further question on this? What if I want to select a group that contains both year 2000 and 2001 and only year 2000 and 2001, how can it be modified? because I notice that if there is a group that has one row value with year 2000, it will also be selected. For my purpose, I would like to select only groups that have both year 2000 and 2001. – ohlav Oct 03 '19 at 12:20
  • Thank you. Mortz. Is it possible to only select rows with year 2000 and 2001 in your solution? Sorry I'm very new to Python, not sure how to modify your code to fit my purpose. :) – ohlav Oct 03 '19 at 12:42
  • Repeat step-1 : `dfs.loc[dfs.year.isin([2000, 2001])` – Mortz Oct 03 '19 at 12:45
1

You can use:

df[df['year'].isin({2000, 2001})]

If you want to select groups that have both year 2000 and 2001, not just one, as you have indicated in the comment, you could use:

years = {2000, 2001}
df2 = df.groupby('ID').filter(lambda x: years.issubset(x['year']))
df2[df2['year'].isin(years)]
#   ID          year    age
#0  810006862   2000    49.0
#1  810006862   2001    NaN
#8  810033622   2000    24.0
#9  810033622   2001    25.0
  • Thank you. It works fine. However, it catches the group/row has only year 2000 as well. I would like to select groups that have both year 2000 and 2001, not just one. Any idea? – ohlav Oct 03 '19 at 12:26
  • Hi again. It worked perfectly! Thanks a lot!! By the way, do you know what's wrong with my code? – ohlav Oct 03 '19 at 12:49
  • When you group based on 'ID', each group contains various years. For example, for your example data, groupby returns groups of {2000,20001,2003} for ID=810006862, {2000,2004,2005,2006} for ID=810023112, and {2000,2001} for ID=810033622. So, equality checking (set(x['year']) == {'2000', '2001'}) returned False. Instead, you need to perform subset checking by issubset(). This selects all groups that have 2000 and 2001 in year. Then, you exclude those rows that contain year other than 2000 and 2001 by df2[df2['year'].isin(years)]. – Ali A. Noroozi Oct 03 '19 at 12:57
  • Well explained! Now I see the logic. Thank you very much for your help! – ohlav Oct 03 '19 at 13:01