1

Apologies for the confusing title. It is explained better below. I currently have a pandas data frame that looks something like this:

user_id year grade_id
1       2005 47
1       2003 70
1       2004 70
2       2011 50
2       2003 43
2       2009 60

I want to group by the user_id and return the minimum year value based on the max value of grade_id. So the output for the above data frame would look like so:

user_id year grade_id
1       2003 70
2       2009 60

Is there a simple/elegant way to do this? I have tried things like the following:

tmp_df = df.groupby(["user_id", "year"])["grade_id"].agg(np.max)

However, this does not return the correct year. I have already checked out a few StackOverflow posts but none seem to be the same issue. Any help would be much appreciated.

Aido
  • 150
  • 4
  • 9

3 Answers3

1

Given the information this should work for you! I couldn't find an 'easy' way to solve it, but this generates the expected output!

import pandas as pd
data = {'user_id':[1,1,1,2,2,2],'year':[2005,2003,2004,2011,2003,2009],'grade_id':[47,70,70,50,43,60]}
df = pd.DataFrame(data)
print(df)
final_df = df.groupby('user_id').agg({'grade_id':'max'}).reset_index().merge(df[['grade_id','year']],how='left').sort_values('year').drop_duplicates('user_id')
print(final_df)

Output:

   user_id  grade_id  year
0        1        70  2003
2        2        60  2009
Celius Stingher
  • 17,835
  • 6
  • 23
  • 53
1

This answer will work well on large data frames.

df.sort_values('grade_id', ascending=False).drop_duplicates(['user_id'])

Hope this helps.

  • 1
    This will not work for all cases, the default option for `keep` in `drop_duplicates` is `first`. If by any chance, the `grade_id` has two years, and the first one is not the minimum, then it will fail to work. – Celius Stingher Dec 11 '19 at 20:38
  • Good point, I checked to see if sort_values can sort by two columns and it can, so I used this: https://stackoverflow.com/a/17141755/9111394 – Aido Dec 17 '19 at 14:11
0

This might work with your example

df.groupby(['user_id', 'year']).grade_id.max().groupby(level=0).head(1).reset_index(name='grade_id')
ATL
  • 521
  • 3
  • 8