Grouping by a column based on the minimum year value returned from the max value of another column

Question

Apologies for the confusing title. It is explained better below. I currently have a pandas data frame that looks something like this:

user_id year grade_id
1       2005 47
1       2003 70
1       2004 70
2       2011 50
2       2003 43
2       2009 60

I want to group by the user_id and return the minimum year value based on the max value of grade_id. So the output for the above data frame would look like so:

user_id year grade_id
1       2003 70
2       2009 60

Is there a simple/elegant way to do this? I have tried things like the following:

tmp_df = df.groupby(["user_id", "year"])["grade_id"].agg(np.max)

However, this does not return the correct year. I have already checked out a few StackOverflow posts but none seem to be the same issue. Any help would be much appreciated.

What is `ascent_year`? – Celius Stingher Dec 11 '19 at 15:39 — Celius Stingher, Dec 11 '19 at 15:39
That should be "year". Changed now! – Aido Dec 11 '19 at 15:42 — Aido, Dec 11 '19 at 15:42

score 1 · Answer 1 · answered Dec 11 '19 at 15:45

Given the information this should work for you! I couldn't find an 'easy' way to solve it, but this generates the expected output!

import pandas as pd
data = {'user_id':[1,1,1,2,2,2],'year':[2005,2003,2004,2011,2003,2009],'grade_id':[47,70,70,50,43,60]}
df = pd.DataFrame(data)
print(df)
final_df = df.groupby('user_id').agg({'grade_id':'max'}).reset_index().merge(df[['grade_id','year']],how='left').sort_values('year').drop_duplicates('user_id')
print(final_df)

Output:

   user_id  grade_id  year
0        1        70  2003
2        2        60  2009

score 1 · Accepted Answer · answered Dec 11 '19 at 18:33

1

This answer will work well on large data frames.

df.sort_values('grade_id', ascending=False).drop_duplicates(['user_id'])

Hope this helps.

answered Dec 11 '19 at 18:33

Michael Dowling

26
2

1

This will not work for all cases, the default option for `keep` in `drop_duplicates` is `first`. If by any chance, the `grade_id` has two years, and the first one is not the minimum, then it will fail to work. – Celius Stingher Dec 11 '19 at 20:38
Good point, I checked to see if sort_values can sort by two columns and it can, so I used this: https://stackoverflow.com/a/17141755/9111394 – Aido Dec 17 '19 at 14:11

score 0 · Answer 3 · answered Dec 11 '19 at 15:50

0

This might work with your example

df.groupby(['user_id', 'year']).grade_id.max().groupby(level=0).head(1).reset_index(name='grade_id')

answered Dec 11 '19 at 15:50

ATL

521
3
8

Grouping by a column based on the minimum year value returned from the max value of another column

3 Answers3