How do use Pandas to do the equivalent (from SQL) of a first() and max() groupby?

Question

I have a table, in this case with a list of years and movie genres - and a corresponding 'popularity' score. For a given (unique) year, there are multiple scores for each genre.

I would like to get the genre with the highest popularity score for each year, i.e. just the top genre for a given year. In SQL this would be straightforward, I would simply use max() and first() as my aggregates. I'm trying to learn how to do the equivalent using Pandas, any help would be appreciated.

melted_genres looks like this:

    release_year    popularity  Genres
0   2015    32.985763   Action
1   2015    28.419936   Action
2   2015    13.112507   Adventure
3   2015    11.173104   Action
4   2015    9.335014    Action
5   2015    9.110700    Western
6   2015    8.654359    Science Fiction
7   2015    7.667400    Drama
8   2015    7.404165    Family
9   2015    6.326804    Comedy
10  2015    6.200282    Action
11  2015    6.189369    Science Fiction
12  2015    6.118847    Drama
13  2015    5.984995    Action
14  2015    5.944927    Action
15  2015    5.898400    Crime
16  2015    5.749758    Crime
17  2015    5.573184    Science Fiction
18  2015    5.556818    Romance
19  2015    5.476958    War
20  2015    5.462138    Action
21  2015    5.337064    Action
22  2015    4.907832    Action
23  2015    4.710402    Drama
24  2015    4.648046    Comedy
25  2015    4.566713    Action
26  2015    4.564549    Comedy
27  2015    4.503789    Crime
28  2015    4.062293    Drama
29  2015    3.968891    Action
... ... ... ...
59232   1966    0.310688    None
59233   1966    0.299911    Romance
59234   1966    0.239435    None
59235   1966    0.291704    None
59236   1966    0.151845    None
59237   1966    0.276133    None
59238   1966    0.102530    None
59239   1966    0.264925    None
59240   1966    0.252399    Action
59241   1966    0.236098    None
59242   1966    0.230873    None
59243   1966    0.212716    None
59244   1966    0.034555    None
59245   1966    0.207257    None
59246   1966    0.206537    None
59247   1966    0.202473    Mystery
59248   1966    0.342791    Foreign
59249   1966    0.227220    None
59250   1966    0.146402    None
59251   1966    0.141026    None
59252   1966    0.140934    None
59253   1966    0.131378    None
59254   1966    0.317824    None
59255   1966    0.089072    None
59256   1966    0.087034    None
59257   1966    0.080598    None
59258   1966    0.065543    None
59259   1966    0.065141    None
59260   1966    0.064317    None
59261   1966    0.035919    None

I have tried the following:

pop_genres = melted_genres.groupby(['release_year','Genres'])['popularity'].max()

Which results in:

release_year    Genres 
1960            Action              0.590724
                Adventure           0.700981
                Comedy              0.396000
                Crime               0.372163
                Drama               0.566305
                Family              0.465879
                Fantasy             0.712389
                Foreign             0.194948
                History             1.136943
                Horror              2.610362
                Music               0.423531
                Romance             0.947307
                Science Fiction     0.712389
                Thriller            2.610362
                War                 0.256779
                Western             1.872132
1961            Action              0.754548
                Adventure           2.631987
                Animation           2.631987
                Comedy              2.631987
                Crime               0.900245
                Drama               1.752027
                Family              2.631987
                Fantasy             0.113651
                Foreign             0.113651
                History             0.538364
                Horror              0.296403
                Music               0.900245
                Mystery             0.712793
                Romance             1.752027
...  
2015            Action              2.332774
                Adventure           3.360685
                Animation           1.195554
                Comedy              1.073178
                Crime               1.729104

Played with using aggregate() and other functions, but so far whatever I use drops the actual genre category value (i.e. I end up with two columns, rather than three)

Hi, thanks for the link @jesrael. I'm pretty new to python, and struggling to apply the examples in your linked resource. Any chance you might be able to open this up, in case someone is able to provide additional support? — Bilal Mansoor, Aug 14 '19 at 09:45
Hi, I have the answer to this, I'm not sure it's a duplicate - might be worth unmarking this for others to see. — Bilal Mansoor, Aug 15 '19 at 12:58

moys · Accepted Answer · 2019-08-15T12:42:01.740

0

Get the index of the max item in each group with idxmax and locate the whole row with df.loc

 df.loc[df.groupby(['release_year'])['popularity'].idxmax()].reset_index(drop=True)

edited Aug 15 '19 at 12:42

answered Aug 14 '19 at 09:02

moys

7,747
2
11
42

Thanks for this. I tried these, they continue to give me multiple rows for each year - the desired outcome would be to only have the top genre for each given year. Any chance you might be able to help? – Bilal Mansoor Aug 14 '19 at 09:47
is the data you posted here exactly how it is in a pandas dataframe? post it exactly as you have & also the result you are getting. – moys Aug 14 '19 at 11:36
Thanks for getting back. I'll paste a larger snippet of the dataset into the comment, and the result. – Bilal Mansoor Aug 15 '19 at 06:32
you do have all three columns in your result. Do you mean that for some rows the year is missing? Try the new code & see if that works for you. – moys Aug 15 '19 at 12:18
Thanks for the reply. To add a little clarification, I'm looking to get the top Genre by Year, which in SQL I could do by including both the popularity and genre columns and using the max() and first() respectively. I can't seem to achieve the same result with Pandas; either I get all the genres, or I get the top value without but lose the genre attribute. – Bilal Mansoor Aug 15 '19 at 12:23
ok. I think i now know what you are after. Try the updated code. – moys Aug 15 '19 at 12:42
Thanks so much, this did the trick! It will take a moment to understand this, but it does what I need. Cheers buddy! – Bilal Mansoor Aug 15 '19 at 12:56

How do use Pandas to do the equivalent (from SQL) of a first() and max() groupby?

1 Answers1