I have a table, in this case with a list of years and movie genres - and a corresponding 'popularity' score. For a given (unique) year, there are multiple scores for each genre.
I would like to get the genre with the highest popularity score for each year, i.e. just the top genre for a given year. In SQL this would be straightforward, I would simply use max()
and first()
as my aggregates. I'm trying to learn how to do the equivalent using Pandas, any help would be appreciated.
melted_genres
looks like this:
release_year popularity Genres
0 2015 32.985763 Action
1 2015 28.419936 Action
2 2015 13.112507 Adventure
3 2015 11.173104 Action
4 2015 9.335014 Action
5 2015 9.110700 Western
6 2015 8.654359 Science Fiction
7 2015 7.667400 Drama
8 2015 7.404165 Family
9 2015 6.326804 Comedy
10 2015 6.200282 Action
11 2015 6.189369 Science Fiction
12 2015 6.118847 Drama
13 2015 5.984995 Action
14 2015 5.944927 Action
15 2015 5.898400 Crime
16 2015 5.749758 Crime
17 2015 5.573184 Science Fiction
18 2015 5.556818 Romance
19 2015 5.476958 War
20 2015 5.462138 Action
21 2015 5.337064 Action
22 2015 4.907832 Action
23 2015 4.710402 Drama
24 2015 4.648046 Comedy
25 2015 4.566713 Action
26 2015 4.564549 Comedy
27 2015 4.503789 Crime
28 2015 4.062293 Drama
29 2015 3.968891 Action
... ... ... ...
59232 1966 0.310688 None
59233 1966 0.299911 Romance
59234 1966 0.239435 None
59235 1966 0.291704 None
59236 1966 0.151845 None
59237 1966 0.276133 None
59238 1966 0.102530 None
59239 1966 0.264925 None
59240 1966 0.252399 Action
59241 1966 0.236098 None
59242 1966 0.230873 None
59243 1966 0.212716 None
59244 1966 0.034555 None
59245 1966 0.207257 None
59246 1966 0.206537 None
59247 1966 0.202473 Mystery
59248 1966 0.342791 Foreign
59249 1966 0.227220 None
59250 1966 0.146402 None
59251 1966 0.141026 None
59252 1966 0.140934 None
59253 1966 0.131378 None
59254 1966 0.317824 None
59255 1966 0.089072 None
59256 1966 0.087034 None
59257 1966 0.080598 None
59258 1966 0.065543 None
59259 1966 0.065141 None
59260 1966 0.064317 None
59261 1966 0.035919 None
I have tried the following:
pop_genres = melted_genres.groupby(['release_year','Genres'])['popularity'].max()
Which results in:
release_year Genres
1960 Action 0.590724
Adventure 0.700981
Comedy 0.396000
Crime 0.372163
Drama 0.566305
Family 0.465879
Fantasy 0.712389
Foreign 0.194948
History 1.136943
Horror 2.610362
Music 0.423531
Romance 0.947307
Science Fiction 0.712389
Thriller 2.610362
War 0.256779
Western 1.872132
1961 Action 0.754548
Adventure 2.631987
Animation 2.631987
Comedy 2.631987
Crime 0.900245
Drama 1.752027
Family 2.631987
Fantasy 0.113651
Foreign 0.113651
History 0.538364
Horror 0.296403
Music 0.900245
Mystery 0.712793
Romance 1.752027
...
2015 Action 2.332774
Adventure 3.360685
Animation 1.195554
Comedy 1.073178
Crime 1.729104
Played with using aggregate()
and other functions, but so far whatever I use drops the actual genre category value (i.e. I end up with two columns, rather than three)