0

I have the following dataset enter image description here

and I want to find the most popular genres from year to year. So I was planning to do that by first grouping the rows by year and genre, then selecting the most repeated genre for each year.

I was able to group by year and genre then finding the counts using the following code x = df.groupby(['release_year','genres']).count()['id'] where id is just an arbitrary column I used to find the counts for each genre

And I get the following results

release_year  genres         
1960          Action               5
              Adventure            5
              Comedy               7
              Crime                2
              Drama               10
                                ... 
2015          Science Fiction     54
              TV Movie             6
              Thriller           103
              War                  6
              Western              4
Name: id, Length: 1000, dtype: int64

My problem is that I am unable to select the max for each year, can somebody help me do that?

iJezoul
  • 3
  • 4
  • `'release_year'` is now part of your MultiIndex. There are a few ways you could get the largest, depending on whether you want to get multiple values in the case for ties, but if you don't care about ties and are fine with one of tied largest in the case of ties you could use another groupby: `x = x.sort_values().groupby('release_year').tail(1)` So you sort the values so that the largest appears at the end, and then take the last row within each year group (e.g. the largest value since it was sorted to the end) – ALollz Jan 06 '22 at 17:22
  • Thank you so much! This works! But I would like to know how can I get multiple values in case of ties just for education. If you have time on your hands, can you explain that? – iJezoul Jan 06 '22 at 17:26
  • Yes, in that case, you could still use a `groupby` to broadcast the maximum value within each release_year to every row in that group. Then you would check where the value is equal to the maximum (which will be within group) to form a Boolean mask that you could use to subset the Series. So instead of the above code you could do : `x = x[x == x.groupby('release_year').transform('max')]`. You can see https://stackoverflow.com/questions/15705630/get-the-rows-which-have-the-max-value-in-groups-using-groupby for more information too – ALollz Jan 06 '22 at 17:32
  • 1
    Thank you so much! Can you just repost your comment as an answer so I can upvote it? and one final question, what would you think would be a good way to plot the data to show that over they years the drama genre has been dominant? I wanted to create a histogram for how many times a genre has been the max, but then that shows over the years and not from year to year – iJezoul Jan 06 '22 at 17:44

0 Answers0