1

Question: How can I sort data by bins using groupby in pandas?

What I want is the following:

release_year listed_in

1920        Documentaries   
1930        TV Shows    
1940        TV Shows
1950        Classic Movies, Documentaries
1960        Documentaries
1970        Classic Movies, Documentaries
1980        Classic Movies, Documentaries
1990        Classic Movies, Documentaries
2000        Classic Movies, Documentaries
2010        Children & Family Movies, Classic Movies, Comedies
2020        Classic Movies, Dramas

To achieve this result I tried the following formula:

bins = [1925,1950,1960,1970,1990,2000,2010,2020]
groups = df.groupby(['listed_in', pd.cut(df.release_year, bins)])
groups.size().unstack()

It shows the following result:

release_year (1925,1950] (1950,1960] (1960,1970] (1970,1990] (1990,2000] (2000,2010] (2010, 2020] 
listed_in 
Action & Adventure 0 0 0 0 9 16 43
Action & Adventure, Anime Features, Children & Family Movies 0 0 0 0 0 0 1
Action & Adventure, Anime Features, Classic Movies 0 0 0 1 0 0 0
...

461 rows x 7 columns

I also tried the following formula:

df['release_year'] = df['release_year'].astype(str).str[0:2] + '0'
df.groupby('release_year')['listed_in'].apply(lambda x: x.mode().iloc[0])

The result was the following:


release_year 
190         Dramas 
200     Documentaries
Name: listed_in, dtype:object

Here is a sample of the dataset:

import pandas as pd
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',NaN,NaN],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'], 
'country':['United States, India, South Korea, China',
'United Kingdom','United States'], 
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})
Ska
  • 35
  • 5
  • wouldn't max of count give you the most frequent? – Joe Ferndz Jan 07 '21 at 22:33
  • 1
    Please include sample input data and expected output data (no images). Please see https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples on how you can do this, and I'm sure someone could answer your question more easily and effectively. Pro tip.... Use `df.groupby('release_year')['listed_in'].apply(lambda x: x.mode().iloc[0]).head(10).to_dict()` to get us the sample input data inside of `pd.DataFrame()`. Then, based off that sample data, manually create a table of the desired output and also include that in your question. Thank you! – David Erickson Jan 07 '21 at 22:40
  • @Ska Welcome to StackOverflow! Did I answer your question? If so, kindly click the checkmark next to the solution. Please see: https://meta.stackexchange.com/questions/5234/how-does-accepting-an-answer-work for more information. – David Erickson Jan 07 '21 at 23:29

1 Answers1

0

The simplest way to do this is use the first part of your code and simply make the last digit of the release_year a 0. Then you can .groupby decades and get the most popular genres for each decade i.e. the mode:

input:

import pandas as pd
import numpy as np
df = pd.DataFrame({
'show_id':['81145628','80117401','70234439'],
'type':['Movie','Movie','TV Show'],
'title':['Norm of the North: King Sized Adventure',
'Jandino: Whatever it Takes',
'Transformers Prime'],
'director':['Richard Finn, Tim Maltby',np.nan,np.nan],
'cast':['Alan Marriott, Andrew Toth, Brian Dobson',
'Jandino Asporaat','Peter Cullen, Sumalee Montano, Frank Welker'], 
'country':['United States, India, South Korea, China',
'United Kingdom','United States'], 
'date_added':['September 9, 2019',
'September 9, 2016',
'September 8, 2018'],
'release_year':['2019','2016','2013'],
'rating':['TV-PG','TV-MA','TV-Y7-FV'],
'duration':['90 min','94 min','1 Season'],
'listed_in':['Children & Family Movies, Comedies',
'Stand-Up Comedy','Kids TV'],
'description':['Before planning an awesome wedding for his',
'Jandino Asporaat riffs on the challenges of ra',
'With the help of three human allies, the Autob']})

code:

df['release_year'] = df['release_year'].astype(str).str[0:3] + '0'
df = df.groupby('release_year', as_index=False)['listed_in'].apply(lambda x: x.mode().iloc[0])
df

output:

    release_year  listed_in
0   2010          Children & Family Movies, Comedies
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • Thank you for your time and attention! However the problem was not solved. Please take a look at the edit. What do you suggest we try next? – Ska Jan 14 '21 at 14:41
  • @Ska please see my updated answer on the sample data that you have provided. IT appears to match your expected output.] – David Erickson Jan 14 '21 at 22:02