0

I have a pandas dataframe "df" having columns[Country,Region,Happiness Score,Year].

There are total 165 countries in df having data for 3 years(2015,2016,2017), therefore length of df is 165*3=495.

There are total 10 unique regions in df and each country has its region name in 'Region' column.

I want to produce an output showing country having minimum and maximum happiness score for each unique region.

For example, output should have columns like [Region,Min Happiness Score,Max Happiness Score,Country having min Happiness Score for that region,Country having max Happiness Score for that region].

Since there are 10 unique regions, it is obvious that output should also have length 10. Solutions using pandas functions and pandasql both are welcome but I would prefer using solutions with pandas functions.

Mayank Porwal
  • 33,470
  • 8
  • 37
  • 58
Osho
  • 31
  • 4
  • Hi Osho. Please tell us a specific problem where you got stuck. Please don't expect people to write the whole code for you. – Mayank Porwal May 10 '20 at 16:35
  • 1
    In addition to describing the data with words [show us the data](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – timgeb May 10 '20 at 16:35
  • @MayankPorwal actually the whole problem is longer than this, but I asked only the part on which I got stuck. After researching I found that I can get min and max happiness score for the region by using df.groupby('Region')['Happiness Score'].transform('min') or ('max') but was not able to get countries for that region. – Osho May 10 '20 at 18:21
  • @timgeb The actual dataset contains many more columns, so i thought it would be best to mention only the columns on which I need to work to avoid confusion. – Osho May 10 '20 at 18:23
  • @Osho you are not supposed to show your whole dataset. See the link I gave you. – timgeb May 10 '20 at 18:25

1 Answers1

1

This will be close to what you are after - group by the region and find min and max:

mins = df.groupby('Region')['Happiness Score'].min()
maxs = df.groupby('Region')['Happiness Score'].max()
df2=pd.concat([mins,maxs],axis=1)
df2.columns=(['Min Happiness Score','Max Happiness Score'])
df2['Region']=mins.index

edited after the good comment by Q. Hoang - you can do this somewhat more elegantly with

df.groupby(['Country','Region'])['Happiness Score'].agg(['min','max'])

which should get you the country, region, and min/max scores per region. The takehome for me is a lot of pandas functions accept list arguments.

jeremy_rutman
  • 3,552
  • 4
  • 28
  • 47
  • `df.groupby('Region')['Happiness Score'].agg(['min','max'])` – Quang Hoang May 10 '20 at 17:18
  • How to get the corresponding countries for the regions? – Osho May 10 '20 at 18:29
  • @QuangHoang Unfortunately, it is not the correct output. I want 'region','min score','max score','country having that min score', 'country having that max score'. And this output should have 10 columns(according to every region). What you suggested gives 165 columns which means it shows every 'country','region of that country','min score of that country','max score of that country'. I tried different permutations of what you suggested also adding 'year' in the groupby but none of them is giving the desired output. – Osho May 12 '20 at 09:16