9

I'm grouping a dataframe by multiple columns and aggregating to obtain multiple statistics. How to obtain a totally flat structure with each possible combination of group-keys enumerated as rows and each statistic present as columns?

import numpy as np
import pandas as pd

cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']

data = pd.DataFrame({
        'city': np.random.choice(cities, 12),
        'day': np.random.choice(days, 12),
        'people': np.random.normal(loc=10, size=12),
        'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg([np.mean, np.std])

This way I'm getting:

                   cats               people          
                   mean       std       mean       std
city   day                                            
Berlin Friday  6.146924  0.721263  10.445606  0.730992
       Monday  5.239267       NaN   9.022811       NaN
Oslo   Friday  6.322276  0.866899  11.579813  0.114341
       Monday  5.028919  0.815674  10.458439  1.182689

I need to get it flat:

city   day     cats_mean cats_std  people_mean people_std                                       
Berlin Friday  6.146924  0.721263  10.445606   0.730992
Berlin Monday  5.239267       NaN   9.022811        NaN
Oslo   Friday  6.322276  0.866899  11.579813   0.114341
Oslo   Monday  5.028919  0.815674  10.458439   1.182689
adam.ra
  • 1,068
  • 1
  • 10
  • 16

3 Answers3

13
In [36]: grouped.columns = grouped.columns.map('_'.join)

In [37]: grouped = grouped.reset_index()

In [38]: grouped
Out[38]:
     city     day  cats_mean  cats_std  people_mean  people_std
0  Berlin  Friday   5.852991  1.085163    11.078541    0.839688
1  Berlin  Monday   6.978343  0.630983     9.876106    1.846204
2    Oslo  Friday   6.096773  1.278176     9.710216    0.691672
MaxU - stand with Ukraine
  • 205,989
  • 36
  • 386
  • 419
3

Update using pd.NamedAgg

import numpy as np
import pandas as pd

cities = ['Berlin', 'Oslo']
days = ['Monday', 'Friday']

data = pd.DataFrame({
        'city': np.random.choice(cities, 12),
        'day': np.random.choice(days, 12),
        'people': np.random.normal(loc=10, size=12),
        'cats': np.random.normal(loc=6, size=12)})
grouped = data.groupby(['city', 'day']).agg(cats_mean=('cats', np.mean),
                                            cats_std=('cats', np.std),
                                            people_mean=('people', np.mean),
                                            people_std=('people', np.std))

grouped.reset_index()

Nest renaming is deprecated

You can use a dictionary in .agg to rename your columns then drop column level and reset_index(): See this SO Post

import numpy as np import pandas as pd cities = ['Berlin', 'Oslo'] days = ['Monday', 'Friday'] data = pd.DataFrame({ 'city': np.random.choice(cities, 12), 'day': np.random.choice(days, 12), 'people': np.random.normal(loc=10, size=12), 'cats': np.random.normal(loc=6, size=12)}) grouped = data.groupby(['city', 'day']).agg({'cats':{'cats_mean':np.mean,'cats_std':np.std},'people':{'people_mean':np.mean,'people_std':np.std}}) grouped.columns = grouped.columns.droplevel() grouped.reset_index() ---------------

Output:

     city     day  people_mean  people_std  cats_std  cats_mean
0  Berlin  Friday     9.645190    0.699684  0.973866   6.478510
1  Berlin  Monday     9.556898    0.126810  0.336654   6.624288
2    Oslo  Friday    11.593491         NaN       NaN   6.206595
3    Oslo  Monday    10.202183    1.058651  0.657939   6.019748
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 1
    Please be aware of `FutureWarning: using a dict with renaming is deprecated and will be removed in a future version`. Here is a [link](http://pandas.pydata.org/pandas-docs/version/0.20/whatsnew.html#deprecate-groupby-agg-with-a-dictionary-when-renaming) – MaxU - stand with Ukraine May 08 '17 at 14:02
2

You can perform a list comprehension on the column levels and join with an underscore and then call reset_index:

In [39]:    
grouped.columns= ['_'.join(x) for x in list(zip(grouped.columns.get_level_values(0), grouped.columns.get_level_values(1)))]
grouped = grouped.reset_index()
grouped

Out[39]:
     city     day  cats_mean  cats_std  people_mean  people_std
0  Berlin  Friday   6.140710  0.555981    10.187634    0.359724
1  Berlin  Monday   6.420175  0.986568    10.134376    0.963938
2    Oslo  Friday   6.978572  0.573297    11.345484    1.454762
3    Oslo  Monday   4.594814       NaN    10.842988         NaN
EdChum
  • 376,765
  • 198
  • 813
  • 562