10

I am trying to understand what the equivalent of this simple SQL statement would be:

select mykey, sum(Field1) as sum_of_field1, avg(Field1) as avg_field1, min(field2) as min_field2
from df
group by mykey

I understand I can passa a dictionary to the agg() function:

  f = {'Field1':'sum',
         'Field2':['max','mean'],
         'Field3':['min','mean','count'],
         'Field4':'count'
         }

    grouped = df.groupby('mykey').agg(f)

However, the resulting column names seem to be chosen by pandas automatically: ('Field1','sum') etc.

Is there a way to pass strings for column names, so that the field is not ('Field1','sum') but something I can choose, like sum_of_field1 ?

Thanks. I looked at the docs here: http://pandas.pydata.org/pandas-docs/stable/groupby.html but couldn't quite find an answer.

Pythonista anonymous
  • 8,140
  • 20
  • 70
  • 112

2 Answers2

20

As of pandas 0.25, this is possible with a "Named aggregation".

In [79]: animals = pd.DataFrame({'kind': ['cat', 'dog', 'cat', 'dog'],
   ....:                         'height': [9.1, 6.0, 9.5, 34.0],
   ....:                         'weight': [7.9, 7.5, 9.9, 198.0]})
   ....: 

In [80]: animals
Out[80]: 
  kind  height  weight
0  cat     9.1     7.9
1  dog     6.0     7.5
2  cat     9.5     9.9
3  dog    34.0   198.0

In [82]: animals.groupby("kind").agg(
   ....:     min_height=('height', 'min'),
   ....:     max_height=('height', 'max'),
   ....:     average_weight=('weight', np.mean),
   ....: )
   ....: 
Out[82]: 
      min_height  max_height  average_weight
kind                                        
cat          9.1         9.5            8.90
dog          6.0        34.0          102.75

The previously deprecated version follows:


You can pass a dictionary of dictionaries to .agg mapping {column: {name: aggfunc}}, for example

In [46]: df.head()
Out[46]:
   Year  qtr  realgdp  realcons  realinvs  realgovt  realdpi  cpi_u      M1  \
0  1950    1   1610.5    1058.9     198.1     361.0   1186.1   70.6  110.20
1  1950    2   1658.8    1075.9     220.4     366.4   1178.1   71.4  111.75
2  1950    3   1723.0    1131.0     239.7     359.6   1196.5   73.2  112.95
3  1950    4   1753.9    1097.6     271.8     382.5   1210.0   74.9  113.93
4  1951    1   1773.5    1122.8     242.9     421.9   1207.9   77.3  115.08

   tbilrate  unemp      pop     infl  realint
0      1.12    6.4  149.461   0.0000   0.0000
1      1.17    5.6  150.260   4.5071  -3.3404
2      1.23    4.6  151.064   9.9590  -8.7290
3      1.35    4.2  151.871   9.1834  -7.8301
4      1.40    3.5  152.393  12.6160 -11.2160

In [47]: df.groupby('qtr').agg({"realgdp": {"mean_gdp": "mean", "std_gdp": "std"},
                                "unemp": {"mean_unemp": "mean"}})
Out[47]:
         realgdp                   unemp
        mean_gdp      std_gdp mean_unemp
qtr
1    4506.439216  2104.195963   5.694118
2    4546.043137  2121.824090   5.686275
3    4580.507843  2132.897955   5.662745
4    4617.592157  2158.132698   5.654902

The result has a MultiIndex in the columns. If you don't want that outer level, you can use .columns.droplevel(0).

TomAugspurger
  • 28,234
  • 8
  • 86
  • 69
  • 1
    Thanks! Super useful! You should have this added to the official documentation (unless it's already there and I failed to find it) – Pythonista anonymous Sep 03 '15 at 12:31
  • 2
    It's not in the docs, but there's an issue to document it correctly here: https://github.com/pydata/pandas/issues/9052 A pull-request adding it would be great, if you're up for it! – TomAugspurger Sep 03 '15 at 15:21
  • 1
    This is no longer possible. Python now produces a warning that "using a dict with renaming is deprecated and will be removed in a future version". Why, oh, please someone explain why!!!!!! – Pythonista anonymous Oct 11 '17 at 17:02
1

I agree this is a bit frustrating butI do find chaining with a rename method served my purpose. Also, when it gets really complex, I will just reset the column names. It is a MultiIndex so it is immutable, and you should feel comfortable dealing with levels.

Based on the pandas documentation

The resulting aggregations are named for the functions themselves. If you need to rename, then you can add in a chained operation for a Series like this

In [67]: (grouped['C'].agg([np.sum, np.mean, np.std])
   ....:              .rename(columns={'sum': 'foo',
   ....:                               'mean': 'bar',
   ....:                               'std': 'baz'})
   ....: )
   ....: 
Out[67]: 
          foo       bar       baz
A                                
bar  0.392940  0.130980  0.181231
foo -1.796421 -0.359284  0.912265

When there are multiples uses of one function and you want to name it differently, this question of dropping the level and joining the different levels by underscore will help.

If you do find the sql syntax cleaner, there is a library called pandasql that give you this flexibility.

B.Mr.W.
  • 18,910
  • 35
  • 114
  • 178