64

I'm trying to do some aggregations on a pandas data frame. Here is a sample code:

import pandas as pd

df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})

df.groupby(["User"]).agg({"Amount": {"Sum": "sum", "Count": "count"}})

Out[1]: 
      Amount      
         Sum Count
User              
user1   18.0     2
user2   20.5     3
user3   10.5     1

Which generates the following warning:

FutureWarning: using a dict with renaming is deprecated and will be removed in a future version return super(DataFrameGroupBy, self).aggregate(arg, *args, **kwargs)

How can I avoid this?

smci
  • 32,567
  • 20
  • 113
  • 146
Victor Mayrink
  • 1,064
  • 1
  • 13
  • 24
  • 10
    I'd love to know why this is being depreciated (I'm sure there is a good reason). Does anyone have a link to a discussion on it? – Stephen McAteer Nov 24 '17 at 05:31
  • To focus on the keywords of the solution rather than just the existing warning, I retitled ***"rename result columns from aggregation"*** and tagged. Now people might even find this question :) ahead of (say) the not-so-canonical[Naming returned columns in Pandas aggregate function?](https://stackoverflow.com/questions/19078325/naming-returned-columns-in-pandas-aggregate-function) – smci May 08 '18 at 23:32
  • 2
    Hopefully this will be addressed in https://github.com/pandas-dev/pandas/issues/18366 – Nickolay May 28 '18 at 08:06
  • How would this work if I don't do a "groupby" but I'm doing "pivot" instead? – avloss Jun 11 '18 at 12:57
  • See pandas 0.25.0 [aggregation relabeling](https://pandas.pydata.org/pandas-docs/stable/whatsnew/v0.25.0.html#groupby-aggregation-with-relabeling) – Scott Boston Jul 30 '19 at 20:30

6 Answers6

94

Use groupby apply and return a Series to rename columns

Use the groupby apply method to perform an aggregation that

  • Renames the columns
  • Allows for spaces in the names
  • Allows you to order the returned columns in any way you choose
  • Allows for interactions between columns
  • Returns a single level index and NOT a MultiIndex

To do this:

  • create a custom function that you pass to apply
  • This custom function is passed each group as a DataFrame
  • Return a Series
  • The index of the Series will be the new columns

Create fake data

df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
                  'Score': [9, 1, 8, 7, 7, 6, 9]})

enter image description here

create custom function that returns a Series
The variable x inside of my_agg is a DataFrame

def my_agg(x):
    names = {
        'Amount mean': x['Amount'].mean(),
        'Amount std':  x['Amount'].std(),
        'Amount range': x['Amount'].max() - x['Amount'].min(),
        'Score Max':  x['Score'].max(),
        'Score Sum': x['Score'].sum(),
        'Amount Score Sum': (x['Amount'] * x['Score']).sum()}

    return pd.Series(names, index=['Amount range', 'Amount std', 'Amount mean',
                                   'Score Sum', 'Score Max', 'Amount Score Sum'])

Pass this custom function to the groupby apply method

df.groupby('User').apply(my_agg)

enter image description here

The big downside is that this function will be much slower than agg for the cythonized aggregations

Using a dictionary with groupby agg method

Using a dictionary of dictionaries was removed because of its complexity and somewhat ambiguous nature. There is an ongoing discussion on how to improve this functionality in the future on github Here, you can directly access the aggregating column after the groupby call. Simply pass a list of all the aggregating functions you wish to apply.

df.groupby('User')['Amount'].agg(['sum', 'count'])

Output

       sum  count
User              
user1  18.0      2
user2  20.5      3
user3  10.5      1

It is still possible to use a dictionary to explicitly denote different aggregations for different columns, like here if there was another numeric column named Other.

df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
              "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0],
              'Other': [1,2,3,4,5,6]})

df.groupby('User').agg({'Amount' : ['sum', 'count'], 'Other':['max', 'std']})

Output

      Amount       Other          
         sum count   max       std
User                              
user1   18.0     2     6  3.535534
user2   20.5     3     5  1.527525
user3   10.5     1     4       NaN
Ted Petrou
  • 59,042
  • 19
  • 131
  • 136
  • 21
    But suppose you do want the renaming aspect as well, to have different column names than the default after aggregation. Is there some syntax to still get this functionality? – ErnestScribbler Sep 25 '17 at 14:13
  • 2
    Had same question about naming, since I am using the same col twice (one min and one max) and need a way to refer to them uniquely when I put results back in objects. – mgmonteleone Oct 09 '17 at 22:40
  • 1
    If you want to rename the columns you will have to do it manually. Either use a list to replace all of them `df.columns = ['your', 'new', 'col', 'names']` or use the `rename` method which will be a bit difficult since you have this results in a MultiIndex. – Ted Petrou Oct 09 '17 at 22:44
  • @ErnestScribbler I have updated this answer to show a fairly simple way of performing an aggregation, renaming and returning a single-level index. – Ted Petrou Dec 07 '17 at 14:45
  • 2
    How would the apply method work with 'first' and 'last'? – Gregory Saxton Feb 16 '18 at 19:14
  • why don't you need to pass an argument to my_agg when you call it? – Snow Sep 27 '18 at 11:20
  • @Snow .agg() can take a function that as parameter. – CodeMonkey Jan 29 '19 at 14:52
  • I entered wrong column and got the same deprecated warning `df.groupby('User')['Amount'].agg({'Amount' : ['sum', 'count'], 'Other':['max', 'std']})` – user3226167 Aug 20 '19 at 06:50
21

Update for Pandas 0.25+ Aggregation relabeling

import pandas as pd

print(pd.__version__)
#0.25.0

df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})

df.groupby("User")['Amount'].agg(Sum='sum', Count='count')

Output:

        Sum  Count
User              
user1  18.0      2
user2  20.5      3
user3  10.5      1
Community
  • 1
  • 1
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • 5
    But in this case, the names `Sum` and `Count` have to be valid Python names. You couldn't replace `Sum` with `Sum of foos`. :( – Dror Jan 10 '20 at 13:18
  • 2
    @Dror you can use this format `df.groupby('User').agg(**{'sum of foos':pd.NamedAgg('Amount','sum'), 'count of foos':pd.NamedAgg('Amount','count')})` – Scott Boston Jan 10 '20 at 17:25
  • Indeed, this works, but it unfortunately, it is so much more cumbersome when compared to the dictionary based aggregation API. – Dror Jan 13 '20 at 07:07
  • 2
    @Dror You don't need to include the `pd.NamedAgg` part which shortens the amount of code. This is now my preferred way of doing this. Pass in a dictionary with `**` in front of it. – Corey Levinson Dec 11 '20 at 23:02
17

If you replace the internal dictionary with a list of tuples it gets rid of the warning message

import pandas as pd

df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0]})

df.groupby(["User"]).agg({"Amount": [("Sum", "sum"), ("Count", "count")]})
Jacob Stevenson
  • 3,718
  • 1
  • 13
  • 10
  • Do you know if this will also (presumably) work in the future release or is also just to workaround the warning message? – Peanut Jun 06 '18 at 12:42
  • @Peanut, I do not know. But if the warning message is not there, then presumably (as you say) it will continue to be supported. – Jacob Stevenson Jun 06 '18 at 16:30
  • 8
    This is an undocumented and accidental feature and I would highly suggest no one using this syntax as it may not work in the future. – Ted Petrou Jun 13 '18 at 12:27
  • Thanks for the information @TedPetrou. And thanks for the link to the discussion in your answer. It sounds like a difficult question to find the right syntax. – Jacob Stevenson Jun 13 '18 at 16:35
  • Is this indeed unstable behavior as @TedPetrou mentioned? – Dror Jan 10 '20 at 13:19
12

This worked for me, Pandas version 1.2.4

For each column we add a list which consists of tuples:

df.groupby('column to group by').agg(
{'column name': [('new column name', 'function to apply')]})

Example

# Create DataFrame
df=pd.DataFrame(data={'id':[1,1,2,3],'col1': [1,2,1,5], 'col2':[5,8,6,4]})

# Apply grouping 
grouped = df.groupby('id').agg({
                             'col1': [('name1', 'sum')], 
                             'col2': [('name2_mean', 'sum'), ('name2_custom_std', lambda x: np.std(x))]})

# Drop multi-index for columns and reset index
grouped.columns = grouped.columns.droplevel()
grouped.reset_index(inplace=True)

Result:

id name1 name2_mean name2_custom_std
0 1 3 13 1.5
1 2 1 6 0.0
2 3 5 4 0.0
Anna
  • 121
  • 1
  • 4
3

This is what I did:

Create a fake dataset:

import pandas as pd
df = pd.DataFrame({"User": ["user1", "user2", "user2", "user3", "user2", "user1", "user3"],
                  "Amount": [10.0, 5.0, 8.0, 10.5, 7.5, 8.0, 9],
                  'Score': [9, 1, 8, 7, 7, 6, 9]})
df

O/P:

    Amount  Score   User
0   10.0    9   user1
1   5.0 1   user2
2   8.0 8   user2
3   10.5    7   user3
4   7.5 7   user2
5   8.0 6   user1
6   9.0 9   user3

I first made the User the index, and then a groupby:

ans = df.set_index('User').groupby(level=0)['Amount'].agg([('Sum','sum'),('Count','count')])
ans

Solution:

    Sum Count
User        
user1   18.0    2
user2   20.5    3
user3   19.5    2
JodeCharger100
  • 903
  • 1
  • 12
  • 25
0

Replace the inner dictionaries with a list of correctly named functions.

To rename the function I'm using this utility function:

def aliased_aggr(aggr, name):
    if isinstance(aggr,str):
        def f(data):
            return data.agg(aggr)
    else:
        def f(data):
            return aggr(data)
    f.__name__ = name
    return f

The group-by statement then becomes:


df.groupby(["User"]).agg({"Amount": [ 
    aliased_aggr("sum","Sum"),
    aliased_aggr("count","Count")
]

If you have bigger, reusable aggregation specs, you can convert them with

def convert_aggr_spec(aggr_spec):
    return {
        col : [ 
            aliased_aggr(aggr,alias) for alias, aggr in aggr_map.items() 
        ]  
        for col, aggr_map in aggr_spec.items() 
    }

So you can say

df.groupby(["User"]).agg(convert_aggr_spec({"Amount": {"Sum": "sum", "Count": "count"}}))

See also https://github.com/pandas-dev/pandas/issues/18366#issuecomment-476597674

plankthom
  • 111
  • 1
  • 5
  • Ok, I got carried away with this: this is essentially recreating what already is in pandas, as illustrated in [this solution](https://stackoverflow.com/a/50697003/1419571) proposed by @jacob-stevenson – plankthom Mar 26 '19 at 12:50
  • ... but there are other circumstances where it is useful to carry the aliases in the aggregation spec items themselves. So i leave the answer as is. – plankthom Mar 26 '19 at 12:57