52

I would like to group the rows of a dataframe by one column and get back a dataframe in which I can choose, per column, which aggregation function is applied. The default should be the value of the first entry in the group.

(it would be nice if the solution also worked for a combination of two columns)

Example

#!/usr/bin/env python

"""Test data frame grouping."""

# 3rd party modules
import pandas as pd


df = pd.DataFrame([{'id': 1, 'price': 123, 'name': 'anna', 'amount': 1},
                   {'id': 1, 'price':   7, 'name': 'anna', 'amount': 2},
                   {'id': 2, 'price':  42, 'name': 'bob', 'amount': 30},
                   {'id': 3, 'price':   1, 'name': 'charlie', 'amount': 10},
                   {'id': 3, 'price':   2, 'name': 'david', 'amount': 100}])
print(df)

gives the dataframe:

   amount  id     name  price
0       1   1     anna    123
1       2   1     anna      7
2      30   2      bob     42
3      10   3  charlie      1
4     100   3    david      2

And I would like to get:

amount  id     name  price
     3   1     anna    130
    30   2      bob     42
   110   3  charlie      3

So:

  • Entries with the same value in the id column belong together. After that operation, there should still be an id column, but it should have only unique values.
  • All values in amount and price which have the same id get summed up
  • For name, just the first one (by the current order of the dataframe) is taken.

Is this possible with Pandas?

Martin Thoma
  • What's wrong with `df_new = df.groupby(df['id']).aggregate({'price': 'sum', 'name': 'first', 'amount': 'sum'})`? Does that not work for your use case? – cs95 Oct 19 '17 at 09:31
  • Hahaha, OK, I didn't try it. I just thought this is what such a function should look like. Nice that it accidentally works. I'll edit my question and make that an answer. – Martin Thoma Oct 19 '17 at 10:17

2 Answers

70

You are looking for

aggregation_functions = {'price': 'sum', 'name': 'first', 'amount': 'sum'}
df_new = df.groupby(df['id']).aggregate(aggregation_functions)

which gives

    price     name  amount
id                        
1     130     anna       3
2      42      bob      30
3       3  charlie     110
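
The question also asks for the first entry of each group to be the default. A minimal sketch of that, assuming a hypothetical helper `group_with_default` (not part of pandas) that fills in 'first' for every column without an explicit function:

```python
import pandas as pd

df = pd.DataFrame([{'id': 1, 'price': 123, 'name': 'anna', 'amount': 1},
                   {'id': 1, 'price': 7, 'name': 'anna', 'amount': 2},
                   {'id': 2, 'price': 42, 'name': 'bob', 'amount': 30},
                   {'id': 3, 'price': 1, 'name': 'charlie', 'amount': 10},
                   {'id': 3, 'price': 2, 'name': 'david', 'amount': 100}])


def group_with_default(frame, by, overrides, default='first'):
    """Hypothetical helper: group `frame` by `by`.

    Columns listed in `overrides` use the given function;
    every other non-key column falls back to `default`.
    """
    keys = [by] if isinstance(by, str) else list(by)
    agg = {col: overrides.get(col, default)
           for col in frame.columns if col not in keys}
    return frame.groupby(keys, as_index=False).agg(agg)


# Only the summed columns are named; 'name' falls back to 'first'.
result = group_with_default(df, 'id', {'price': 'sum', 'amount': 'sum'})
print(result)
```

Note that the 'first' aggregation takes the first non-null value in each group, so it can differ from the literal first row when a column contains missing values.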
Martin Thoma
23

To keep the original column order, add reindex, because aggregating with a dict does not preserve it:

d = {'price': 'sum', 'name': 'first', 'amount': 'sum'}
df_new = df.groupby('id', as_index=False).aggregate(d).reindex(columns=df.columns)
print(df_new)

   amount  id     name  price
0       3   1     anna    130
1      30   2      bob     42
2     110   3  charlie      3
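
The question also asks about grouping by a combination of two columns. As a sketch, `groupby` accepts a list of column names; using `id` and `name` as the keys here is only an illustration, not something the question specifies:

```python
import pandas as pd

df = pd.DataFrame([{'id': 1, 'price': 123, 'name': 'anna', 'amount': 1},
                   {'id': 1, 'price': 7, 'name': 'anna', 'amount': 2},
                   {'id': 2, 'price': 42, 'name': 'bob', 'amount': 30},
                   {'id': 3, 'price': 1, 'name': 'charlie', 'amount': 10},
                   {'id': 3, 'price': 2, 'name': 'david', 'amount': 100}])

# Rows belong together only if both 'id' and 'name' match.
d = {'price': 'sum', 'amount': 'sum'}
df_two = df.groupby(['id', 'name'], as_index=False).aggregate(d)
print(df_two)
```

With these keys, the two rows for id 3 stay separate because their names differ, so the result has four rows instead of three.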
jezrael