Groupby and sum by 1 column, keep all other columns, and mutate a new column, counting summed rows with pandas

Question

I am new to Python and can see at least 5 similar questions and this one is very close but non of them work for me.

I have a dataframe with non-unique customers.

  customer_id   amount  male age    income     days reward   difficulty    duration
0   id_1       16.06    1    45     62000.0    608   2.0        10.0         10.0
1   id_1       18.00    1    45     62000.0    608   2.0        10.0         10.0

I am trying to group them by customer_id, sum by amount and keep all other columns PLUS add one column total, counting my transactions

Desired output

  customer_id amount   male age    income      days reward   difficulty duration total
 0  id_1       34.06    1    45     62000.0    608   2.0      10.0       10.0      2

My best personal attempt so far does not preserve all columns

groupby('customer_id')['amount'].agg(total_sum = 'sum', total = 'count')

First thing is that you are not grouping by `amount` please check and write more description of what other columns do!. — Space Impact, Aug 11 '20 at 20:12
Thanks, I updated, it was my mistake. Group by customer_id and sum by amount — Anakin Skywalker, Aug 11 '20 at 20:15

Ayoub ZAROU · Answer 1 · 2020-08-11T20:22:13.040

3

you could do:

grouper = df.groupby('customer_id')
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
     'amount': 'size', 
     **first_dict,
})
o['total'] = grouper.size().values

edited Aug 11 '20 at 20:22

answered Aug 11 '20 at 20:12

Ayoub ZAROU

2,387
6
20

SpecificationError: nested renamer is not supported – Anakin Skywalker Aug 11 '20 at 20:17
1

my bad, mispelled male, I didn't get the same error as you, but now it should work as it works for me – Ayoub ZAROU Aug 11 '20 at 20:19
1

which version of pandas you are using ? – Ayoub ZAROU Aug 11 '20 at 20:22
1

I edited my code, it will probably work now regardless of your version – Ayoub ZAROU Aug 11 '20 at 20:24
I updated my question. Maybe you can give a tip how to keep columns in my code? Thanks! – Anakin Skywalker Aug 11 '20 at 20:29
I also think that in your code customer_id is an index now :( – Anakin Skywalker Aug 11 '20 at 20:35
1

Thanks for your time and effort! – Anakin Skywalker Aug 11 '20 at 20:41

score 3 · Accepted Answer · answered Aug 11 '20 at 20:39

3

You could do it this way, include all other columns in your groupby then reset_index after aggregating:

df.groupby(df.columns.difference(['amount']).tolist())['amount']\
  .agg(total_sum='sum',total='count').reset_index()

Output:

   age customer_id  days  difficulty  duration   income  male  reward  total_sum  total
0   45        id_1   608        10.0      10.0  62000.0     1     2.0      34.06      2

answered Aug 11 '20 at 20:39

Scott Boston

147,308
15
139
187

Thanks for your time and effort! That was exactly I was looking for! – Anakin Skywalker Aug 11 '20 at 20:41
Scott, sorry, are you grouping by customer_id or amount? Not sure. I need to group by customer _id. – Anakin Skywalker Aug 11 '20 at 20:48
1

@AnakinSkywalker This is grouping on all column EXCEPT for 'amount' including customer_id. This works a single customer has only one set of values for all other columns except amount. – Scott Boston Aug 11 '20 at 20:49
Scott, how would you do the same, if my unique customers have different values in columns like `difficulty`, `duration`, `reward`? Rows in these columns are not the same and I am trying now to calculate average for `difficulty` and `duration` and sum for `reward`. – Anakin Skywalker Aug 12 '20 at 00:34

Anakin Skywalker · Answer 3 · 2020-08-12T04:42:03.750

Based on @Scott Boston's answer, I found an answer myself too and I acknowledge that my solution is not elegant (maybe something will help to clean it). But it gives me an expanded solution, when I have non-unique rows (for instance, each customer_id has five different transactions).

df.groupby('customer_id').agg({'amount':['sum'], 'reward_':['sum'], 'difficulty':['mean'], 
                                            'duration':['mean'], 'male':['mean'], 'male':['mean'], 
                                            'income':['mean'], 'days':['mean'], 'age':['mean'], 
                                            'customer_id':['count']}).reset_index()

df_grouped = starbucks_grouped.droplevel(1, axis = 1)

My output is

Groupby and sum by 1 column, keep all other columns, and mutate a new column, counting summed rows with pandas

3 Answers3