-1

I am new to Python and can see at least 5 similar questions and this one is very close but non of them work for me.

I have a dataframe with non-unique customers.

  customer_id   amount  male age    income     days reward   difficulty    duration
0   id_1       16.06    1    45     62000.0    608   2.0        10.0         10.0
1   id_1       18.00    1    45     62000.0    608   2.0        10.0         10.0

I am trying to group them by customer_id, sum by amount and keep all other columns PLUS add one column total, counting my transactions

Desired output

  customer_id amount   male age    income      days reward   difficulty duration total
 0  id_1       34.06    1    45     62000.0    608   2.0      10.0       10.0      2

My best personal attempt so far does not preserve all columns

groupby('customer_id')['amount'].agg(total_sum = 'sum', total = 'count')

enter image description here

Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63

3 Answers3

3

you could do:

grouper = df.groupby('customer_id')
first_dict = {col: 'first' for col in df.columns.difference(['customer_id', 'amount'])}
o = grouper.agg({
     'amount': 'size', 
     **first_dict,
})
o['total'] = grouper.size().values
Ayoub ZAROU
  • 2,387
  • 6
  • 20
3

You could do it this way, include all other columns in your groupby then reset_index after aggregating:

df.groupby(df.columns.difference(['amount']).tolist())['amount']\
  .agg(total_sum='sum',total='count').reset_index()

Output:

   age customer_id  days  difficulty  duration   income  male  reward  total_sum  total
0   45        id_1   608        10.0      10.0  62000.0     1     2.0      34.06      2
Scott Boston
  • 147,308
  • 15
  • 139
  • 187
  • Thanks for your time and effort! That was exactly I was looking for! – Anakin Skywalker Aug 11 '20 at 20:41
  • Scott, sorry, are you grouping by customer_id or amount? Not sure. I need to group by customer _id. – Anakin Skywalker Aug 11 '20 at 20:48
  • 1
    @AnakinSkywalker This is grouping on all column EXCEPT for 'amount' including customer_id. This works a single customer has only one set of values for all other columns except amount. – Scott Boston Aug 11 '20 at 20:49
  • Scott, how would you do the same, if my unique customers have different values in columns like `difficulty`, `duration`, `reward`? Rows in these columns are not the same and I am trying now to calculate average for `difficulty` and `duration` and sum for `reward`. – Anakin Skywalker Aug 12 '20 at 00:34
0

Based on @Scott Boston's answer, I found an answer myself too and I acknowledge that my solution is not elegant (maybe something will help to clean it). But it gives me an expanded solution, when I have non-unique rows (for instance, each customer_id has five different transactions).

df.groupby('customer_id').agg({'amount':['sum'], 'reward_':['sum'], 'difficulty':['mean'], 
                                            'duration':['mean'], 'male':['mean'], 'male':['mean'], 
                                            'income':['mean'], 'days':['mean'], 'age':['mean'], 
                                            'customer_id':['count']}).reset_index()

df_grouped = starbucks_grouped.droplevel(1, axis = 1) 

My output is

Anakin Skywalker
  • 2,400
  • 5
  • 35
  • 63