
I have the following DataFrame, taken from a dataset of about 300,000 rows:

    CustomerID  Revenue
0   17850.0     15.30
1   17850.0     11.10
2   13047.0     17.85
3   13047.0     17.85
4   17850.0     20.34
5   13047.0     12.60
6   13047.0     12.60
7   13047.0     31.80
8   17850.0     20.34
9   17850.0     15.30
10  13047.0     9.90
11  13047.0     30.00
12  13047.0     31.80
13  12583.0     40.80
14  12583.0     39.60
15  13047.0     14.85
16  13047.0     14.85
17  12583.0     15.60
18  12583.0     45.00
19  12583.0     70.80

The CustomerID values repeat in batches. For example, the CustomerID value 17850 in the first two rows may appear again later in the dataset. I am trying to group each run of consecutive rows with the same CustomerID and sum the revenue for that run. The transformation I want looks like this:

   CustomerID   TotalRevenue
0   17850.0      26.40
1   13047.0      35.70
2   17850.0      20.34
3   13047.0      57.00
4   17850.0      35.64
5   13047.0      71.70
6   12583.0      80.40
7   13047.0      29.70
8   12583.0     131.40

The problem is that if I use the groupby method, it groups every row with the same CustomerID across the whole DataFrame. That way all the 17850 rows end up in a single group, instead of the first two rows forming one group and each later run of 17850 forming its own group.

I would really appreciate some help with how to do this using Pandas. Thanks.

cs95
  • So let me try to understand what you want: You want to write code that recognizes a change in customerID and groups everything between subsequent changes together? – Mitchell Faas Jul 24 '17 at 16:16
  • Possible duplicate of [Pandas DataFrame: How to groupby consecutive values](https://stackoverflow.com/questions/40802800/pandas-dataframe-how-to-groupby-consecutive-values) – asongtoruin Jul 24 '17 at 16:18
  • @asongtoruin I don't think so. They may be similar but not the same. – cs95 Jul 24 '17 at 17:38

3 Answers

# Label each consecutive run of the same CustomerID, then group on it
grouper = df.CustomerID.diff().ne(0).cumsum()

(df.groupby(['CustomerID', grouper], sort=False)['Revenue'].sum()
   .rename_axis(['CustomerID', 'GID'])
   .reset_index()
   .drop('GID', axis=1))

Output:

   CustomerID  Revenue
0     17850.0    26.40
1     13047.0    35.70
2     17850.0    20.34
3     13047.0    57.00
4     17850.0    35.64
5     13047.0    71.70
6     12583.0    80.40
7     13047.0    29.70
8     12583.0   131.40
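For reference, a self-contained version of the same approach on the question's sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [17850.0, 17850.0, 13047.0, 13047.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 17850.0, 17850.0,
                   13047.0, 13047.0, 13047.0, 12583.0, 12583.0,
                   13047.0, 13047.0, 12583.0, 12583.0, 12583.0],
    'Revenue': [15.30, 11.10, 17.85, 17.85, 20.34, 12.60, 12.60,
                31.80, 20.34, 15.30, 9.90, 30.00, 31.80, 40.80,
                39.60, 14.85, 14.85, 15.60, 45.00, 70.80],
})

# diff() is nonzero (or NaN) wherever the CustomerID changes, so the
# cumulative sum of .ne(0) increments at every change point, giving
# each consecutive run its own group id.
gid = df.CustomerID.diff().ne(0).cumsum()

out = (df.groupby(['CustomerID', gid], sort=False)['Revenue'].sum()
         .rename_axis(['CustomerID', 'GID'])
         .reset_index()
         .drop('GID', axis=1))
print(out)
```

`sort=False` keeps the groups in order of first appearance rather than sorting by CustomerID.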
Scott Boston
  • I think you can add this as another solution: `df.groupby(['CustomerID', (df.CustomerID != df.CustomerID.shift()).cumsum()])` – BENY Jul 24 '17 at 17:21
  • @ScottBoston This seems to work perfectly and gives the desired output. Thanks alot. – Yasar Qamar Jul 24 '17 at 17:26
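The shift-based grouper from the comment above can be sketched like this (shown on a small hypothetical frame; it produces the same run ids as the diff-based version):

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [17850.0, 17850.0, 13047.0, 17850.0],
    'Revenue': [15.30, 11.10, 17.85, 20.34],
})

# A run starts wherever the CustomerID differs from the previous row;
# cumsum() turns those change points into consecutive block ids (1, 1, 2, 3).
gid = (df.CustomerID != df.CustomerID.shift()).cumsum()

out = (df.groupby(['CustomerID', gid], sort=False)['Revenue'].sum()
         .rename_axis(['CustomerID', 'GID'])
         .reset_index()
         .drop('GID', axis=1))
```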
import pandas as pd

# df <- I am assuming that df contains your data

result = df.groupby('CustomerID').sum().rename(columns={'Revenue': 'TotalRevenue'})
biniow

I know this is late and I have not tested this on your example, but I thought the accepted answer's readability was low. This approach is not only legible, but also (1) keeps the result as a DataFrame, (2) supports averages, counts, and summations, and (3) allows grouping by multiple columns.

df.groupby(['att1', 'att2']).agg({'att1': "count", 'att3': "sum", 'att4': "mean"})

using your values...

df.groupby(['CustomerID']).agg({'Revenue': "sum"})
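A self-contained sketch of this agg pattern on a small hypothetical frame. Note that, like any plain groupby on the key column, it aggregates all rows per CustomerID across the whole frame rather than per consecutive run, which is what the question asks for:

```python
import pandas as pd

df = pd.DataFrame({
    'CustomerID': [17850.0, 17850.0, 13047.0],
    'Revenue': [15.30, 11.10, 17.85],
})

# agg with a dict maps each column to its aggregate function;
# rename gives the result the TotalRevenue label from the question.
out = (df.groupby('CustomerID')
         .agg({'Revenue': 'sum'})
         .rename(columns={'Revenue': 'TotalRevenue'}))
```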
xxyjoel