
I have a dataframe, hourly_subset_df, where

hourly_subset_df.columns = ['date', 'metadata.campaignName', 'localSpend.amount', 'daily_cap']

It contains different ad campaigns with their corresponding spend values per hour of the day.

I want to do a cumulative sum per group such that I can see, for each campaign, the evolution of spend throughout the day.

I've tried doing:

hourly_subset_df['cumsum'] = hourly_subset_df.groupby(['metadata.campaignName', 'date', 'daily_cap'])['localSpend.amount'].cumsum()

and

hourly_subset_df.groupby(['metadata.campaignName', 'date', 'daily_cap'])['localSpend.amount'].transform(pd.Series.cumsum)

based on other answers I found here (this, this).

The results I get back are just a copy of the spend column ('localSpend.amount') instead of a cumulative sum over each unique group of ['metadata.campaignName', 'date', 'daily_cap'].

Why is this not working?
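
For example, on a toy frame (made-up data, just to illustrate the behaviour I'm after), I'd expect the cumulative sum to accumulate within each group:

import pandas as pd

toy = pd.DataFrame({'campaign': ['a', 'a', 'b', 'b'],
                    'spend': [1.0, 2.0, 3.0, 4.0]})

# accumulates within each campaign: [1.0, 3.0, 3.0, 7.0]
print(toy.groupby('campaign')['spend'].cumsum().tolist())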

EDIT: I've created the sample dataframe as a dict here (it was too large to paste inline).

mcansado
  • Can you include a small sample of your dataframe and your expected output so we can reproduce your issue? – rahlf23 Nov 29 '18 at 17:11
  • You don't tell us what `hourly_subset_df.daily_cap` is, but could it be that there is a one-to-one mapping between values of this column and each row, so that the `df.groupby` object doesn't actually group anything? Mock data would help. – arra Nov 29 '18 at 17:24
  • My guess would be that you are correctly calculating the `cumsum` but your groups are incorrect. Your grouping columns include `date`, which is likely the FULL datetime, meaning that every row basically belongs to its own group, hence the `cumsum` across groups is the same as the original series. You'll likely need to pull out just the date with `hourly_subset_df.date.dt.date` as your grouping column. – ALollz Nov 29 '18 at 17:26
  • I'll try to get the example dataset. I believe the groups are correct as I've checked the groupby separately and it looks as it should. The daily caps are ints which are the same for every date / campaign name combination. – mcansado Nov 29 '18 at 17:29
  • I second @ALollz. Try `hourly_subset_df.groupby(['metadata.campaignName', hourly_subset_df['date'].dt.date, 'daily_cap'])['localSpend.amount'].cumsum()` – Vaishali Nov 29 '18 at 17:41
  • Based on the data you provided, @ALollz is correct: each group is unique, so the `cumsum` will just be the same as `localSpend.amount`. If you want to see the values change over the day, then you need to group by the day. The above recommendation from @Vaishali accomplishes this. – willk Nov 29 '18 at 17:54

2 Answers


One of the columns you're grouping by might be continuous or unique for each observation, so that there is no real grouping going on (which is why your grouped `cumsum()` gives you the same thing as `df.cumsum()`). For example, we can replicate the right result on something much like your data like this...

import numpy as np
import pandas as pd

n_rows = int(1e5)
n_cats = (24, 11, 7)  # number of distinct hours, campaigns, caps
randcats = lambda ln: np.random.choice(np.arange(ln).astype(int), size=n_rows)

cols = np.array(['date',
                 'metadata.campaignName',
                 'localSpend.amount',
                 'daily_cap'])
groupcols = cols[[0, 1, 3]]

# random spend values and random category labels for each row
x = np.random.uniform(420.0, 1100.37, size=n_rows)
hrs, camps, caps = map(randcats, n_cats)
df = pd.DataFrame(np.vstack([hrs, camps, x, caps]).T, columns=cols)
df = df.sort_values(by=cols[[1, 0, -1]], kind='mergesort')

# cumulative spend within each (date, campaign, cap) group
df['cum_sums'] = df.groupby(groupcols.tolist())['localSpend.amount'].cumsum()
df[groupcols] = df[groupcols].astype(int)
grouped = list(df.groupby(groupcols.tolist()))

So we've configured data that looks like...

      date  metadata.campaignName  localSpend.amount  daily_cap
396      0                      0             526.14          0
2502     0                      0             777.32          0
2587     0                      0             777.40          0
7198     0                      0             423.22          0

Then, you can check the values for various groups...

In [11]: grouped[0][-1].head(4).round(2)
Out[11]:
      date  metadata.campaignName  localSpend.amount  daily_cap  cum_sums
396      0                      0             526.14          0    526.14
2502     0                      0             777.32          0   1303.46
2587     0                      0             777.40          0   2080.86
7198     0                      0             423.22          0   2504.08

In [12]: grouped[1][-1].head(4).round(2)
Out[12]:
      date  metadata.campaignName  localSpend.amount  daily_cap  cum_sums
1382     0                      0             798.77          1    798.77
1430     0                      0             682.32          1   1481.09
1990     0                      0            1083.74          1   2564.83
2870     0                      0             775.08          1   3339.91

In [13]: grouped[int(len(grouped)/2)][-1].head(4).round(2)
Out[13]:
      date  metadata.campaignName  localSpend.amount  daily_cap  cum_sums
1045    12                      0             624.18          0    624.18
3708    12                      0             657.94          0   1282.12
4890    12                      0             595.12          0   1877.23
5326    12                      0             891.11          0   2768.34

which you can verify is right.
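
One quick way to check this programmatically, building on the frame above: the final cumulative value in each group must equal that group's total spend.

totals = df.groupby(groupcols.tolist())['localSpend.amount'].sum()
finals = df.groupby(groupcols.tolist())['cum_sums'].last()
assert np.allclose(totals, finals)  # last running total == group sum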

arra

In this specific case, I found the cleanest solution to be

hourly_subset_df['cumsum'] = hourly_subset_df\
.groupby(['metadata.campaignName', 'daily_cap'])\
.agg({'localSpend.amount': 'cumsum'})

This computes the cumulative sum within each campaign name / daily cap group, accumulating across the hourly rows. It does require the dates/hours to be sorted in ascending order, which they are in this case.

My issue was that I was including the date column in the groupby, so each group was completely unique (as @ALollz pointed out).

Also, this works because my data is pulled on a daily basis, so the date doesn't need to be part of the grouping.
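
If the rows ever arrived out of hour order, sorting first would keep this correct; a minimal sketch using the same column names:

# make sure hours are ascending before accumulating
hourly_subset_df = hourly_subset_df.sort_values('date')
hourly_subset_df['cumsum'] = hourly_subset_df\
    .groupby(['metadata.campaignName', 'daily_cap'])['localSpend.amount']\
    .cumsum()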

mcansado