Using cumsum in pandas on group()

Question

From a Pandas newbie: I have data that looks essentially like this -

 data1=pd.DataFrame({'Dir':['E','E','W','W','E','W','W','E'], 'Bool':['Y','N','Y','N','Y','N','Y','N'], 'Data':[4,5,6,7,8,9,10,11]}, index=pd.DatetimeIndex(['12/30/2000','12/30/2000','12/30/2000','1/2/2001','1/3/2001','1/3/2001','12/30/2000','12/30/2000']))
data1
Out[1]: 
           Bool  Data Dir
2000-12-30    Y     4   E
2000-12-30    N     5   E
2000-12-30    Y     6   W
2001-01-02    N     7   W
2001-01-03    Y     8   E
2001-01-03    N     9   W
2000-12-30    Y    10   W
2000-12-30    N    11   E

And I want to group it by multiple levels, then do a cumsum():

E.g., like running_sum=data1.groupby(['Bool','Dir']).cumsum() <-(Doesn't work)

with output that would look something like:

Bool Dir Date        running_sum
N    E   2000-12-30           16
     W   2001-01-02            7
         2001-01-03           16
Y    E   2000-12-30            4
         2001-01-03           12
     W   2000-12-30           16

My "like" code is clearly not even close. I have made a number of attempts and learned many new things about how not to do this.

Thanks for any help you can give.

bdiamante · Accepted Answer · 2013-04-02T13:38:08.807

Try this:

data2 = data1.reset_index()
data3 = data2.set_index(["Bool", "Dir", "index"])   # index is the new column created by reset_index
running_sum = data3.groupby(level=[0,1,2]).sum().groupby(level=[0,1]).cumsum()

The reason you cannot simply use cumsum on data3 has to do with how your data is structured. Grouping by Bool and Dir and applying an aggregation function (sum, mean, etc) would produce a DataFrame of a smaller size than you started with, as whatever function you used would aggregate values based on your group keys. However cumsum is not an aggreagation function. It wil return a DataFrame that is the same size as the one it's called with. So unless your input DataFrame is in a format where the output can be the same size after calling cumsum, it will throw an error. That's why I called sum first, which returns a DataFrame in the correct input format.

Sorry if I haven't explained this well enough. Maybe someone else could help me out?

Thanks for the solution and explanation. The way I looked at it was to examine the structure of data3.groupby(level=[0,1,2]).sum() which clarified why its necessary to append .groupby(level=[0,1]).cumsum(). I never thought data3.cumsum() would work. — , Aug 29 '15 at 23:54

Malina Kirn · Answer 2 · 2015-09-30T13:23:39.147

As the other answer points out, you're trying to collapse identical dates into single rows, whereas the cumsum function will return a series of the same length as the original DataFrame. Stated differently, you actually want to group by [Bool, Dir, Date], calculate a sum in each group, THEN return a cumsum on rows grouped by [Bool, Dir]. The other answer is a perfectly valid solution to your specific question, here's a one-liner variation:

data1.groupby(['Bool', 'Dir', 'Date']).sum().groupby(level=[0, 1]).cumsum()

This returns output exactly in the requested format.

For those looking for a simple cumsum on a Pandas group, you can use:

data1.groupby(['Bool', 'Dir']).apply(lambda x: x['Data'].cumsum())

The cumulative sum is calculated internal to each group. Here's what the output looks like:

Bool  Dir            
N     E    2000-12-30     5
           2000-12-30    16
      W    2001-01-02     7
           2001-01-03    16
Y     E    2000-12-30     4
           2001-01-03    12
      W    2000-12-30     6
           2000-12-30    16
Name: Data, dtype: int64

Note the repeated dates, but this is doing a strict cumulative sum internal to the rows of each group identified by the Bool and Dir columns.

This method gives me a ValueError coming from lib\site-packages\pandas\core\groupby.pyc line 581 which raises it on AttributeError exception. I'm using numpy-1.9.2 and pandas-0.16.2. — , Aug 29 '15 at 23:36
@TrisNefzger: I reproduced your error with the same versions (I was using pandas 14). I've significantly modified my answer to work with newer pandas. — Malina Kirn, Sep 30 '15 at 13:24

Using cumsum in pandas on group()

2 Answers2

Linked