I have a data set like this:

data=pd.DataFrame({'id':pd.Series([1,1,1,2,2,3,3,3]),'var1':pd.Series([1,2,3,4,5,6,7,8]),'var2':pd.Series([11,12,13,14,15,16,17,18]),
'var3':pd.Series([21,22,23,24,25,26,27,28])})

I need to calculate the group-wise cumulative sum of every column (var1, var2, var3), grouped by id. How can I write Python code to create this output?

Thanks in advance.

RSK

2 Answers

If I have understood you correctly, you can use DataFrame.groupby to compute the cumulative sum of each column within each 'id' group. Something like:

import pandas as pd
data = pd.DataFrame({'id': [1,1,1,2,2,3,3,3], 'var1': [1,2,3,4,5,6,7,8], 'var2': [11,12,13,14,15,16,17,18], 'var3': [21,22,23,24,25,26,27,28]})
# Running total of var1, var2 and var3, restarted for each id
data.groupby('id')[['var1', 'var2', 'var3']].cumsum()
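
For reference, here is a minimal sketch of a per-id running total that also keeps the id column in the result. The names value_cols, running and result are only illustrative, and I am assuming the rows should stay in their original order:

```python
import pandas as pd

data = pd.DataFrame({
    'id':   [1, 1, 1, 2, 2, 3, 3, 3],
    'var1': [1, 2, 3, 4, 5, 6, 7, 8],
    'var2': [11, 12, 13, 14, 15, 16, 17, 18],
    'var3': [21, 22, 23, 24, 25, 26, 27, 28],
})

value_cols = ['var1', 'var2', 'var3']  # illustrative name, not from the question

# Running total of each value column, restarted at every new id.
running = data.groupby('id')[value_cols].cumsum()

# Put id back next to the running totals; both frames share the original index.
result = pd.concat([data[['id']], running], axis=1)
print(result)
# var1 becomes 1, 3, 6 for id 1, then 4, 9 for id 2, then 6, 13, 21 for id 3
```

Selecting the value columns explicitly keeps id out of the summation, so this does not depend on how a given pandas version treats the grouping column.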
RickardSjogren

I am not familiar with the pd object you are using, but as I understand your question, you have a list of labels (id in your code) and several lists of the same length (var1, var2 and var3 in your code), and you want to sum the items that share a label, once per label, and return the results.

The following code solves the general problem (assuming your list of labels is sorted). Note that it returns one total per label rather than a running sum:

from functools import reduce # reduce is no longer a builtin in Python 3
from operator import add

def cumsum(A):
    return reduce(add, A) # sum of all items in A

def cumsumlbl(A, lbl):
    # begin index of each lbl subsequence; sorted because set order is arbitrary
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl)) # end index of the last subsequence

    return [cumsum(A[i:j]) for (i,j) in zip(idx[:-1], idx[1:])]

Alternatively, using a modified version of Markus Jarderot's code that appears here:

from functools import reduce
from operator import add

def cumsum(A):
    return reduce(add, A)

def doublet(iterable):
    iterator = iter(iterable)
    item = next(iterator) # next() builtin instead of the Python 2 .next() method
    for nxt in iterator:
        yield (item, nxt)
        item = nxt

def cumsumlbl(A, lbl):
    idx = sorted(lbl.index(item) for item in set(lbl))
    idx.append(len(lbl))
    dbl = doublet(idx) # generator for successive, overlapping pairs of indices

    return [cumsum(A[i:j]) for (i,j) in dbl]

And to test:

if __name__ == '__main__':
    A = [1, 2, 3, 4, 5, 6]
    lbl = [1, 1, 2, 2, 2, 3]
    print(cumsumlbl(A, lbl))

Output:

[3, 12, 6]
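
If what is wanted is a running sum within each label rather than one total per label, a sketch along these lines might do it with the standard library alone. The function name running_sum_by_label is just made up for illustration, and as above I am assuming that equal labels sit next to each other:

```python
from itertools import accumulate, groupby
from operator import itemgetter

def running_sum_by_label(values, labels):
    """Running sum of values, restarted whenever the label changes."""
    out = []
    # groupby only clusters consecutive equal labels, hence the sorted-labels assumption
    for _, pairs in groupby(zip(labels, values), key=itemgetter(0)):
        out.extend(accumulate(v for _, v in pairs))
    return out

A = [1, 2, 3, 4, 5, 6]
lbl = [1, 1, 2, 2, 2, 3]
print(running_sum_by_label(A, lbl))  # [1, 3, 3, 7, 12, 6]
```

The last value of each label's run matches the per-label totals printed above (3, 12 and 6).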
Sia