3

I have a pandas data frame that I want to group by two columns and then return the cumulative sum of a third column of strings as a list within one of these groups.

Example:

Year  Bucket  Name
2000    1      A
2001    1      B
2003    1      C
2000    2      B
2002    2      C

The output I want is:

Year  Bucket  Cum_Sum
2000    1      [A]
2001    1      [A,B]
2002    1      [A,B]
2003    1      [A,B,C]
2000    2      [B]
2001    2      [B]
2002    2      [B,C]
2003    2      [B,C]

I tried to piece together an answer from two responses: https://stackoverflow.com/a/39623235/5143841 https://stackoverflow.com/a/22651188/5143841

But I can't quite get there.

JBN
  • 67
  • 4

1 Answers1

4

My Dr. Frankenstein Answer

dat = []
rng = range(df.Year.min(), df.Year.max() + 1)
for b, d in df.groupby('Bucket'):
    for y in rng:
        dat.append([y, b, [*d.Name[d.Year <= y]]])
        
pd.DataFrame(dat, columns=[*df])

   Year  Bucket       Name
0  2000       1        [A]
1  2001       1     [A, B]
2  2002       1     [A, B]
3  2003       1  [A, B, C]
4  2000       2        [B]
5  2001       2        [B]
6  2002       2     [B, C]
7  2003       2     [B, C]

Another freaky answer

rng = range(df.Year.min(), df.Year.max() + 1)
i = [(y, b) for b, d in df.groupby('Bucket') for y in rng]
s = df.set_index(['Year', 'Bucket']).Name.map(lambda x: [x])
s.reindex(i, fill_value=[]).groupby(level=1).apply(pd.Series.cumsum).reset_index()

   Year  Bucket       Name
0  2000       1        [A]
1  2001       1     [A, B]
2  2002       1     [A, B]
3  2003       1  [A, B, C]
4  2000       2        [B]
5  2001       2        [B]
6  2002       2     [B, C]
7  2003       2     [B, C]
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Thanks, I don't mind Frankenstein-ing this! Note that, in my original post, I unnecessarily asked for an imbalanced panel as an output (i.e. up to year 2003 for bucket 1 and up to year 2002 for bucket 2). I edited that now! – JBN Mar 17 '21 at 21:32