1

Having a pandas data frame:

    date        path    size
0   2019-05-10  /bar/A  3
1   2019-05-10  /bar/B  7
2   2019-05-10  /bar/C  2
3   2019-05-14  /bar/A  4
4   2019-05-14  /bar/B  8
5   2019-05-14  /bar/C  23
6   2019-05-18  /bar/A  11
7   2019-05-18  /bar/B  75
8   2019-05-18  /bar/C  32

I would like to groupby "path" and return the cumulative sum of the column "size" for each "date"

Looking at this answer: Pandas groupby cumulative sum

a simple df.groupby(["path"])["size"].cumsum() or df.groupby(["path","date"])["size"].cumsum() will not work.

In the end the cumulative sum should be plotted by date and colored by group to indicate the accumulated growth of "size" over time.

            /bar/A /bar/B /bar/C
2019-05-10  3      7      2
2019-05-14  7      15     26
2019-05-18  18     90     58

Is there any pandas-based solution without seaborn or other tools?

Fourier
  • 2,795
  • 3
  • 25
  • 39

1 Answers1

4

I think you can achieve that pivoting the table and then applying the cumulative sum.

pivot = pd.pivot_table(df, values="size", index=["date"], columns=["path"], aggfunc=np.sum)
pivot = pivot.cumsum()

See the results, based on the example of your question:

df
Out[14]: 
         date    path  size
0  2019-05-10  /bar/A     3
1  2019-05-10  /bar/B     7
2  2019-05-10  /bar/C     2
3  2019-05-14  /bar/A     4
4  2019-05-14  /bar/B     8
5  2019-05-14  /bar/C    23
6  2019-05-18  /bar/A    11
7  2019-05-18  /bar/B    75
8  2019-05-18  /bar/C    32
pivot = pd.pivot_table(df, values="size", index=["date"], columns=["path"], aggfunc=np.sum)
pivot.cumsum()
Out[16]: 
path        /bar/A  /bar/B  /bar/C
date                              
2019-05-10       3       7       2
2019-05-14       7      15      25
2019-05-18      18      90      57
edgarzamora
  • 1,472
  • 1
  • 9
  • 17
  • Close. The trick is to use `index=df.date.dt.month` for pivoting. – Fourier Nov 07 '19 at 16:47
  • 1
    Well, in then you are grouping by months and not by days, right? – edgarzamora Nov 07 '19 at 17:01
  • Indeed, in my original post I had included time in hours/mins/sec. In this case, the approach would not work. – Fourier Nov 07 '19 at 17:03
  • 1
    Well, I miss your original post then. But see my last edit. Based on your example it works as expected. But if you have datetimes, instead of dates, then you can decide to just pick hours, days, months, years, it's up to you. But I think the solution I provided is what you was looking for :D – edgarzamora Nov 07 '19 at 17:05
  • 1
    More than fine. Thank you, this works. As a workaround: `df["date"].dt.date` would be universal enough. – Fourier Nov 07 '19 at 17:07