
Suppose I have the following DataFrame:

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'], 
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15', 
                             '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])
df

Event         Date  Sale
    A   2019-01-01   100
    B   2019-02-01   200
    A   2019-03-01   150
    A   2019-03-01   200
    B   2019-02-15   150
    C   2019-03-15   100
    B   2019-04-05   300
    B   2019-04-05   250
    A   2019-04-15   500
    C   2019-06-10   400

I would like to obtain the following result:

Event         Date  Sale   Total_Previous_Sale
    A   2019-01-01   100                     0
    B   2019-02-01   200                     0
    A   2019-03-01   150                   100
    A   2019-03-01   200                   100
    B   2019-02-15   150                   200
    C   2019-03-15   100                     0
    B   2019-04-05   300                   350
    B   2019-04-05   250                   350
    A   2019-04-15   500                   450
    C   2019-06-10   400                   100

where df['Total_Previous_Sale'] is the total sale amount (df['Sale']) of the same event (df['Event']) on all dates strictly before the row's date (df['Date']). For instance,

  • The total sales of event A before 2019-01-01 is 0,
  • The total sales of event A before 2019-03-01 is 100, and
  • The total sales of event A before 2019-04-15 is 100 + 150 + 200 = 450.

Basically, it is almost a conditional cumulative sum, but over all previous values only (excluding the current value or values). I am able to obtain the desired result using this line:

df['Sale_Total'] = [df.loc[(df['Event'] == df.loc[i, 'Event']) & (df['Date'] < df.loc[i, 'Date']), 
                           'Sale'].sum() for i in range(len(df))]

It is slow, but it works fine. I believe there is a better and faster way to do it. I have tried these lines:

df['Total_Previous_Sale'] = df[df['Date'] < df['Date']].groupby(['Event'])['Sale'].cumsum()

or

df['Total_Previous_Sale'] = df.groupby(['Event'])['Sale'].shift(1).cumsum().fillna(0)

but they either produce NaNs or come up with an unwanted result.
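For what it's worth, the second attempt misbehaves because shift(1) is applied per group, but the cumsum that follows runs over the plain Series that groupby returns, so sales from different events accumulate together instead of resetting per event. A minimal illustration:

```python
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])

# groupby(...).shift(1) returns an ordinary Series, so the cumsum
# chained after it is global, not per event: the running total keeps
# growing across event boundaries
bad = df.groupby('Event')['Sale'].shift(1).cumsum().fillna(0)
print(bad.tolist())
```

The last rows climb past 1000 even though no single event sums that high, confirming that values leak across events.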

2 Answers


First aggregate the sum per Event and Date to get a MultiIndex, then group by the first level (Event) and apply shift plus cumulative sum in a lambda function, and finally join back to the original DataFrame:

s = (df.groupby(['Event', 'Date'])['Sale']
       .sum().groupby(level=0)
       .apply(lambda x: x.shift(1).cumsum())
       .fillna(0))

df = df.join(s.rename('Total_Previous_Sale'), on=['Event','Date'])
print (df)
  Event        Date  Sale  Total_Previous_Sale
0     A  2019-01-01   100                  0.0
1     B  2019-02-01   200                  0.0
2     A  2019-03-01   150                100.0
3     A  2019-03-01   200                100.0
4     B  2019-02-15   150                200.0
5     C  2019-03-15   100                  0.0
6     B  2019-04-05   300                350.0
7     B  2019-04-05   250                350.0
8     A  2019-04-15   500                450.0
9     C  2019-06-10   400                100.0
jezrael
    Ah, thank you for the answer. It seems that your technique can be applied to answer [this question of mine](https://stackoverflow.com/q/58066706/3397819) as well. Would you be kind enough to answer it, please? – Anastasiya-Romanova 秀 Sep 24 '19 at 07:48

Finally, I found a better and faster way to get the desired result. It turns out to be very easy. One can try:

df['Total_Previous_Sale'] = df.groupby('Event')['Sale'].cumsum() \
                          - df.groupby(['Event', 'Date'])['Sale'].cumsum()
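The idea is that the per-event running total includes the current date's rows, and subtracting the per-(event, date) running total removes exactly that same-date portion, leaving only sales from strictly earlier dates. Note this relies on dates appearing in chronological order within each event, as they do in the sample data. A self-contained sanity check against the slow list-comprehension baseline from the question:

```python
import pandas as pd

df = pd.DataFrame({'Event': ['A', 'B', 'A', 'A', 'B', 'C', 'B', 'B', 'A', 'C'],
                   'Date': ['2019-01-01', '2019-02-01', '2019-03-01', '2019-03-01', '2019-02-15',
                            '2019-03-15', '2019-04-05', '2019-04-05', '2019-04-15', '2019-06-10'],
                   'Sale': [100, 200, 150, 200, 150, 100, 300, 250, 500, 400]})
df['Date'] = pd.to_datetime(df['Date'])

# fast: per-event running total minus per-(event, date) running total;
# the subtraction removes the same-date rows, leaving strictly earlier
# dates (assumes dates are sorted within each event)
fast = (df.groupby('Event')['Sale'].cumsum()
        - df.groupby(['Event', 'Date'])['Sale'].cumsum())

# slow baseline from the question, for comparison
slow = [df.loc[(df['Event'] == df.loc[i, 'Event'])
               & (df['Date'] < df.loc[i, 'Date']), 'Sale'].sum()
        for i in range(len(df))]

assert fast.tolist() == slow
print(fast.tolist())
```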