Calculate cumulative column with calculation

Question

I have the following dataframe:

data = {'month': {0: Timestamp('2019-01-01 00:00:00'),
  1: Timestamp('2019-02-01 00:00:00'),
  2: Timestamp('2019-03-01 00:00:00'),
  3: Timestamp('2019-04-01 00:00:00'),
  4: Timestamp('2019-05-01 00:00:00')},
 'base_expenses': {0: 200.0, 1: 200.0, 2: 200.0, 3: 200.0, 4: 200.0},
 'base_contribution': {0: 100.0, 1: 100.0, 2: 100.0, 3: 100.0, 4: 100.0}}

df = pd.DataFrame(data)
df

month   base_expenses   base_contribution
0   2019-01-01  200.0   100.0
1   2019-02-01  200.0   100.0
2   2019-03-01  200.0   100.0
3   2019-04-01  200.0   100.0
4   2019-05-01  200.0   100.0

This data will represent an investment which has additional contributions added to it every month and which grows by a certain percentage every month.

So for example the starting balance of the investment is 50000. Every month we add base_contribution to the balance. Lastly, every month the balance grows by 0.6%.

I can calculate all this using a loop as follows:

CURRENT_BALANCE = 50000
MONTHLY_INVESTMENT_RETURN = 0.006

df['base_balance'] = CURRENT_BALANCE
for index, row in df.iterrows():
    if index == 0:
        balance = row['base_contribution'] + row['base_balance']
        balance += balance * MONTHLY_INVESTMENT_RETURN
        df.loc[row.name, 'base_balance'] = balance
    else:
        balance = row['base_contribution'] + df.loc[row.name - 1, 'base_balance']
        balance += balance * MONTHLY_INVESTMENT_RETURN
        df.loc[row.name, 'base_balance'] = balance

The result would be:

    month   base_expenses   base_contribution   base_balance
0   2019-01-01  200.0   100.0   50422.344909
1   2019-02-01  200.0   100.0   50847.407197
2   2019-03-01  200.0   100.0   51275.204349
3   2019-04-01  200.0   100.0   51705.753960
4   2019-05-01  200.0   100.0   52139.073741

The real data that I am working with is very large and so I would prefer to avoid this looping approach if possible. Is there a way to do this in a vectorised manor or without looping?

According to your code, for the first line I get : (50000+100)*1.006 = 50400.6 . So I don't understand how you get 50422.344909. — Ismael EL ATIFI, Aug 28 '20 at 09:09
probably write the code in numpy and use numba to speed it up — sammywemmy, Aug 28 '20 at 09:59
@IsmaelELATIFI that is because I rounded MONTHLY_INVESTMENT_RETURN to 3 decimal points for the sake of simplicity. The actual number is 0.00643403011000343. My apologies for the confusion however the core of the question remains. — darkpool, Aug 28 '20 at 10:54
Ok I see. By the way there is no point in vectorizing here because your calculations are sequential i.e. you need to compute for row N to be able to compute for row N+1. So because it cannot be parrallelized this is not something worth to vectorize. — Ismael EL ATIFI, Aug 28 '20 at 11:21
Is the base_contribution constant or can it vary. If constant, the way can be shorter than the loop, calculating the return of the money of the first month and then use cumsum — thomas, Aug 28 '20 at 12:10

score 0 · Accepted Answer · answered Aug 28 '20 at 09:32

According to this post, it seems it's not feasible

You can save an if in each loop. And df.at is also a speeder way to set a value in a dataframe.

balance = df.loc[0, 'base_contribution'] + df.loc[0, 'base_balance']
balance += balance * MONTHLY_INVESTMENT_RETURN
df.at[0, 'base_balance'] = balance

for index, row in df[1:].iterrows():
    balance = row['base_contribution'] + df.loc[row.name - 1, 'base_balance']
    balance += balance * MONTHLY_INVESTMENT_RETURN
    df.at[index, 'base_balance'] = balance

I've found interesting methods : rolling, cumsum and expanding. But nothing works here, due to the fact that we do not know the values of base_contribution on start.

score 0 · Answer 2 · answered Aug 28 '20 at 12:26

Under the assumption that the base_balance is constant.

With one intermediate step you can do what you want: Keep in mind that you can split the investment in parts and calculate the return on each part.

So the resulting value of the starting balance (CURRENT_BALANCE ) at the time of the nth month can be written as:

df["result_on_start_investment"] =  CURRENT_BALANCE * math.pow(MONTHLY_INVESTMENT_RETURN,  np.arange(len(df)) + 1)

Each month an additional amount is added to the investment. This amount gets its return every month. As a first step calculate

df["result_on_added_at_month_one"] = base_balance * math.pow(MONTHLY_INVESTMENT_RETURN,  np.arange(len(df)) + 1)

and finally as the revenue of added money at month n-1 is equal to the revenue of money added at month 1 at month 2:

df["balance"] = df["result_on_start_investment"] + df["result_on_added_at_month_one"].cumsum()

Result:

month  base_expenses  base_contribution  n  result_on_added_at_month_one  result_on_start_investment       balance
0      0          200.0              100.0  1                    100.643403                50321.701506  50422.344909
1      1          200.0              100.0  2                    101.290946                50645.472848  50847.407197
2      2          200.0              100.0  3                    101.942655                50971.327345  51275.204349
3      3          200.0              100.0  4                    102.598557                51299.278400  51705.753960
4      4          200.0              100.0  5                    103.258679                51629.339502  52139.073741

Calculate cumulative column with calculation

2 Answers2