How to vectorize in Pandas when values depend on prior values

Question

I'd like to use Pandas to implement a function that keeps a running balance, but I'm not sure it can be vectorized for speed.

In short, the problem I'm trying to solve is to keep track consumption, generation, and the "bank" of over-generation.

"consumption" means how much is used in a given time period.
"generation" is how much is generated.
When generation is greater than consumption then the homeowner can "bank" the extra generation, to be applied in subsequent time periods. they can apply it if their consumption exceeds their generation for a later month.
This will be for many entities, hence the "id" field. The time sequence is defined by "order"

Very basic example:

Month 1 generates 13 consumes 8 -> therefore banks 5
month 2 generates 8 consumes 10 -> therefore uses 2 from the the bank, and still has 3 left over
Month 3 generates 7 consumes 20 -> exhausts remaining 3 from bank, and has no bank left over.

Code import numpy as np import pandas as pd

id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 100, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)), 
       columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,91,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,8,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)), 
       columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])

def bank(df):
    # deposit all excess when generation exceeds consumption
  deposit = (df['Generate'] > df['Consume']) * (df['Generate'] - df['Consume'])
  df['end_bal'] = 0

  # beginning balance = prior period ending balance
  df = df.sort_values(by=['id', 'Order'])
  df['begin_bal'] = df['end_bal'].shift(periods=1)
  df.loc[df['Order']==1, 'begin_bal'] = 0  # set first month beginning balance of each customer to 0

  # calculate withdrawal
  df['Withdraw'] = 0
  ok_to_withdraw = df['Consume'] > df['Generate']
  df.loc[ok_to_withdraw,'Withdraw'] = np.minimum(df.loc[ok_to_withdraw, 'begin_bal'],
                                               df.loc[ok_to_withdraw, 'Consume'] -
                                               df.loc[ok_to_withdraw, 'Generate'] -
                                               deposit[ok_to_withdraw])
  # ending balance = beginning balance + deposit - withdraw
  df['end_bal'] = df['begin_bal'] + deposit - df['Withdraw'] 
  return df

df = bank(df)
df.head()
    id  Order   Consume Generate    end_bal begin_bal   Withdraw
0   1   1       10      20          10.0    0.0         0.0
1   1   2       17      16          0.0     0.0         0.0
2   1   3       20      17          0.0     0.0         0.0
3   1   4       11      21          10.0    0.0         0.0
4   1   5       17      9           0.0     0.0         0.0

df_solution.head()

    id  Order   Consume Generate    begin_bal   end_bal Withdraw
0   1   1       10      20          0           10      0
1   1   2       17      16          10          9       1
2   1   3       20      17          9           6       3
3   1   4       11      21          6           16      0
4   1   5       17      9           16          8       9

I tried to implement with various iterations of cumsum and shift . . . but the fact remains that value of each row seems like it needs to be recalculated based on the prior row, and I'm not sure this is possible to vectorize.

Code to generate some test datasets:

def generate_testdata():
  random.seed(42*42)
  np.random.seed(42*42)
  numids = 10
  numorders = 12
  id = []
  order = []
  for i in range(numids):
    id = id + [i]*numorders
    order = order + list(range(1,numorders+1))
  consume = np.random.uniform(low = 10, high = 40, size = numids*numorders)
  generate = np.random.uniform(low = 10, high = 40, size = numids*numorders)
  df = pd.DataFrame(list(zip(id, order, consume, generate)), 
           columns =['id','Order','Consume', 'Generate'])
  return df

the above code is vectorized but it doesn't return the right result -- just added the actual output vs. desired output to the code snippet — Jonathan, Jun 14 '19 at 16:31
*therefore uses 2 from the the bank, and still has 3 left over* is this your deposit? — Quang Hoang, Jun 14 '19 at 16:36
i'm using the term "bank" to represent the running balance of deposits. You can only "deposit" when you generate more than you consume. So we "deposit" 5 in the "basic example" above. you only "withdraw" when you have a balance from which to withdraw, and for that month you consumed more than you generated. So . . . your ending balance is always equal to your beginning balance, plus your deposit (if you had any) minus your withdrawal (if you had any), and it can never be below 0 — Jonathan, Jun 14 '19 at 16:45

Paul Panzer · Accepted Answer · 2019-06-19T22:03:29.253

Here is a numpy-ish approach, mostly because I'm not that familiar with pandas:

The idea is to first compute the free cumsum and then to subtract the cumulative minimum if it is negative.

import numpy as np
import pandas as pd

id = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2,2]
order = [1,2,3,4,5,6,7,8,9,18,11,12,13,14,15,1,2,3,4,5,6,7,8,9,10,11]
consume = [10, 17, 20, 11, 17, 19, 20, 10, 10, 19, 14, 12, 10, 14, 13, 19, 12, 17, 12, 18, 15, 14, 15, 20, 16, 15]
generate = [20, 16, 17, 21, 9, 13, 10, 16, 12, 10, 9, 9, 15, 13, 8, 15, 18, 16, 10, 16, 12, 12, 13, 20, 10, 15]
df = pd.DataFrame(list(zip(id, order, consume, generate)), 
           columns =['id','Order','Consume', 'Generate'])
begin_bal = [0,10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0]
end_bal = [10,9,6,16,8,2,0,6,8,0,0,0,5,4,0,0,6,5,3,1,0,0,0,0,0,0]
withdraw = [0,1,3,0,9,6,2,0,0,8,0,0,0,1,4,0,0,1,2,2,1,0,0,0,0,0]
df_solution = pd.DataFrame(list(zip(id, order, consume, generate, begin_bal, end_bal, withdraw)), 
           columns =['id','Order','Consume', 'Generate', 'begin_bal', 'end_bal', 'Withdraw'])

def f(df):
    # find block bondaries
    ids = df["id"].values
    bnds, = np.where(np.diff(ids, prepend=ids[0]-1, append=ids[-1]+1))
    # find raw balance change
    delta = (df["Generate"] - df["Consume"]).values
    # find offset, so cumulative min does not interfere across ids
    safe_total = (np.minimum(delta.min(), 0)-1) * np.diff(bnds[:-1])
    # must apply offset just before group switch, so it aligns the first
    # begin_bal, not end_bal, of the next group
    # also keep a copy of original values at switches
    delta_orig = delta[bnds[1:-1]-1]
    delta[bnds[1:-1]-1] += safe_total - np.add.reduceat(delta, bnds[:-2])
    # form free cumsum
    acc = delta.cumsum()
    # correct
    acc -= np.minimum(0, np.minimum.accumulate(acc))
    #  write solution back to df
    shft = np.empty_like(acc)
    shft[1:] = acc[:-1]
    shft[0] = 0
    # reinstate last end_bal of each group
    acc[bnds[1:-1]-1] = np.maximum(0, shft[bnds[1:-1]-1] + delta_orig)
    df["begin_bal"] = shft
    df["end_bal"] = acc
    df["Withdraw"] = np.maximum(0, df["begin_bal"] - df["end_bal"])

Test:

f(df)
df == df_solution

Prints:

      id  Order  Consume  Generate  begin_bal  end_bal  Withdraw
0   True   True     True      True       True     True      True
1   True   True     True      True       True     True      True
2   True   True     True      True       True     True      True
3   True   True     True      True       True     True      True
4   True   True     True      True       True     True     False
5   True   True     True      True       True     True      True
6   True   True     True      True       True     True      True
7   True   True     True      True       True     True      True
8   True   True     True      True       True     True      True
9   True   True     True      True       True     True      True
10  True   True     True      True       True     True      True
11  True   True     True      True       True     True      True
12  True   True     True      True       True     True      True
13  True   True     True      True       True     True      True
14  True   True     True      True       True     True      True
15  True   True     True      True       True     True      True
16  True   True     True      True       True     True      True
17  True   True     True      True       True     True      True
18  True   True     True      True       True     True      True
19  True   True     True      True       True     True      True
20  True   True     True      True       True     True      True
21  True   True     True      True       True     True      True
22  True   True     True      True       True     True      True
23  True   True     True      True       True     True      True
24  True   True     True      True       True     True      True
25  True   True     True      True       True     True      True

There is one False but that appears to be a typo in the expected output provided.

This solution is great and very fast. Thanks! And you are correct, the "False" value on the check is a false negative -- there is a typo in the expected output. — Jonathan, Jun 18 '19 at 20:41
@Jonathan glad it works for you. One little caution: If you intend to use this on very large data sets you should add a check for underflow in `safe_total` — Paul Panzer, Jun 18 '19 at 20:57
thanks for the heads up! after taking a closer look it appears as though the first "end_bal" for every id is 0. So it's not working when the first order (order=1) generates more than it consumes. If I can find a tweak to fix that I'll let you know so you can update the solution here for posterity. — Jonathan, Jun 19 '19 at 00:05
@Jonathan Good catch! I think I fixed it now, but didn't test extensively. — Paul Panzer, Jun 19 '19 at 04:30
Looks like end_bal is now set to zero for the last record of each id -- which affects the last record Withdraw as well. I will add a way to generate test datasets to the original question — Jonathan, Jun 19 '19 at 16:20
@Jonathan the line below the comment `# reinstate last end_bal of each group` is intended to fix that, you didn't miss copying that line by any chance? — Paul Panzer, Jun 19 '19 at 21:10
i'm pretty sure I copied and implemented your code as-is, but I've certainly been known to make mistakes. I just updated the test code to show the situation where we generate 100 (and consume only 13) on the final month -- meaning that we will have an ending balance left over (of 87 if I did the math correct). The algorithm shows 0 for the ending balance. — Jonathan, Jun 19 '19 at 21:42
@Jonathan you are right. I tried undoing the reset using `delta` which had the reset already baked in. I hope it is fixed now. Sorry for wasting your time. — Paul Panzer, Jun 19 '19 at 22:06
works like a charm. wish I could buy you a beverage of your choice. Thanks! And it's really fast. For 10,000 distinct ids, each with an order=24 these are the stats: 29.4 ms ± 695 µs per loop (mean ± std. dev. of 7 runs, 10 loops each) — Jonathan, Jun 19 '19 at 22:27

score 3 · Answer 2 · answered Jun 17 '19 at 20:40

Using @PaulPanzer's logic here is a pandas version.

def CalcEB(x):
    delta = x['Generate'] - x['Consume']
    return delta.cumsum() - delta.cumsum().cummin().clip(-np.inf,0)

df['end_bal'] = df.groupby('id', as_index=False).apply(CalcEB).values
df['begin_bal'] = df.groupby('id')['end_bal'].shift().fillna(0)
df['Withdraw'] = (df['begin_bal'] - df['end_bal']).clip(0,np.inf)

df_pandas = df.copy()

#Note the typo mentioned by Paul Panzer
df_pandas.reindex(df_solution.columns, axis=1) == df_solution

Output (check dataframes)

      id  Order  Consume  Generate  begin_bal  end_bal  Withdraw
0   True   True     True      True       True     True      True
1   True   True     True      True       True     True      True
2   True   True     True      True       True     True      True
3   True   True     True      True       True     True      True
4   True   True     True      True       True     True     False
5   True   True     True      True       True     True      True
6   True   True     True      True       True     True      True
7   True   True     True      True       True     True      True
8   True   True     True      True       True     True      True
9   True   True     True      True       True     True      True
10  True   True     True      True       True     True      True
11  True   True     True      True       True     True      True
12  True   True     True      True       True     True      True
13  True   True     True      True       True     True      True
14  True   True     True      True       True     True      True
15  True   True     True      True       True     True      True
16  True   True     True      True       True     True      True
17  True   True     True      True       True     True      True
18  True   True     True      True       True     True      True
19  True   True     True      True       True     True      True
20  True   True     True      True       True     True      True
21  True   True     True      True       True     True      True
22  True   True     True      True       True     True      True
23  True   True     True      True       True     True      True
24  True   True     True      True       True     True      True
25  True   True     True      True       True     True      True

score 1 · Answer 3 · answered Jun 14 '19 at 20:07

I am not sure I understood your question fully, but I am going to give a go at answering. I will re-phrase what I understood...

1. Source data

There is source data, which is a DataFrame with four columns:

id - ID number of an entity
order - indicates the sequence of periods
consume - how much was consumed during the period
generate - how much was generated during the period

2. Calculations

For each id, we want to calculate:

diff which is the difference between generate and consume for each period
opening balance which is the closing balance from the previous order
closing balance which is the cumulative sum of the diff

3. Code

I will try to solve this with groupby, cumsum and shift.

# Make sure the df is sorted
df = df.sort_values(['id','order'])
df['diff'] = df['generate'] - df['consume'] 
df['closing_balance'] = df.groupby('id')['diff'].cumsum()
# Opening balance equals the closing balance from the previous period
df['opening_balance'] = df.groupby('id')['closing_balance'].shift(1)

I definitely misunderstood something, feel free to correct me and I will try to come up with a better answer.
In particular, I wasn't sure how to handle the closing_balance going into negative numbers. Should it show negative balance? Should it nullify the "debts"?

That almost works, but the balance can never go negative. That's the crux of the problem. If I could pass parameters to cumsum such that the cumulative sum can never go below zero then I'd be set. — Jonathan, Jun 17 '19 at 13:59
In that case I think it can be solved using the answers from [this thread](https://stackoverflow.com/questions/18196811/cumsum-reset-at-nan). — pnovotnyq, Jun 17 '19 at 15:23

How to vectorize in Pandas when values depend on prior values

3 Answers3

1. Source data

2. Calculations

3. Code