
I am looking to build a function that creates a rolling total by code by day for the DataFrames below. For each code on a given date, the Out quantity is subtracted from the In quantity; this subtotal is then added to the previous day's running total, but the total must stay >= 0 (I have included an example of this in the Desired Output below).

Below is an example of my inputs and the function I am using, along with an example of my desired output.

df1 - In

s = """        Date    Code    Quantity
0   10/01/2019  A   20
3   11/01/2019  A   2
7   12/01/2019  A   4
11  13/01/2019  A   10
"""

df2 - Out

s = """    Date    Code    Quantity
0   11/01/2019  A   5
3   12/01/2019  A   100
4   15/01/2019  A   1
6   16/01/2019  A   2
"""

Code

df3 = df1.merge(df2, how='outer', on=['Date', 'Code']).fillna(0)
df3['qty1'] = df3['Quantity_x'] - df3['Quantity_y']
df3['qty'] = 0
def final_adder(x):
    # stub: this is where the running-total logic should go
    print(x)
    return x
df_final = df3.groupby(['Code']).apply(final_adder)
df_final['qty'] = df_final['qty'].clip(lower=0)
df_final.drop(['Quantity_x', 'Quantity_y', 'qty1'], inplace=True, axis=1)


This currently produces the following (the qty column is never populated):

          Date Code  Quantity_x  Quantity_y  qty1  qty
0   10/01/2019    A        20.0         0.0  20.0    0
3   11/01/2019    A         2.0         5.0  -3.0    0
7   12/01/2019    A         4.0       100.0 -96.0    0
11  13/01/2019    A        10.0         0.0  10.0    0

Desired Output

s = """        Date    Code    Quantity
0   10/01/2019  A   20
3   11/01/2019  A   17
7   12/01/2019  A   0
11  13/01/2019  A   10
12  14/01/2019  A   10
15  15/01/2019  A   9
16  16/01/2019  A   7
"""

1 Answer


There is a whole sub-genre of questions on SO involving cumulative operations with limits (e.g., "reset to zero when the cumulative sum becomes negative"). This is not the same as a cumulative operation with known reset points (e.g., from another column, or where there are NaNs, etc.), because the condition involves the cumulative value itself.

There is no clean way in current pandas or numpy to do this in a vectorized way.
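
To see why, here is a minimal sketch (plain pandas, using the daily In - Out deltas from the question's data) showing that a plain cumsum followed by a clip loses information: once the running total goes negative, later values are still computed from the unclipped sum instead of restarting from zero:

import pandas as pd

deltas = pd.Series([20, -3, -96, 10])  # daily In - Out for code A
deltas.cumsum().clip(lower=0)          # -> 20, 17, 0, 0: the trailing 10 is lost
# the desired result is 20, 17, 0, 10, because the total must restart from 0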

The best (fastest) way I am aware of is this SO answer involving numba. Slightly modified and adapted to your problem:

import numpy as np
from numba import njit

@njit
def poscumsum(x):
    # running total that is floored at 0 after every step
    total = 0.0  # start as a float so numba types the accumulator consistently
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0.0
        result[i] = total
    return result
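
A quick sanity check on the deltas from the question's data:

>>> poscumsum(np.array([20., -3., -96., 10.]))
array([20., 17.,  0., 10.])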

With this, you can do:

a = df1.set_index(['Code', 'Date'])
b = df2.set_index(['Code', 'Date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
# optional: resample Date to daily within each group:
df3 = df3.groupby('Code').resample('D', level='Date').sum()
df3['Quantity'] = df3.groupby('Code')['Quantity'].transform(
    lambda g: poscumsum(g.values))
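
Note that resample('D', ...) requires the Date level to hold real datetimes. If Date is stored as strings like those in the question, it needs to be parsed first; a day-first format is assumed here based on the sample data:

df1['Date'] = pd.to_datetime(df1['Date'], format='%d/%m/%Y')
df2['Date'] = pd.to_datetime(df2['Date'], format='%d/%m/%Y')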

On the data provided in the question:

>>> df3
                 Quantity
Code Date                
A    2019-01-10        20
     2019-01-11        17
     2019-01-12         0
     2019-01-13        10
     2019-01-14        10
     2019-01-15         9
     2019-01-16         7

If you prefer, you can also use merge. Here is an example where all the intermediate results are kept (for forensic analysis):

df3 = df1.merge(df2, on=['Code', 'Date'], how='outer', sort=True).fillna(0)
# optional: resample Date to daily within each group:
df3 = df3.set_index(['Code', 'Date']).groupby('Code').resample('D', level='Date').sum()
df3['diff'] = df3['Quantity_x'] - df3['Quantity_y']
df3['cumdiff'] = df3.groupby('Code')['diff'].transform(
    lambda g: poscumsum(g.values))

df3
# out:
                 Quantity_x  Quantity_y  diff  cumdiff
Code Date                                             
A    2019-01-10        20.0         0.0  20.0     20.0
     2019-01-11         2.0         5.0  -3.0     17.0
     2019-01-12         4.0       100.0 -96.0      0.0
     2019-01-13        10.0         0.0  10.0     10.0
     2019-01-14         0.0         0.0   0.0     10.0
     2019-01-15         0.0         1.0  -1.0      9.0
     2019-01-16         0.0         2.0  -2.0      7.0
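
If you only need the final series in the shape of the desired output, the helper columns can be dropped at the end, e.g. (a final cleanup sketch on the frame built above):

df3 = df3[['cumdiff']].rename(columns={'cumdiff': 'Quantity'})
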
  • Thanks for this, it works perfectly when I have a single 'code'; however, if I introduce a second 'code', 'B', I get the following error: ValueError: cannot handle a non-unique multi-index! – CTD91 Dec 22 '20 at 18:04
  • not sure what you mean by "if I introduce second code". I tested with multiple codes mixed together and it works just fine (it excludes any `code, date` not in `df1` though). You may want to update your question with extra examples (ones where the current proposed answer doesn't fit your needs). – Pierre D Dec 22 '20 at 18:35
  • I have fixed the error now, I have just edited my input data for df2 and my desired output slightly. Where a code has a 'qty' I am looking for it to appear on each future day even if it doesn't appear in the in or out DataFrames on each day. I think the updated question now shows this. I hope that makes sense. – CTD91 Dec 22 '20 at 20:13
  • 1
    I see: you want the whole outer join. No problem. Updated answer. – Pierre D Dec 22 '20 at 23:32
  • Thanks, that looks to be working much better now. The only remaining question: where there are no Ins or Outs on 14-01-19, I was hoping that day would still appear in my output with the same cumdiff as the previous day (in this case 10). Is this possible? – CTD91 Dec 23 '20 at 09:06
  • 1
    certainly, see the "optional resampling" in the updated answer: just insert a daily resampling of Date for each Code group, _just before_ doing the cumulative sum: `df3 = df3.groupby('Code').resample('D', level='Date').sum()`. – Pierre D Dec 23 '20 at 13:42