
With some help from the community I have managed to get to the function below (see my previous question on building the function). I am now trying to work out how to get the resampled dates to run to the latest date that appears anywhere in either of the input data sets, for every code. Below I have included the current output I am getting and my desired output.

Input data:

Input 1 df1 - In

    date       code qty
0   2019-01-10  A   20
1   2019-01-10  B   12
2   2019-01-10  C   10
3   2019-01-11  A   2
4   2019-01-11  B   30
5   2019-01-11  C   2
7   2019-01-12  A   4
8   2019-01-12  B   6
11  2019-01-13  A   10
12  2019-01-13  B   12
13  2019-01-13  C   1

Input 2 df2 - Outbound

    date       code qty
0   2019-01-11  A   5
1   2019-01-11  B   1
2   2019-01-11  C   3
3   2019-01-12  A   100
6   2019-01-13  B   1
7   2019-01-13  C   1
8   2019-01-15  A   1
9   2019-01-16  B   1

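For reference, the two inputs can be rebuilt like this (a minimal sketch; values are copied from the tables above, with the dates parsed to datetime so the daily resample below works):

import pandas as pd

df1 = pd.DataFrame({
    'date': ['2019-01-10', '2019-01-10', '2019-01-10',
             '2019-01-11', '2019-01-11', '2019-01-11',
             '2019-01-12', '2019-01-12',
             '2019-01-13', '2019-01-13', '2019-01-13'],
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B', 'A', 'B', 'C'],
    'qty':  [20, 12, 10, 2, 30, 2, 4, 6, 10, 12, 1]})

df2 = pd.DataFrame({
    'date': ['2019-01-11', '2019-01-11', '2019-01-11', '2019-01-12',
             '2019-01-13', '2019-01-13', '2019-01-15', '2019-01-16'],
    'code': ['A', 'B', 'C', 'A', 'B', 'C', 'A', 'B'],
    'qty':  [5, 1, 3, 100, 1, 1, 1, 1]})

# parse to datetime so that resample('D') works below
df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
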
Existing Code:

import numpy as np
from numba import njit

@njit
def poscumsum(x):
    total = 0
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result

a = df1.set_index(['code', 'date'])
b = df2.set_index(['code', 'date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
df3 = df3.groupby('code').resample('D', level='date').sum()
df3['qty'] = df3.groupby('code')['qty'].transform(
    lambda g: poscumsum(g.values))

Current Output:

Each code is only represented for the dates on which it appears in the In or Out dfs.

      code  date        qty
    0   A   2019-01-10  20
    1   A   2019-01-11  17
    2   A   2019-01-12  0
    3   A   2019-01-13  10
    4   A   2019-01-14  10
    5   A   2019-01-15  9
    6   B   2019-01-10  12
    7   B   2019-01-11  41
    8   B   2019-01-12  47
    9   B   2019-01-13  58
    10  B   2019-01-14  58
    11  B   2019-01-15  58
    12  B   2019-01-16  57
    13  C   2019-01-10  10
    14  C   2019-01-11  9
    15  C   2019-01-12  9
    16  C   2019-01-13  9

Desired Output:

Each code is represented for every date between 2019-01-10 and 2019-01-16.

       code date        qty
    0   A   2019-01-10  20
    1   A   2019-01-11  17
    2   A   2019-01-12  0
    3   A   2019-01-13  10
    4   A   2019-01-14  10
    5   A   2019-01-15  9
    6   A   2019-01-16  9
    7   B   2019-01-10  12
    8   B   2019-01-11  41
    9   B   2019-01-12  47
    10  B   2019-01-13  58
    11  B   2019-01-14  58
    12  B   2019-01-15  58
    13  B   2019-01-16  57
    14  C   2019-01-10  10
    15  C   2019-01-11  9
    16  C   2019-01-12  9
    17  C   2019-01-13  9
    18  C   2019-01-14  9
    19  C   2019-01-15  9
    20  C   2019-01-16  9   
  • your input data only goes to the 13th while desired output goes to the 16th – David Erickson Dec 23 '20 at 22:54
  • @DavidErickson There are 2 input dfs 1 runs to the 13th and 1 to the 16th – CTD91 Dec 23 '20 at 22:56
  • okay `df2 - Out` is very confusing :) I thought it was expected output, but `df2 - Out` is actually input data. – David Erickson Dec 23 '20 at 22:57
  • @DavidErickson I have updated my question hopefully it is a bit clearer now :) – CTD91 Dec 23 '20 at 22:59
  • Why the desired output has this: `2 A 2019-01-12 0`? – Dani Mesejo Dec 23 '20 at 23:03
  • @DaniMesejo the desired output is a rolling total of df1 - df2 using the output from the previous day as the base, but the total cannot go below 0. In this case A would have been 17 on the 11-01, and adding df1 - df2 for the 12-01 gives 17 + 4 - 100 = -79, which is set to 0 to avoid the negative value. Hope this makes sense :) – CTD91 Dec 23 '20 at 23:10
  • you could add a reference to your previous question... (Just to save everyone's time) – Pierre D Dec 24 '20 at 00:17
  • @PierreD I have updated my question with a link to the original question. I hope this helps. – CTD91 Dec 24 '20 at 00:25
  • hey, no worries -- see a version that goes 2D (since all dates will be the same) and has a 2D version of `cumsum_capped()`. – Pierre D Dec 24 '20 at 00:42

2 Answers


Ok, here is a 2D version of poscumsum (and generalized to cap the running sum at min and/or max):

@njit
def cumsum_capped_2d(x, xmin=None, xmax=None):
    # column-wise running sum of a 2D array, optionally clipped so that
    # it never drops below xmin or rises above xmax at any step
    n, m = x.shape
    result = np.empty_like(x)
    if n == 0:
        return result
    total = np.zeros_like(x[0])
    for i in range(n):
        total += x[i]
        if xmin is not None:
            total[total < xmin] = xmin
        if xmax is not None:
            total[total > xmax] = xmax
        result[i] = total
    return result
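
For instance, a quick illustration on a small toy array (made-up numbers, just to show the column-wise capping at xmin=0):

>>> arr = np.array([[  5., -2.],
...                 [-10.,  4.],
...                 [  3., -1.]])
>>> cumsum_capped_2d(arr, xmin=0)
array([[5., 0.],
       [0., 4.],
       [3., 3.]])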

And here is how to use it (now that all codes should span the same date range); the good news is that there is no more groupby, so it is faster than ever. Note that pivot leaves NaN for the (date, code) pairs that are absent from a frame, so those are filled with 0 before subtracting (the date column is assumed to already be datetime):

a = df1.pivot(index='date', columns='code', values='qty').fillna(0)
b = df2.pivot(index='date', columns='code', values='qty').fillna(0)
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0)).resample('D').sum()
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)

Or, in two (convoluted) lines:

df3 = (df1.set_index(['date', 'code']).subtract(df2.set_index(['date', 'code']), fill_value=0)
       .unstack('code', fill_value=0).resample('D').sum())
df3.values[:, :] = cumsum_capped_2d(df3.values, xmin=0)
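
The subtract(..., fill_value=0) and unstack(..., fill_value=0) calls here are what handle the (date, code) pairs missing from one or both inputs: they are treated as a quantity of 0 instead of NaN, so the capped cumulative sum sees the right daily deltas.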

On your data:

>>> df3
code           A     B     C
date                        
2019-01-10  20.0  12.0  10.0
2019-01-11  17.0  41.0   9.0
2019-01-12   0.0  47.0   9.0
2019-01-13  10.0  58.0   9.0
2019-01-14  10.0  58.0   9.0
2019-01-15   9.0  58.0   9.0
2019-01-16   9.0  57.0   9.0

Of course, you are free to stack back into a skinny df, rename, re-order, reset the index, etc. For example, to match your desired output:

>>> df3.stack().rename('qty').swaplevel(0, 1).sort_index().reset_index()
   code       date   qty
0     A 2019-01-10  20.0
1     A 2019-01-11  17.0
2     A 2019-01-12   0.0
3     A 2019-01-13  10.0
4     A 2019-01-14  10.0
5     A 2019-01-15   9.0
6     A 2019-01-16   9.0
7     B 2019-01-10  12.0
8     B 2019-01-11  41.0
9     B 2019-01-12  47.0
10    B 2019-01-13  58.0
11    B 2019-01-14  58.0
12    B 2019-01-15  58.0
13    B 2019-01-16  57.0
14    C 2019-01-10  10.0
15    C 2019-01-11   9.0
16    C 2019-01-12   9.0
17    C 2019-01-13   9.0
18    C 2019-01-14   9.0
19    C 2019-01-15   9.0
20    C 2019-01-16   9.0

Here is another approach using reindex. Generate a daily date_range spanning from the earliest to the latest date seen across all groups (dates), get the unique codes, and build a MultiIndex to reindex by with pd.MultiIndex.from_product(). Then reindex and forward-fill with ffill():

d = pd.to_datetime(df3.index.get_level_values(1))
dates = pd.date_range(d.min(), d.max(), freq='1d')
codes = df3.index.get_level_values(0).unique()
idx = pd.MultiIndex.from_product([codes, dates], names=['code', 'date'])
df3 = df3.reindex(idx).reset_index().ffill()

Full code and output:

# original code
import numpy as np
import pandas as pd
from numba import njit

@njit
def poscumsum(x):
    total = 0
    result = np.empty(x.shape)
    for i, y in enumerate(x):
        total += y
        if total < 0:
            total = 0
        result[i] = total
    return result

df1['date'] = pd.to_datetime(df1['date'])
df2['date'] = pd.to_datetime(df2['date'])
a = df1.set_index(['code', 'date'])
b = df2.set_index(['code', 'date'])
idx = a.index.union(b.index).sort_values()
df3 = (a.reindex(idx, fill_value=0) - b.reindex(idx, fill_value=0))
df3 = df3.groupby('code').resample('D', level='date').sum()
df3['qty'] = df3.groupby('code')['qty'].transform(
    lambda g: poscumsum(g.values))

# code I added
d = pd.to_datetime(df3.index.get_level_values(1))
dates = pd.date_range(d.min(), d.max(), freq='1d')
codes = df3.index.get_level_values(0).unique()
idx = pd.MultiIndex.from_product([codes, dates], names=['code', 'date'])
df3 = df3.reindex(idx).reset_index().ffill()
df3

Out[1]: 
   code       date   qty
0     A 2019-01-10 20.0
1     A 2019-01-11 17.0
2     A 2019-01-12  0.0
3     A 2019-01-13 10.0
4     A 2019-01-14 10.0
5     A 2019-01-15  9.0
6     A 2019-01-16  9.0
7     B 2019-01-10 12.0
8     B 2019-01-11 41.0
9     B 2019-01-12 47.0
10    B 2019-01-13 58.0
11    B 2019-01-14 58.0
12    B 2019-01-15 58.0
13    B 2019-01-16 57.0
14    C 2019-01-10 10.0
15    C 2019-01-11  9.0
16    C 2019-01-12  9.0
17    C 2019-01-13  9.0
18    C 2019-01-14  9.0
19    C 2019-01-15  9.0
20    C 2019-01-16  9.0