Python Pandas groupby limited cumulative sum

Question

This is my dataframe

import pandas as pd
import numpy as np

data = {'c1':[-1,-1,1,1,np.nan,1,1,1,1,1,np.nan,-1],\
        'c2':[1,1,1,-1,1,1,-1,-1,1,-1,1,np.nan]}

index = pd.date_range('2000-01-01','2000-03-20', freq='W')

df = pd.DataFrame(index=index, data=data)


>>> df
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -1.0  1.0
2000-01-16  1.0  1.0
2000-01-23  1.0 -1.0
2000-01-30  NaN  1.0
2000-02-06  1.0  1.0
2000-02-13  1.0 -1.0
2000-02-20  1.0 -1.0
2000-02-27  1.0  1.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  1.0
2000-03-19 -1.0  NaN

and this is a cumulative sum by month

df2 = df.groupby(df.index.to_period('m')).cumsum()

>>> df2
             c1   c2
2000-01-02 -1.0  1.0
2000-01-09 -2.0  2.0
2000-01-16 -1.0  3.0
2000-01-23  0.0  2.0
2000-01-30  NaN  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0 -1.0
2000-02-27  4.0  0.0
2000-03-05  1.0 -1.0
2000-03-12  NaN  0.0
2000-03-19  0.0  NaN

what I need more is to ignore the increment if it is more than 3 or less than 0, something like this function

def cumsum2(arr, low=-float('Inf'), high=float('Inf')):
    arr2 = np.copy(arr)
    sm = 0
    for index, elem in np.ndenumerate(arr):
        if not np.isnan(elem):
            sm += elem
            if sm > high:
                sm = high
            if sm < low:
                sm = low
        arr2[index] = sm
    return arr2

the desired result is

             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

I tried to use apply and lambda but doesn't work and it's slow for large dataframe.

df.groupby(df.index.to_period('m')).apply(lambda x: cumsum2(x, 0, 3))

What's wrong? Is there a faster way?

Is the issue still relevant? I wanted to suggest a working solution. — RomanPerekhrest, Jan 30 '23 at 10:04

score 2 · Answer 1 · answered Jan 28 '23 at 23:13

You can try accumulate from itertools and use a custom function to clip values between 0 and 3:

from itertools import accumulate

lb = 0  # lower bound
ub = 3  # upper bound

def cumsum2(dfm):
    def clip(bal, val):
        return np.clip(bal + val, lb, ub)
    return list(accumulate(dfm.to_numpy(), clip, initial=0))[1:]

out = df.fillna(0).groupby(df.index.to_period('m')).transform(cumsum2)

Output:

>>> out
             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

Thanks for your hint but I get 2 errors TypeError: accumulate() takes at most 2 arguments (3 given) TypeError: Transform function invalid for data types I try to update my modules. — dan577, Jan 29 '23 at 21:17

score 1 · Answer 2 · answered Jan 30 '23 at 11:07

In such sophisticated case we can resort to pandas.Series.rolling with window of size 2 piping each window to a custom function to keep each interim accumulation within a certain threshold:

def cumsum_tsh(x, low=-float('Inf'), high=float('Inf')):
    def f(w):
        w[-1] = min(high, max(low, w[0] if w.size == 1 else w[0] + w[1]))
        return w[-1]
    return x.apply(lambda s: s.rolling(2, min_periods=1).apply(f))

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
    .apply(lambda x: cumsum_tsh(x, 0, 3))

             c1   c2
2000-01-02  0.0  1.0
2000-01-09  0.0  2.0
2000-01-16  1.0  3.0
2000-01-23  2.0  2.0
2000-01-30  2.0  3.0
2000-02-06  1.0  1.0
2000-02-13  2.0  0.0
2000-02-20  3.0  0.0
2000-02-27  3.0  1.0
2000-03-05  1.0  0.0
2000-03-12  1.0  1.0
2000-03-19  0.0  1.0

score 0 · Answer 3 · answered Feb 01 '23 at 16:06

I've tried various solutions, for some reason the fastest is manipulating single columns of frames created by groupby. This is the code if it can be useful to anyone

def cumsum2(frame, low=-float('Inf'), high=float('Inf')):
    for col in frame.columns:
        sm = 0
        xs = []
        for e in frame[col]:
            sm += e
            if sm > high:
                sm = high
            if sm < low:
                sm = low
            xs.append(sm)
        frame[col] = xs
    return frame

res = df.fillna(0).groupby(df.index.to_period('m'), group_keys=False)\
                                            .apply(cumsum2,0,3)

Python Pandas groupby limited cumulative sum

3 Answers3

Linked