0

I'm having trouble coming up with a way to perform a cumulative sum on a column and creating a flag once it hits a certain value.

So given a dataframe:

df = pd.DataFrame([[5,1],[6,1],[30,1],[170,0],[5,1],[10,1]],columns = ['a','b'])

     a  b
0    5  1
1    6  1
2   30  1
3  170  0
4    5  1
5   10  1

For column A, I want to perform the cumulative sum and set the "Flag" column value to 1 if that max value is reached. Upon reaching that max value, it will reset to 0. In this case, the max value is 40. Any cumulative sum over 40 will trigger the reset

Desired Output

     a  b  Flag
0    5  1     0
1   11  1     0
2   41  1     1
3  170  0     1
4    5  1     0
5   15  1     0

Any help would be appreciated!

Jihong
  • 127
  • 6
  • 3
    this isnt possible by vectorization I think: look at [Restart cumsum and get index if cumsum more than value](https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value#56904650) – anky Feb 19 '21 at 16:19
  • Got it. Thank you for the reference! – Jihong Feb 19 '21 at 16:50

1 Answers1

1

"Ordinary" cumsum() is here useless, as this function "doesn't know" where to restart summation.

You can do it with the following custom function:

def myCumSum(x, thr):
    if myCumSum.prev >= thr:
        myCumSum.prev = 0
    myCumSum.prev += x
    return myCumSum.prev

This function is "with memory" (from the previous call) - prev, so there is a way to "know" where to restart.

To speed up the execution, define a vectorized version of this function:

myCumSumV = np.vectorize(myCumSum, otypes=[np.int], excluded=['thr'])

Then execute:

threshold = 40
myCumSum.prev = 0  # Set the "previous" value
# Replace "a" column with your cumulative sum
df.a = myCumSumV(df.a.values, threshold)
df['flag'] = df.a.ge(threshold).astype(int)  # Compute "flag" column

The result is:

     a  b  flag
0    5  1     0
1   11  1     0
2   41  1     1
3  170  0     1
4    5  1     0
5   15  1     0
Valdi_Bo
  • 30,023
  • 4
  • 23
  • 41