
I have a pandas dataframe with two columns like this,

Item    Value
0   A   7
1   A   2
2   A   -6
3   A   -70
4   A   8
5   A   0

I want to compute a cumulative sum over the column Value, but if the running sum becomes negative while accumulating, I want to reset it to 0.

I am currently using the loop shown below to perform this,

sum_ = 0
cumsum = []

for val in sample['Value'].values:
    sum_ += val
    if sum_ < 0:
        sum_ = 0
    cumsum.append(sum_)

print(cumsum) # [7, 9, 3, 0, 8, 8]

I am looking for a more efficient way to perform this in pure pandas.

Sreeram TP
    I don't think we have a pandas method that can achieve this – BENY Aug 15 '19 at 13:49
  • I was thinking the same and finally settled on the loop solution I posted in the question. I was wondering if I am missing some pandas trick that could do the magic – Sreeram TP Aug 15 '19 at 13:56
  • What you did is close to what I can offer; the only difference is that I might use numba – BENY Aug 15 '19 at 13:57
  • I am not familiar with numba. How much of a performance improvement (in terms of time) can I expect? If you can post the code as an answer I will check for myself and let you know whether it will be good for me. – Sreeram TP Aug 15 '19 at 13:58
  • In terms of performance, pure Python is not bad :-) – BENY Aug 15 '19 at 14:00
  • Can you post the code if possible? – Sreeram TP Aug 15 '19 at 14:04
  • 3
    https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value/56904899#56904899 you can get some solution from there , even that is not 100% same – BENY Aug 15 '19 at 14:10
  • 2
    Great question, posted this as an [improvement suggestion](https://github.com/pandas-dev/pandas/issues/27935) to `pandas` on GitHub @WeNYoBen – Erfan Aug 15 '19 at 14:39
  • @Erfan maybe it is better to add upper and lower bounds :-) like clip – BENY Aug 15 '19 at 14:44
  • Good suggestion, made an edit @WeNYoBen – Erfan Aug 15 '19 at 14:50

2 Answers


A slight modification; note that this method is slower than the numba solution below.

import numpy as np

# binary ufunc: add two values, resetting a negative running total to 0
sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
newx = sumlm.accumulate(df.Value.values, dtype=object)  # note: the np.object alias is removed in NumPy >= 1.24
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)
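For reference, the accumulated result can be assigned straight back onto the frame. A self-contained sketch reproducing the question's data (the column name `CumsumReset` is just an illustration, not from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Item": list("AAAAAA"),
                   "Value": [7, 2, -6, -70, 8, 0]})

# binary ufunc: add two values, clamping a negative running total to 0
sumlm = np.frompyfunc(lambda a, b: max(a + b, 0), 2, 1)

# accumulate over the column; object dtype keeps plain Python ints,
# so cast back to int before storing
df["CumsumReset"] = sumlm.accumulate(df["Value"].to_numpy(),
                                     dtype=object).astype(int)

print(df["CumsumReset"].tolist())  # [7, 9, 3, 0, 8, 8]
```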

numba solution

from numba import njit
@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result
cumli(df.Value.values,0)
Out[166]: [7, 9, 3, 0, 8, 8]
BENY
  • Awesome, works as expected. I will get back to you after testing the time. – Sreeram TP Aug 15 '19 at 14:28
  • Was this method ever tested against the solutions of divakar and pirsquared? Just wondering in terms of speed – Erfan Aug 15 '19 at 14:28
  • 2
    @Erfan No sure about the speed but since I list the link above , so op can pick the one he want :-) – BENY Aug 15 '19 at 14:29
  • @Erfan It tests out slower than numba; I will attach the numba solution – BENY Aug 15 '19 at 14:32
  • I am comparing your solution with the numba solution in the link you posted (in terms of time) – Sreeram TP Aug 15 '19 at 14:35
  • @SreeramTP Divakar's method is the best in terms of timing – BENY Aug 15 '19 at 14:39
  • Yeah, I could see that. The numba method is a bit faster than the pure Python loop and the solution you posted. If you could add the numba solution I can go ahead and accept the answer – Sreeram TP Aug 15 '19 at 14:45
  • 1
    @SreeramTP here you go , I would say my original method is more readable , but slowest . – BENY Aug 15 '19 at 14:51
  • Yeah true. But it serves the purpose. – Sreeram TP Aug 15 '19 at 14:53
  • 1
    You can get approx. a factor of 10 speedup if you use arrays instead of lists. eg. `result = np.empty_like(x); idx=0` write the result to the array like this `result[idx]=total; idx+=1` and shrink the array at the end `return result[:idx]` – max9111 Aug 15 '19 at 17:36
  • @max9111 Can you elaborate? Maybe consider adding an answer. – Sreeram TP Aug 16 '19 at 05:34

This is really only a comment on WeNYoBen's answer.

If you can avoid lists, it is usually advisable to do so.

Example

from numba import njit
import numpy as np

# with lists
@njit()
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

# without lists
@njit()
def cumli_2(x, lim):
    total = 0.
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0.
        result[i]=total
    return result

Timings

Without Numba (comment out @njit()):

x=(np.random.rand(1_000)-0.5)*5

  %timeit a=cumli(x, 0.)
  220 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  %timeit a=cumli_2(x, 0.)
  227 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

There is no difference between using lists or arrays. But that's not the case if you JIT-compile this function.

With Numba:

  %timeit a=cumli(x, 0.)
  27.4 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  %timeit a=cumli_2(x, 0.)
  2.96 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even in somewhat more complicated cases (final array size unknown, or only the maximum array size known), it often makes sense to allocate an array and shrink it at the end, or in simple cases even to run the algorithm once just to determine the final array size and then do the real calculation.
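A minimal sketch of that allocate-then-shrink pattern (hypothetical `positive_running_sums` helper, not from the thread; it keeps only positive running sums, so the output length is unknown up front):

```python
import numpy as np

def positive_running_sums(x):
    # Worst case: every element survives, so allocate len(x) slots up front
    out = np.empty(len(x), dtype=np.float64)
    idx = 0
    total = 0.0
    for y in x:
        total += y
        if total > 0.0:        # keep only positive running totals
            out[idx] = total
            idx += 1
    return out[:idx]           # shrink: view of the filled prefix

vals = np.array([7.0, 2.0, -6.0, -70.0, 8.0, 0.0])
print(positive_running_sums(vals).tolist())  # [7.0, 9.0, 3.0]
```

The same trick jit-compiles cleanly with numba, since the function only touches scalars and a preallocated array.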

max9111