
I have a pandas dataframe with two columns like this,

Item    Value
0   A   7
1   A   2
2   A   -6
3   A   -70
4   A   8
5   A   0

I want to compute a cumulative sum over the column Value, but if the running sum becomes negative while accumulating, I want to reset it to 0.

I am currently using the loop shown below to perform this,

sum_ = 0
cumsum = []

for val in sample['Value'].values:
    sum_ += val
    if sum_ < 0:
        sum_ = 0
    cumsum.append(sum_)

print(cumsum) # [7, 9, 3, 0, 8, 8]

I am looking for a more efficient way to perform this in pure pandas.

Sreeram TP
    I don't think we have a pandas method that can achieve this – BENY Aug 15 '19 at 13:49
  • I was thinking the same and finally settled on the loop solution I posted in the question. I was wondering if I am missing some pandas trick that could do the magic – Sreeram TP Aug 15 '19 at 13:56
  • What you did is close to what I can offer; the only difference is that I might use numba – BENY Aug 15 '19 at 13:57
  • I am not familiar with numba. How much of a performance improvement (in terms of time) can I expect? If you can post the code as an answer I will check for myself and let you know whether it will be good for me. – Sreeram TP Aug 15 '19 at 13:58
  • In terms of performance, pure Python is not bad :-) – BENY Aug 15 '19 at 14:00
  • Can you post the code if possible? – Sreeram TP Aug 15 '19 at 14:04
  • 3
    https://stackoverflow.com/questions/56904390/restart-cumsum-and-get-index-if-cumsum-more-than-value/56904899#56904899 you can get some solution from there , even that is not 100% same – BENY Aug 15 '19 at 14:10
  • 2
    Great question, posted this as an [improvement suggestion](https://github.com/pandas-dev/pandas/issues/27935) to `pandas` on GitHub @WeNYoBen – Erfan Aug 15 '19 at 14:39
  • @Erfan maybe it is better to add upper and lower bounds :-) like clip – BENY Aug 15 '19 at 14:44
  • Good suggestion, made an edit @WeNYoBen – Erfan Aug 15 '19 at 14:50

2 Answers


A slight modification; note that this method is slower than the numba solution below.

import numpy as np

# binary ufunc: add two values, resetting a negative running total to 0
sumlm = np.frompyfunc(lambda a, b: 0 if a + b < 0 else a + b, 2, 1)
newx = sumlm.accumulate(df.Value.values, dtype=object)  # note: the np.object alias is removed in NumPy >= 1.24
newx
Out[147]: array([7, 9, 3, 0, 8, 8], dtype=object)
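For reference, the accumulated result can be assigned straight back onto the frame. A self-contained sketch reproducing the question's data (the column name `CumsumReset` is just an illustration, not from the thread):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"Item": list("AAAAAA"),
                   "Value": [7, 2, -6, -70, 8, 0]})

# binary ufunc: add two values, clamping a negative running total to 0
sumlm = np.frompyfunc(lambda a, b: max(a + b, 0), 2, 1)

# accumulate over the column; object dtype keeps plain Python ints,
# so cast back to int before storing
df["CumsumReset"] = sumlm.accumulate(df["Value"].to_numpy(),
                                     dtype=object).astype(int)

print(df["CumsumReset"].tolist())  # [7, 9, 3, 0, 8, 8]
```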

numba solution

from numba import njit
@njit
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result
cumli(df.Value.values,0)
Out[166]: [7, 9, 3, 0, 8, 8]
BENY
  • Awesome, works as expected. I will get back to you after testing the time. – Sreeram TP Aug 15 '19 at 14:28
  • Was this method ever tested against the solutions of divakar and pirsquared? Just wondering in terms of speed – Erfan Aug 15 '19 at 14:28
  • 2
    @Erfan No sure about the speed but since I list the link above , so op can pick the one he want :-) – BENY Aug 15 '19 at 14:29
  • @Erfan It tests out slower than numba; I will attach the numba solution – BENY Aug 15 '19 at 14:32
  • I am comparing your solution with the numba solution in the link you posted (in terms of time) – Sreeram TP Aug 15 '19 at 14:35
  • @SreeramTP Divakar's method is the best in terms of timing – BENY Aug 15 '19 at 14:39
  • Yeah, I could see that. The numba method is a bit faster than the pure Python loop and the solution you posted. If you could add the numba solution I can go ahead and accept the answer – Sreeram TP Aug 15 '19 at 14:45
  • 1
    @SreeramTP here you go , I would say my original method is more readable , but slowest . – BENY Aug 15 '19 at 14:51
  • Yeah true. But it serves the purpose. – Sreeram TP Aug 15 '19 at 14:53
  • 1
    You can get approx. a factor of 10 speedup if you use arrays instead of lists. eg. `result = np.empty_like(x); idx=0` write the result to the array like this `result[idx]=total; idx+=1` and shrink the array at the end `return result[:idx]` – max9111 Aug 15 '19 at 17:36
  • @max9111 Can you elaborate? Maybe consider adding an answer. – Sreeram TP Aug 16 '19 at 05:34

This is really only a comment on WeNYoBen's answer.

If you can avoid lists, it is usually advisable to do so.

Example

from numba import njit
import numpy as np

# with lists
@njit()
def cumli(x, lim):
    total = 0
    result = []
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0
        result.append(total)
    return result

# without lists
@njit()
def cumli_2(x, lim):
    total = 0.
    result = np.empty_like(x)
    for i, y in enumerate(x):
        total += y
        if total < lim:
            total = 0.
        result[i]=total
    return result

Timings

Without Numba (comment out @njit()):

x=(np.random.rand(1_000)-0.5)*5

  %timeit a=cumli(x, 0.)
  220 µs ± 2.25 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
  %timeit a=cumli_2(x, 0.)
  227 µs ± 1.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

There is no difference between using lists or arrays. But that's not the case if you JIT-compile this function.

With Numba:

  %timeit a=cumli(x, 0.)
  27.4 µs ± 210 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
  %timeit a=cumli_2(x, 0.)
  2.96 µs ± 32.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

Even in somewhat more complicated cases (final array size unknown, or only the maximum array size known), it often makes sense to allocate an array and shrink it at the end, or in simple cases even to run the algorithm once just to determine the final array size and then do the real calculation.
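A minimal sketch of that allocate-then-shrink pattern (hypothetical `positive_running_sums` helper, not from the thread; it keeps only positive running sums, so the output length is unknown up front):

```python
import numpy as np

def positive_running_sums(x):
    # Worst case: every element survives, so allocate len(x) slots up front
    out = np.empty(len(x), dtype=np.float64)
    idx = 0
    total = 0.0
    for y in x:
        total += y
        if total > 0.0:        # keep only positive running totals
            out[idx] = total
            idx += 1
    return out[:idx]           # shrink: view of the filled prefix

vals = np.array([7.0, 2.0, -6.0, -70.0, 8.0, 0.0])
print(positive_running_sums(vals).tolist())  # [7.0, 9.0, 3.0]
```

The same trick jit-compiles cleanly with numba, since the function only touches scalars and a preallocated array.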

max9111