2

I have a dataframe which is called "df". It looks like this:

    a
0   2   
1   3   
2   0   
3   5   
4   1   
5   3   
6   1   
7   2   
8   2   
9   1   

I would like to produce a cummulative sum column which:

  • Sums the contents of column "a" cumulatively;
  • Until it gets a sum of "5";
  • Resets the cumsum total, to 0, when it reaches a sum of "5", and continues with the summing process;

I would like the dataframe to look like this:

    a   a_cumm_sum
0   2   2
1   3   5
2   0   0
3   5   5
4   1   1
5   3   4
6   1   5
7   2   2
8   2   4
9   1   5

In the dataframe, the column "a_cumm_summ" contains the results of the cumulative sum.

Does anyone know how I can achieve this? I have hunted through the forums. And saw similar questions, for example, this one, but they did not meet my exact requirements.

David Erickson
  • 16,433
  • 2
  • 19
  • 35
john_mon
  • 487
  • 1
  • 3
  • 13
  • 1
    Conditions like this make vectorized methods hard to use here, I'd probably look into using `numba`, which will let you speed this up and just write it as an explicit loop. – user3483203 Jan 02 '21 at 05:00
  • For solutions shown as of writing this comment, I tested each solution with `import pandas as pd; import numpy as np; np.random.seed(365); df = pd.DataFrame({'a': [np.random.randint(5) for _ in range(3000000)]})`. While all of the solutions work for the test data from the OP, only the [solution](https://stackoverflow.com/a/65536243/7758804) from [U11-Forward](https://stackoverflow.com/users/8708364/u11-forward) correctly resets the `cumsum` to 0 when the `cumsum` of two rows is greater than 5. – Trenton McKinney Jan 02 '21 at 08:00

4 Answers4

4

You can get the cumsum, and floor divide by 5. Then subtract the result of the floor division, multiplied by 5, from the below row's cumulative sum:

c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
df
Out[1]: 
   a  a_cumm_sum
0  2           2
1  3           5
2  0           0
3  5           5
4  1           1
5  3           4
6  1           5
7  2           2
8  2           4
9  1           5

Solution #2 (more robust):

Per Trenton's comment, A good, diverse sample dataset goes a long way to figure out unbreakable logic for these types of problems. I probably would have come up with a better solution first time around with a good sample dataset. Here is a solution that overcomes the sample dataset that Trenton mentioned in the comments. As shown, there are more conditions to handle as you have to deal with carry-over. On a large dataset, this would still be much more performant than a for-loop, but it is much more difficult logic to vectorize:

df = pd.DataFrame({'a': {0: 2, 1: 4, 2: 1, 3: 5, 4: 1, 5: 3, 6: 1, 7: 2, 8: 2, 9: 1}})
c = df['a'].cumsum()
g = 5 * (c // 5)
df['a_cumm_sum'] = (c.shift(-1) - g).shift().fillna(df['a']).astype(int)
over = (df['a_cumm_sum'].shift(1) - 5)
df['a_cumm_sum'] = df['a_cumm_sum'] - np.where(over > 0, df['a_cumm_sum'] - over, 0).cumsum()
s = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum']*-1, 0).cumsum()
df['a_cumm_sum'] = np.where((df['a_cumm_sum'] > 0) & (s > 0), s + df['a_cumm_sum'],
                              df['a_cumm_sum'])
df['a_cumm_sum'] = np.where(df['a_cumm_sum'] < 0, df['a_cumm_sum'].shift() + df['a'], df['a_cumm_sum'])
df
Out[2]: 
   a  a_cumm_sum
0  2         2.0
1  4         6.0
2  1         1.0
3  5         6.0
4  1         1.0
5  3         4.0
6  1         5.0
7  2         2.0
8  2         4.0
9  1         5.0
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • This solution seems to work for the test data, however it does not seem to work correctly if the sum of consecutive numbers is great than 5. For example, given values of `[2, 4, 1]` the solution should be `[2, 6, 1]`, however, this solution returns `[2, 6, 2]` because the remainder over 5 (1 in this case), is carried over in the cumulative sum. _Resets the cumsum total, to 0_ condition is not properly satisfied. – Trenton McKinney Jan 02 '21 at 07:11
  • 1
    If you run it with `np.random.seed(365); df = pd.DataFrame({'a': [np.random.randint(5) for _ in range(1000000)]})`, you see that an issue occurs beginning at row 4. `[(2, 2.0), (4, 6.0), (1, 1.0), (3, 4.0), (4, 3.0), (2, 5.0), (2, 0.0), (1, 1.0), (2, 3.0), (0, 3.0)]`. The sum should be `7` `(1 + 3 + 4)`. – Trenton McKinney Jan 02 '21 at 09:00
  • @TrentonMcKinney Thx, I'll try to get another perspective on it tomorrow. – David Erickson Jan 02 '21 at 09:08
  • 1
    The sum should be `8` `(1 + 3 + 4)`, not `7`, as I stated in my previous comment. Sorry, that was a typo. – Trenton McKinney Jan 02 '21 at 09:41
1

The assignment can be combined with a condition. The code is as follows:

import numpy as np
import pandas as pd

a = [2, 3, 0, 5, 1, 3, 1, 2, 2, 1]
df = pd.DataFrame(a, columns=["a"])
df["cumsum"] = df["a"].cumsum()
df["new"] = df["cumsum"]%5
df["new"][((df["cumsum"]/5)==(df["cumsum"]/5).astype(int)) & (df["a"]!=0)] = 5
df

The output is as follows:

    a   cumsum  new
0   2   2       2
1   3   5       5
2   0   5       0
3   5   10      5
4   1   11      1
5   3   14      4
6   1   15      5
7   2   17      2
8   2   19      4
9   1   20      5

Working:
Basically, take remainder for the cumulative sum for 5. In cases where the actual sum is 5 also becomes zero. So, for these cases, check if the value/5 == int(value/5). Then, remove cases where the actual value is zero.

sotmot
  • 1,256
  • 2
  • 9
  • 21
  • This works for the test data, but fails if the `cumsum` of 2 numbers is greater than 5. For example, given `[2, 4, 1]` the solution should be `[2, 6, 1]`, however, this solution returns `[2, 1, 2]` – Trenton McKinney Jan 02 '21 at 07:43
1

EDIT: As Trenton McKinney pointed out in the comments, OP likely wanted to reset it to 0 whenever the cumsum exceeded 5. This makes the definition to be a recurrence which is usually difficult to do with pandas/numpy (see David's solution). I'd recommend using numba to speed up the for loop in this case


Another alternative: using groupby

In [78]: df.groupby((df['a'].cumsum()% 5 == 0).shift().fillna(False).cumsum()).cumsum()
Out[78]:
   a
0  2
1  5
2  0
3  5
4  1
5  4
6  5
7  2
8  4
9  5
Asish M.
  • 2,588
  • 1
  • 16
  • 31
  • This does not produce the correct result if the sum of consecutive numbers is greater than 5. The condition _Resets the cumsum total, to 0_ is not properly satisfied. For example, given values of `[2, 4, 1]` the solution should be `[2, 6, 1]`, however, this solution returns `[2, 6, 7]` – Trenton McKinney Jan 02 '21 at 07:44
  • 1
    I took OPs "Until it gets a sum of "5" " to mean exactly 5. Your comment makes more sense! – Asish M. Jan 02 '21 at 16:01
0

You could try using this for loop:

lastvalue = 0
newcum = []
for i in df['a']:
    if lastvalue >= 5:
        lastvalue = i
    else:
        lastvalue += i
    newcum.append(lastvalue)
df['a_cum_sum'] = newcum
print(df)

Output:

   a  a_cum_sum
0  2          2
1  3          5
2  0          0
3  5          5
4  1          1
5  3          4
6  1          5
7  2          2
8  2          4
9  1          5

The above for loop iterates through the a column, and when the cumulative sum is 5 or more, it resets it to 0 then adds the a column's value i, but if the cumulative sum is lower than 5, it just adds the a column's value i (the iterator).

U13-Forward
  • 69,221
  • 14
  • 89
  • 114
  • 3
    I'm not the one who downvoted, however, my guess is because, while this may solve the question, this is not a good `pandas` solution, because a `for-loop` with `pandas`, is an `anti-pattern`. As you can see from the other solutions, there are more efficient, vectorized ways to solve the problem. – Trenton McKinney Jan 02 '21 at 06:52
  • However, I do not think this [solution](https://stackoverflow.com/a/65536164/7758804) by [David Erickson](https://stackoverflow.com/users/6366770/david-erickson) is correct. – Trenton McKinney Jan 02 '21 at 07:06
  • @TrentonMcKinney Yeah I am just posted a solution using `for` loop which may not be the best. – U13-Forward Jan 02 '21 at 07:07
  • I stand by my comment regarding `for-loops`, but +1 for being the only solution that meets the condition `Resets the cumsum total, to 0`. The other solutions work only for the test data, but in the case when the `cumsum` of 2 numbers is greater than 5, the other solutions fail. – Trenton McKinney Jan 02 '21 at 07:38
  • I should have said _...there may be more efficient..._. – Trenton McKinney Jan 02 '21 at 09:35
  • 1
    @TrentonMcKinney Thanks for upvoting me :-) yeah it's not so efficient but the way that is the most consistent in working :-) – U13-Forward Jan 02 '21 at 09:39