Pandas: cumsum with conditions

Question

I have the following dataframe in pandas:

I would like to calculate a column C as the cumulative difference between B and A but with at the condition that C cannot be lower than 0 and higher than 15. How could I obtain the following result with the cumsum function?

   A     B     C
   5     0     0
   5     10    5
   5     0     0
   5     0     0
   5     50    15

mozway · Answer 1 · 2023-08-04T04:47:07.643

2

@郑远程 has the correct approach (this answer should be accepted), as the desired operation requires a loop and cannot be vectorized:

Here is a just variant using a function and numba to improve efficiency for those interested:

import numba as nb

@nb.jit(nopython=True)
def cumreset(s):
    out = []
    tot = 0
    for v in s:
        tot = max(0, min(tot+v, 15))
        out.append(tot)
    return out

df['C'] = cumreset(df['B'].sub(df['A']).to_numpy())

Output:

   A   B   C
0  5   0   0
1  5  10   5
2  5   0   0
3  5   0   0
4  5  50  15

Comparison of the python and numba loops:

edited Aug 04 '23 at 04:47

answered Aug 03 '23 at 15:37

mozway

194,879
13
39
75

could a variation of this question provide a vectorized solution? https://stackoverflow.com/questions/65223420/cumsum-on-pandas-df-with-reset-to-zero-for-negative-cumulative-values/65223451#65223451 – rhug123 Aug 03 '23 at 21:23

郑远程 · Answer 2 · 2023-08-03T13:51:11.090

Oops! My apologies for that oversight. The issue is occurring because the cumulative_diff function is returning a list within the expanding().apply() method, causing a type mismatch.

To fix this, we can change our approach slightly by using a loop to calculate the cumulative sum. Here's the corrected code:

import pandas as pd

df = pd.DataFrame({
    'A': [5, 5, 5, 5, 5],
    'B': [0, 10, 0, 0, 50]
})

# Calculate the difference between B and A
df['diff'] = df['B'] - df['A']

# Initialize a variable to hold the cumulative difference
cum_diff = 0

# Create an empty list to hold the results
results = []

# Loop through the differences, calculate the cumulative sum, and append to the results
for value in df['diff']:
    cum_diff += value
    cum_diff = max(0, cum_diff)  # Reset to 0 if cumulative difference is less than 0
    cum_diff = min(15, cum_diff) # Reset to 15 if cumulative difference is more than 15
    results.append(cum_diff)

# Assign the results to the new 'C' column
df['C'] = results

# Remove the temporary difference column
df.drop('diff', axis=1, inplace=True)

print(df)

This will yield the desired output:

   A   B   C
0  5   0   0
1  5  10   5
2  5   0   0
3  5   0   0
4  5  50  15

I hope this corrected version works for you, and I'm sorry again for the confusion.

TY very much for this answer. I think you are very close from what I wanted but I have an error when trying the expanding function in your code : 'TypeError: must be real number, not list' — RomB, Aug 03 '23 at 13:44
You can't use vectorial code with a sum reset, a loop like in this answer is the correct approach — mozway, Aug 03 '23 at 15:05

score 0 · Answer 3 · answered Aug 03 '23 at 13:35

0

I am not sure that your desired output corresponds to your text. I have followed your text above:

df['diff'] = df['B'] - df['A']
df['diff'] = np.where((df['diff']>0) & (df['diff']<15), df['diff'], 0 )
df['C'] = df['diff'].cumsum()
df

   A   B  diff  C
0  5   0     0  0
1  5  10     5  5
2  5   0     0  5
3  5   0     0  5
4  5  50     0  5

answered Aug 03 '23 at 13:35

gtomer

5,643
1
10
21

TY for this answer, but this is not the output I would like in C – RomB Aug 03 '23 at 13:46
Pls explain better how did you get to the expected output – gtomer Aug 03 '23 at 13:51

score 0 · Answer 4 · answered Aug 03 '23 at 21:20

0

Try this:

n = df['B'].cumsum().sub(df['A'].cumsum())

df.assign(C = df['A'].groupby(n.lt(0).iloc[::-1].cumsum()).cumsum().rsub(df['B'].cumsum()).clip(0,15).where(n.ge(0),0))

Output:

   A   B   C
0  5   0   0
1  5  10   5
2  5   0   0
3  5   0   0
4  5  50  15

answered Aug 03 '23 at 21:20

rhug123

7,893
1
9
24

1

To address your comment to my answer, I really don't think a vectorial method can be used. This one fails as soon as the dataframes is a bit longer. Try: `N = 10 ; np.random.seed(0) ; df = pd.DataFrame({'A': np.random.randint(0, 50, size=N), 'B': np.random.randint(0, 50, size=N)})`. To have a working solution you would need to make recursive/iterative groups, which would be inefficient. – mozway Aug 04 '23 at 04:48

Pandas: cumsum with conditions

4 Answers4