Conditional cumulative sum in Python/Pandas

Question

Consider my dataframe, df:

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1            2
  3       1            3
  10      0            0
  7       0            0
  3       1            1

How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?

The first group of 1's had a single 1 and sum_data has only a 1. However, the second group of 1's has 3 1's and sum_data is [1, 2, 3].

I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns

array([1, 0, 2, 3, 4, 0, 0, 5])

Which is not what I want.

What have you tried and what is your requirement? Your question is not clear. Edit it. — Mohammad Yusuf, Jan 02 '17 at 02:46
still not a clue what result you want and what you have tried that does not work. — dartdog, Jan 02 '17 at 02:55

score 21 · Answer 1 · edited Jan 16 '22 at 12:20

You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

Output

0    1
1    0
2    1
3    2
4    3
5    0
6    0
7    1
Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side

cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))

Output

data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1             NaN           NaN             1
1            0               1             1.0           1.0             0
2            1               2             NaN           1.0             1
3            1               3             NaN           1.0             2
4            1               4             NaN           1.0             3
5            0               4             4.0           4.0             0
6            0               4             4.0           4.0             0
7            1               5             NaN           4.0             1

The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.

Very interesting solution. How could this be applied along rows? (axis=1) I am trying without any success. — Newbielp, Jan 08 '21 at 18:50
The number of columns (6) in the output for the second part does not seem to match the number of column headers (5). Or in other words, the last column seems to be missing a column header. — Peter Mortensen, Jan 16 '22 at 12:23

score 9 · Answer 2 · edited Jan 16 '22 at 12:09

I think you can groupby with DataFrameGroupBy.cumsum by Series, where first compare the next value by the shifted column if not equal (!=) and then create groups by cumsum. Last, replace 0 by column data_binary with mask:

print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: data_binary, dtype: int32

df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                .cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
   data  data_binary  sum_data  sum_data1
0     2            1         1          1
1     5            0         0          0
2     1            1         1          1
3     4            1         2          2
4     3            1         3          3
5    10            0         0          0
6     7            0         0          0
7     3            1         1          1

You now have maverick hat! – piRSquared Jan 05 '17 at 18:51 — piRSquared, Jan 05 '17 at 18:51

gibbone · Answer 3 · 2019-05-09T10:09:47.017

If you want the excellent piRSquared's answer in just one single command:

df['sum_data'] = df[['data_binary']].apply(
    lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int), 
    axis=0)

Note that the double squared bracket on the right hand side is necessary to make a one-column DataFrame instead of a Series in order to use apply with the axis argument (which is not available when apply is used on Series).

Conditional cumulative sum in Python/Pandas

3 Answers3

Output

Output

Linked