13

Consider my dataframe, df:

data  data_binary  sum_data
  2       1            1
  5       0            0
  1       1            1
  4       1            2
  3       1            3
  10      0            0
  7       0            0
  3       1            1

How can I calculate the cumulative sum of data_binary within groups of contiguous 1 values?

The first group of 1's had a single 1 and sum_data has only a 1. However, the second group of 1's has 3 1's and sum_data is [1, 2, 3].

I've tried using np.where(df['data_binary'] == 1, df['data_binary'].cumsum(), 0), but that returns

array([1, 0, 2, 3, 4, 0, 0, 5])

Which is not what I want.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
GrayHash
  • 297
  • 2
  • 9

3 Answers3

21

You want to take the cumulative sum of data_binary and subtract the most recent cumulative sum where data_binary was zero.

b = df.data_binary
c = b.cumsum()
c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)

Output

0    1
1    0
2    1
3    2
4    3
5    0
6    0
7    1
Name: data_binary, dtype: int64

Explanation

Let's start by looking at each step side by side

cols = ['data_binary', 'cumulative_sum', 'nan_non_zero', 'forward_fill', 'final_result']
print(pd.concat([
        b, c,
        c.mask(b != 0),
        c.mask(b != 0).ffill(),
        c.sub(c.mask(b != 0).ffill(), fill_value=0).astype(int)
    ], axis=1, keys=cols))

Output

data_binary  cumulative_sum  nan_non_zero  forward_fill  final_result
0            1               1             NaN           NaN             1
1            0               1             1.0           1.0             0
2            1               2             NaN           1.0             1
3            1               3             NaN           1.0             2
4            1               4             NaN           1.0             3
5            0               4             4.0           4.0             0
6            0               4             4.0           4.0             0
7            1               5             NaN           4.0             1

The problem with cumulative_sum is that the rows where data_binary is zero, do not reset the sum. And that is the motivation for this solution. How do we "reset" the sum when data_binary is zero? Easy! I slice the cumulative sum where data_binary is zero and forward fill the values. When I take the difference between this and the cumulative sum, I've effectively reset the sum.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
piRSquared
  • 285,575
  • 57
  • 475
  • 624
  • Very interesting solution. How could this be applied along rows? (axis=1) I am trying without any success. – Newbielp Jan 08 '21 at 18:50
  • The number of columns (6) in the output for the second part does not seem to match the number of column headers (5). Or in other words, the last column seems to be missing a column header. – Peter Mortensen Jan 16 '22 at 12:23
9

I think you can groupby with DataFrameGroupBy.cumsum by Series, where first compare the next value by the shifted column if not equal (!=) and then create groups by cumsum. Last, replace 0 by column data_binary with mask:

print (df.data_binary.ne(df.data_binary.shift()).cumsum())
0    1
1    2
2    3
3    3
4    3
5    4
6    4
7    5
Name: data_binary, dtype: int32

df['sum_data1'] = df.data_binary.groupby(df.data_binary.ne(df.data_binary.shift()).cumsum())
                                .cumsum()
df['sum_data1'] = df['sum_data1'].mask(df.data_binary == 0, 0)
print (df)
   data  data_binary  sum_data  sum_data1
0     2            1         1          1
1     5            0         0          0
2     1            1         1          1
3     4            1         2          2
4     3            1         3          3
5    10            0         0          0
6     7            0         0          0
7     3            1         1          1
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
0

If you want the excellent piRSquared's answer in just one single command:

df['sum_data'] = df[['data_binary']].apply(
    lambda x: x.cumsum().sub(x.cumsum().mask(x != 0).ffill(), fill_value=0).astype(int), 
    axis=0)

Note that the double squared bracket on the right hand side is necessary to make a one-column DataFrame instead of a Series in order to use apply with the axis argument (which is not available when apply is used on Series).

gibbone
  • 2,300
  • 20
  • 20