15

I would like to do a cumulative sum on a field but reset the aggregated value whenever a 0 is encountered.

Here is an example of what I want :

data.frame(campaign = letters[1:4] , 
       date=c("jan","feb","march","april"),
       b = c(1,0,1,1) ,
       whatiwant = c(1,0,1,2)
       )

 campaign  date b whatiwant
1        a   jan 1         1
2        b   feb 0         0
3        c march 1         1
4        d april 1         2
josliber
  • 43,891
  • 12
  • 98
  • 133
patpat
  • 197
  • 2
  • 11
  • 2
    The answers to [this question I asked a couple of weeks ago](http://stackoverflow.com/questions/32247414/create-sequential-counter-that-restarts-on-a-condition-within-panel-data-groups) should help you solve this problem. – ulfelder Sep 10 '15 at 12:36
  • Related: [Create counter within consecutive runs of certain values](https://stackoverflow.com/questions/5012516/create-counter-within-consecutive-runs-of-certain-values) – Henrik May 20 '20 at 12:19

5 Answers5

23

Another base would be just

with(df, ave(b, cumsum(b == 0), FUN = cumsum))
## [1] 1 0 1 2

This will just divide column b to groups according to 0 appearances and compute the cumulative sum of b per these groups


Another solution using the latest data.table version (v 1.9.6+)

library(data.table) ## v 1.9.6+
setDT(df)[, whatiwant := cumsum(b), by = rleid(b == 0L)]
#    campaign  date b whatiwant
# 1:        a   jan 1         1
# 2:        b   feb 0         0
# 3:        c march 1         1
# 4:        d april 1         2

Some benchmarks per comments

set.seed(123)
x <- sample(0:1e3, 1e7, replace = TRUE)
system.time(res1 <- ave(x, cumsum(x == 0), FUN = cumsum))
# user  system elapsed 
# 1.54    0.24    1.81 
system.time(res2 <- Reduce(function(x, y) if (y == 0) 0 else x+y, x, accumulate=TRUE))
# user  system elapsed 
# 33.94    0.39   34.85 
library(data.table)
system.time(res3 <- data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)])
# user  system elapsed 
# 0.20    0.00    0.21 

identical(res1, as.integer(res2))
## [1] TRUE
identical(res1, res3$whatiwant)
## [1] TRUE
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • 2
    This, annoyingly, needs to calculate `cumsum` twice. :-/ – Konrad Rudolph Sep 10 '15 at 12:42
  • Can you try with `with(rle(df1$b!=0), sequence(lengths)*rep(values, lengths))` – akrun Sep 10 '15 at 12:57
  • @akrun I'm getting different results. Maybe you right and we are wrong, dunno. – David Arenburg Sep 10 '15 at 13:02
  • I coded it based on the assumption that the column values were 0, 1. Your example is different so it wouldn't work. – akrun Sep 10 '15 at 13:03
  • @akrun oh, so maybe undelete your answer and put that as an assumption. Your solution should be very efficient in that case I'd guess. – David Arenburg Sep 10 '15 at 13:04
  • @DavidArenburg It's okay. I think your solutions are general. – akrun Sep 10 '15 at 13:05
  • Thanks it is perfect, I also like tried your your one akrun and works pretty well too. – patpat Sep 10 '15 at 14:48
  • @patpat akruns solution works only if your column is binary, – David Arenburg Sep 10 '15 at 14:56
  • @akrun your `rle` solution will work if you do `rle(data ==0)` or something to that effect, making it essentially binary. (And since `rle` is my favorite Rswissarmycodeknife, I hope you'll edit and undelete your answer :-) – Carl Witthoft Sep 10 '15 at 15:51
  • @CarlWitthoft Thanks for the comments. I did try with some general data such as `v2 <- c(3, 2, 0, 1,1,2,0,4)` . But, after a few attempts, I couldn't find a concise way enough to beat (:-)) DavidArenburg's elegant code. If you come up with some, please do post it. – akrun Sep 10 '15 at 16:08
13

Another late idea:

ff = function(x)
{
    cs = cumsum(x)
    cs - cummax((x == 0) * cs)
}
ff(c(0, 1, 3, 0, 0, 5, 2))
#[1] 0 1 4 0 0 5 7

And to compare:

library(data.table)
ffdt = function(x) 
    data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)]$whatiwant

x = as.numeric(x) ##because 'cumsum' causes integer overflow
identical(ff(x), ffdt(x))
#[1] TRUE
microbenchmark::microbenchmark(ff(x), ffdt(x), times = 25)
#Unit: milliseconds
#    expr      min       lq   median       uq      max neval
#   ff(x) 315.8010 362.1089 372.1273 386.3892 405.5218    25
# ffdt(x) 374.6315 407.2754 417.6675 447.8305 534.8153    25
alexis_laz
  • 12,884
  • 4
  • 27
  • 37
5

You could use the Reduce function with a custom function that returns 0 when the new value encountered is 0 and otherwise adds the new value to the accumulated value:

Reduce(function(x, y) if (y == 0) 0 else x+y, c(1, 0, 1, 1), accumulate=TRUE)
# [1] 1 0 1 2
josliber
  • 43,891
  • 12
  • 98
  • 133
0

hutilscpp::cumsum_reset is designed for this purpose. The first argument is a logical vector, indicating when the cumulative sum should continue. The second argument is the input to the cumulative sum itself

library(hutilscpp)
b <- c(1, 0, 1, 1)
cumsum_reset(as.logical(b), b)

On my machine, compared to the data.table function above, this use of cumsum_reset is about 3 times faster.

Hugh
  • 15,521
  • 12
  • 57
  • 100
0

Another variant using rep of the values at places where to reset. This uses cumsum only once but the drawback will lead to large numbers and can cause an integer overflow or inaccurate numeric values and will propagate NA to all following groups.

x <- cumsum(DF$b)
i <- which(DF$b == 0)
x - rep(c(0, x[i]), diff(c(1L, i, length(x)+1L)))
#[1] 1 0 1 2

Another way is to use Rcpp - in this case for integer.

Rcpp::cppFunction('IntegerVector csrA(const IntegerVector x, int z=0) {
  IntegerVector out(no_init(x.size()));
  int init = z == NA_INTEGER ? 0 : z;
  int s = 0;
  for(int i = 0; i < x.size(); ++i) {
    if(x[i] == z) s = init;
    else [[likely]] s += x[i];
    out[i] = s;
  }
  return out;
}')
csrA(DF$b)
#[1] 1 0 1 2

A variant also taking care of NA might look like:

Rcpp::cppFunction('IntegerVector csr(const IntegerVector x, int z=0) {
  IntegerVector out(no_init(x.size()));
  int init = z == NA_INTEGER ? 0 : z;
  LogicalVector isNA = is_na(x);
  int s = 0;
  for(int i = 0; i < x.size(); ++i) {
    if(x[i] == z) s = init;
    else [[likely]] if(isNA[i] || s == NA_INTEGER) s = NA_INTEGER;
      else [[likely]] s += x[i];
    out[i] = s;
  }
  return out;
}')

csr(c(2,4,3,0,3,5), 0)
#[1] 2 6 9 0 3 8

csr(c(2,NA,3,0,3,5), 0)
#[1]  2 NA NA  0  3  8

csr(c(2,4,3,1,3,5), 1)
#[1] 2 6 9 1 4 9

csr(c(2,4,3,NA,3,5), NA)
#[1] 2 6 9 0 3 8

Data

DF <- data.frame(campaign = letters[1:4] , 
                 date=c("jan","feb","march","april"),
                 b = c(1,0,1,1) ,
                 whatiwant = c(1,0,1,2)
                 )

Benchmark - Based on @David Arenburg

set.seed(123)
#Using 1e3 instead of 1e2 would lead to an integer overflow for whichRep and cummax
x <- sample(0:1e2, 1e7, TRUE)

library(data.table)

bench::mark(
ave = ave(x, cumsum(x == 0), FUN = cumsum),
data.table = data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)]$whatiwant,
cummax = {cs = cumsum(x)
  cs - cummax((x == 0) * cs)},
whichRep = {y <- cumsum(x)
i <- which(x == 0)
y - rep(c(0, y[i]), diff(c(1L, i, length(x)+1L)))},
RcppNA = csr(x),
RcppSimple = csrA(x)
)

Result

  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_…¹
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl> <bch:tm>
1 ave           1.06s    1.06s     0.945   751.8MB     3.78     1     4    1.06s
2 data.table 199.01ms 266.26ms     3.76    231.9MB     1.88     2     1 532.53ms
3 cummax      90.57ms  93.76ms    10.4     152.6MB     6.92     6     4  578.4ms
4 whichRep     74.5ms  77.05ms    12.9     195.6MB    11.1      7     6 541.63ms
5 RcppNA      39.55ms  40.84ms    24.2      76.3MB     5.60    13     3  536.1ms
6 RcppSimple  29.73ms  30.59ms    32.3      38.1MB     3.80    17     2  526.1ms
GKi
  • 37,245
  • 2
  • 26
  • 48