2

I have 34 rasters (nrow: 17735, ncol: 11328, ncell: 200902080) with values 0 and 1, of 4Mb each. I want the cumulative sum of those values with zero reset.

I tried several alternatives based on: Cumulative sum that resets when 0 is encountered

library(microbenchmark)
library(compiler)
library(dplyr)
library(data.table)

x <- c(0,0,0,1,1,0,1,1,1,1,1,1,1,1,1,1,1,1,1,1,0)

fun = function(x)
{ cs = cumsum(x)
  cs - cummax((x == 0) * cs)
}
funC <- cmpfun(fun)

microbenchmark(
funcioEx = fun(x),
funComEx = funC(x),
lapplyEx = unname(unlist(lapply(split(x,cumsum(c(0,diff(x) != 0))), cumsum))),
dataTaEx = data.table(x)[, whatiwant := cumsum(x), by = rleid(x==0L)],
reduceEx = Reduce(function(x, y) if (y == 0) 0 else x+y, x, accumulate=TRUE)
)

I would like to optimize this procedure for my data, because with the second option (funComEx, the fastest) it takes about 3 hours.

mSandoval
  • 35
  • 4
  • I think `fun` is probably one of the fastest ways in R. Looking into `Rcpp` package for more speed might help. – Shree Jul 29 '19 at 23:26

1 Answers1

2

Rcpp may help a little

library(Rcpp)
cppFunction(
    "IntegerVector foo(NumericVector vect){
    int N = vect.size();
    IntegerVector ans(N);
    ans[0] = vect[0];
    for (int i = 1; i < N; i++){
      if(vect[i] > 0){
        ans[i] = ans[i-1] + vect[i];
      } else {
        ans[i] = 0;
      }
    }
    return(ans);
  }"
)

set.seed(42)
x = sample(0:1, 1e4, TRUE)

identical(foo(x), fun(x))
#[1] TRUE

microbenchmark(
    funcioEx = fun(x),
    funComEx = funC(x),
    lapplyEx = unname(unlist(lapply(split(x,cumsum(c(0,diff(x) != 0))), cumsum))),
    dataTaEx = data.table(x)[, whatiwant := cumsum(x), by = rleid(x==0L)],
    reduceEx = Reduce(function(x, y) if (y == 0) 0 else x+y, x, accumulate=TRUE),
    foo_RCPP = foo(x)
)
#Unit: microseconds
#     expr       min         lq        mean     median         uq       max neval
# funcioEx    98.238   104.2495   118.81500   113.1935   121.1110   280.637   100
# funComEx    97.358   103.2230   113.81515   112.1670   118.1785   220.522   100
# lapplyEx 17810.638 18888.9055 20130.20765 19399.7415 20641.0550 28073.981   100
# dataTaEx  3435.387  3832.0025  4468.77932  4023.6395  4347.3840 17053.181   100
# reduceEx  7472.515  8174.4020  9614.23122  8634.7985 10177.1305 15719.788   100
# foo_RCPP    52.491    62.6085    80.65777    66.5670    72.4320  1102.315   100
d.b
  • 32,245
  • 6
  • 36
  • 77
  • Thank you. The Rcpp option does not work. I installed Rtools from here https://cran.r-project.org/bin/windows/Rtools/ and I get the following error: Warning message: In system(cmd) : 'make' not found Error in sourceCpp(code = code, env = env, rebuild = rebuild, cacheDir = cacheDir,: Error 1 occurred building shared library. WARNING: The tools required to build C++ code for R were not found. Please download and install the appropriate version of Rtools: http://cran.r-project.org/bin/windows/Rtools/ – mSandoval Jul 30 '19 at 01:16
  • I was able to install Rtools and ran the script. Significantly improves speed. However, when I apply it on a rasters stack (rasterStack) I get the following error: `Error in foo (rasterStack): Not compatible with requested type: [type = S4; target = double].` I think this is because in the example I used a vector, but I don't know how to adapt in your script for a raster. I appreciate any suggestions. – mSandoval Jul 30 '19 at 20:25
  • This is an example with an array and the corresponding error: `library(raster) # Sample data r1 <- r2 <- r3 <- raster(matrix(0, 10, 10)) r <- stack(r1, r2, r3) one <- c(15:20, 25:30, 35:40, 45:50, 55:60) r[one] <- 1 library(Rcpp) cppFunction( "IntegerVector foo(NumericVector vect){ int N = vect.size(); IntegerVector ans(N); ans[0] = vect[0]; for (int i = 1; i < N; i++){ if(vect[i] > 0){ ans[i] = ans[i-1] + vect[i]; } else { ans[i] = 0; } } return(ans); }" ) foo(r)` – mSandoval Jul 31 '19 at 15:39
  • This script works with a "Rasterbrick" object but not with "RasterStack" and I have the rasters in a RasterStack. I tried this to reproduce the error: `rStack <- stack(r); rFooStack = rStack ; rFooStack@layers@values[] = t(apply(rStack@layers@values, 1, foo)) Error in apply(rStack@layers@values, 1, foo): trying to get slot "values" from an object of a basic class ("list") with no slots.` That is the error it gives me when I apply it in a RasterStack. Thanks again. – mSandoval Jul 31 '19 at 21:49
  • Finally, I could run the script with my data. Now the error is: `Error: cannot allocate vector of size 25.4 Gb`. I used `object.size()` of the rasterStack and the result is `440224 bytes`. Any other suggestions? – mSandoval Aug 01 '19 at 00:29