
I work a lot with large 3D arrays (latitude, longitude, and time), with dimensions of, for example, 721 x 1440 x 480. Usually, I need to perform operations over time for each latitude and longitude, for example, taking the average (resulting in a 2D array), computing a rolling mean in time (resulting in a 3D array), or applying more complex functions.

My question is: which package (or approach) is the most efficient and fastest?

I know one option is base R, using the apply function and, for rolling functions, the zoo package, which provides rollapply. Another way is with the tidyverse, another is with data.table, and there are combinations of these packages. But is there one that is the fastest?

For example, if I have this cube of data:

data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480))

The dimensions are latitude, longitude, and time, like this:

lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)

And I usually need to do things like this (this is in base R + zoo):

# Average in time
average_data <- apply(data, 1:2, mean)

# Rolling mean, width of window = 3
library(zoo)
rolling_mean <- function(x){
  return(rollapply(data = x, width = 3, by = 1, FUN = mean))
}
rolling_mean_data <- apply(X = data, MARGIN = 1:2,
                           FUN = rolling_mean)
# apply() returns the rolling-window results as the first dimension,
# so permute back to (lat, lon, time)
rolling_mean_data <- aperm(a = rolling_mean_data, perm = c(2, 3, 1))

The function is not always the mean; it could also be other statistics, such as the standard deviation or the correlation with a time series.
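
For instance, the same apply() pattern extends to other statistics (a sketch; ref_ts is a hypothetical reference time series of length 480, not part of the data above):

# Standard deviation over time for each grid cell
sd_data <- apply(data, 1:2, sd)

# Correlation of each grid cell's time series with a reference series
ref_ts <- rnorm(480)  # hypothetical reference series, for illustration only
cor_data <- apply(data, 1:2, cor, y = ref_ts)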

So, what is the fastest way to do this type of calculation?

Santiago I. Hurtado
  • Can you make your problem reproducible? – markus Oct 07 '20 at 14:44
  • If your time dimension were bigger, it would make sense to dcast lat/lon (or both) to columns and do `frollmean(.SD)` on them, as they will be computed in parallel (`setDTthreads(0L)` to use all logical CPUs). For an input of length 480 it is probably not worth it. For examples of parallel processing see `?frollmean`. – jangorecki Oct 22 '20 at 15:29
  • You may want to use a specialized geoprocessing library for this. This tutorial on `gdalcubes` may be of interest: https://appelmar.github.io/opengeohub_summerschool2020/tutorial_01.html – Rich Pauloo Oct 27 '20 at 23:58

2 Answers


Usually data.table is quite fast and memory efficient. You might want to check ?data.table::frollapply and note that as.data.table provides an S3 method for class 'array'.

The following compares the base R code snippets you provided with their data.table counterparts (I'm not really familiar with the tidyverse). I'm focusing only on the calculation itself, not on the output format - so for the data.table solution the output isn't converted back to an array. Here is some info on that - also using aperm as you do above (a small sketch of the back-conversion follows the first benchmark below).

The benchmark was created using the reduced dataset (I'll update with the results for your dataset once the computation is done).

Edit: Just updated the benchmark results using the entire dataset.

library(data.table)
library(microbenchmark)
library(zoo)

lat <- seq(from = -90, to = 90, by = 0.25)
lon <- seq(from = 0, to = 359.75, by = 0.25)
time <- seq(from = as.Date('1980-01-01'), by = 'month', length.out = 480)

# data <- array(rnorm(721 * 1440 * 480), dim = c(721, 1440, 480), dimnames = list(lat = lat, lon = lon, time = time))
data <- array(rnorm(72 * 144 * 48), dim = c(72, 144, 48), dimnames = list(lat = NULL, lon  = NULL, time  = NULL)) # reduced dataset
DT <- as.data.table(data)
setorder(DT, time) # as per @jangorecki we need to ensure that DT is sorted by time

rolling_mean <- function(x) {
  return(rollapply(
    data = x,
    width = 3,
    by = 1,
    FUN = mean
  ))
}

microbenchmark(
  "mean - base R" = {
    apply(data, 1:2, mean)
  },
  "mean - data.table" = {
    DT[, .("mean_value" = mean(value)), by = .(lat, lon)]
  },
  times = 1L
)

Unit: seconds
              expr       min        lq      mean    median        uq       max neval
     mean - base R 33.152048 33.152048 33.152048 33.152048 33.152048 33.152048     1
 mean - data.table  8.287379  8.287379  8.287379  8.287379  8.287379  8.287379     1
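
As mentioned above, a rough sketch of converting the aggregated result back to a lat x lon matrix could look like this (not benchmarked; it assumes the integer index columns that as.data.table(array) produces for the reduced dataset, where no dimname values are set):

res <- DT[, .(mean_value = mean(value)), by = .(lat, lon)]
wide <- dcast(res, lat ~ lon, value.var = "mean_value")  # one column per lon index
average_matrix <- as.matrix(wide[, !"lat"])               # drop the lat index column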


microbenchmark(
  "rollmean - base R" = {
      apply(X = data,
            MARGIN = 1:2,
            FUN = rolling_mean)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  times = 1L
)

Unit: seconds
                  expr       min        lq      mean    median        uq       max neval
     rollmean - base R 477.26644 477.26644 477.26644 477.26644 477.26644 477.26644     1
 rollmean - data.table  48.80027  48.80027  48.80027  48.80027  48.80027  48.80027     1

As you can see, data.table delivers the results significantly faster.
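
As @jangorecki points out in the comments above, with a longer time dimension it can also pay off to dcast so that each lat/lon series becomes its own column and to roll all of them in a single frollmean() call, which is computed in parallel. A rough sketch on the reduced dataset (not benchmarked):

setDTthreads(0L)  # use all logical CPUs
wide <- dcast(DT, time ~ lat + lon, value.var = "value")  # one column per lat/lon cell
cols <- setdiff(names(wide), "time")
rolled <- wide[, frollmean(.SD, n = 3), .SDcols = cols]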

ismirsehregal

For efficient means over the array, we can permute the array with aperm() and use colMeans(). This is faster than data.table (my favorite) and does not require coercion from an array to a data.frame-like object. Thanks to @ismirsehregal for the structure of the benchmarks.

Note, this is the smaller dataset. While there is a time penalty for converting from array to data.table, the data.table method would likely become faster at some point, as it can make use of parallel cores, unlike the colMeans approach. I didn't have enough RAM to use the larger dataset.

colMeans(aperm(data, c(3L, 1L, 2L)))

all.equal(apply(data, 1:2, mean), colMeans(aperm(data, c(3L, 1L, 2L))))
## [1] TRUE

microbenchmark::microbenchmark(
  "mean - base R" = apply(data, 1:2, mean),
  "mean - data.table" = DT[, .("mean_value" = mean(value)), by = .(lat, lon)],
  "mean - base colMeans" = colMeans(aperm(data, c(3L, 1L, 2L)))
)

## Unit: milliseconds
##                  expr     min       lq      mean  median      uq      max neval
##         mean - base R 48.6293 52.30380 57.611552 54.8006 61.2880 146.4658   100
##     mean - data.table  7.5610  8.95255 13.004569  9.7459 17.4888  28.7324   100
##  mean - base colMeans  6.1713  6.45505  8.421273  6.6429  7.7849  99.3952   100
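
The same permute-and-reduce idea extends to other column-wise statistics via matrixStats (a sketch, assuming the matrixStats package is installed; not part of the benchmark above):

library(matrixStats)
# flatten the time-first array to a (time x cell) matrix, take the statistic
# of each column, then reshape the result back to lat x lon
tmp <- aperm(data, c(3L, 1L, 2L))
dim(tmp) <- c(dim(data)[3L], dim(data)[1L] * dim(data)[2L])
sd_matrix <- matrix(colSds(tmp), nrow = dim(data)[1L], ncol = dim(data)[2L])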

For efficient rolling means, zoo::rollapply() is intuitive but not fast. We can use a number of package solutions, such as data.table::frollmean() or RcppRoll::roll_mean() (sketched after the links below). See also here for various methods for rolling averages:

Calculating moving average

Or @JanGorecki's question on performance:

Adaptive moving average - top performance in R
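
For completeness, RcppRoll::roll_mean() drops into the same apply() pattern (a sketch, assuming the RcppRoll package is installed; it is not part of the benchmark below):

library(RcppRoll)
rolling_mean_rcpp <- apply(X = data, MARGIN = 1:2, FUN = roll_mean, n = 3L)
rolling_mean_rcpp <- aperm(rolling_mean_rcpp, c(2, 3, 1))  # back to (lat, lon, time)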

I profiled @Matti Pastell's solution and @pipefish's solution from the first link in addition to using data.table::frollmean() with apply:

## @Matti Pastell's (centred moving average via stats::filter; pads the ends with NAs)
base_ma = function(x, n = 5){stats::filter(x, rep(1 / n, n), sides = 2)}
## @pipefish's (cumsum trick; returns length(x) - n + 1 values, no NA padding)
base_ma2 = function(x, n = 5L) {
  cx <- c(0, cumsum(x))
  (cx[(n+1):length(cx)] - cx[1:(length(cx) - n)]) / n
}

microbenchmark(
  "rollmean - apply zoo" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = rolling_mean)
  },
  "rollmean - apply dt.frollmean" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = frollmean, n = 3L)
  },
  "rollmean - data.table" = {
    DT[, rollmean := frollmean(value, n = 3), by = .(lat, lon)]
  },
  "rollmean - apply stats::filter" = {
    apply(X = data,
        MARGIN = 1:2,
        FUN = base_ma, n = 3L)
  },
  "rollmean - apply w_cumsum" = {
    apply(X = data,
          MARGIN = 1:2,
          FUN = base_ma2, n = 3L)
  },
  times = 1L
)

Unit: milliseconds
                           expr       min        lq      mean    median        uq       max neval
           rollmean - apply zoo 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218 2823.6218     1
  rollmean - apply dt.frollmean  331.7787  331.7787  331.7787  331.7787  331.7787  331.7787     1
          rollmean - data.table  358.0787  358.0787  358.0787  358.0787  358.0787  358.0787     1
 rollmean - apply stats::filter  435.5238  435.5238  435.5238  435.5238  435.5238  435.5238     1
      rollmean - apply w_cumsum  148.7428  148.7428  148.7428  148.7428  148.7428  148.7428     1
Cole
  • That was for me a really unexpected and very interesting answer! But if I understood the code correctly, most of these approaches can only be used to compute the mean and are not suitable for other statistics, such as the standard deviation for instance. Maybe I wrote the question poorly, but what I am asking is which is faster for rolling functions over 3D arrays, or for functions over each 2D slice of the 3D array (as apply with MARGIN = 1:2). In that case, ```data.table``` is the most optimal, isn't it? – Santiago I. Hurtado Oct 24 '20 at 14:09
  • There are different parts. For the ```mean()``` part, I think ```matrixStats``` would allow you to have all of the same functions that are optimized in ```data.table```, such as ```colMedians()```. For functions that aren't optimized internally in ```data.table```, it would be interesting to see which is faster. For ```rollapply```, I think sticking with arrays is fine; you just need to be sure the chosen function is performant. I'll see about editing this answer with more investigation in a few days. – Cole Oct 24 '20 at 16:08
  • Sticking with arrays is fine if you don't have NAs, i.e. when you have a complete set of values for every combination of dimensions (as in your example). The more NAs (the sparser the array), the better it is to use DT, because you don't store the NAs and don't need to "skip" them. Also note that `frollapply` is slow; only the rolling mean and sum are implemented "the fast way". – jangorecki Oct 25 '20 at 16:16
  • I have all complete arrays since they are model outputs. I am searching a little bit more into implementations with ```RcppArmadillo```, which seems to be really fast. – Santiago I. Hurtado Oct 26 '20 at 20:48
  • I think using arrays makes sense - you can easily subset matrix slices. The NA storage also does not make sense if you want to do a rolling calculation - we would need those NAs to have the proper padding for our windows. And note, if you are talking [tag:Rcpp], using ```apply(data, MARGIN = 1:2, FUN = special_Rcpp_fx)``` would likely have less of a learning curve compared to [tag:RcppArmadillo] and provide pretty good performance very quickly. – Cole Oct 27 '20 at 01:20