9

I am working with multidimensional array both on R and MATLAB, these arrays have five dimensions (total of 14.5M of elements). I have to remove a dimension applying an arithmetic mean on it and I discovered an amazing difference of performances using the two softwares.

MATLAB:

>> a = rand([144  73  10   6  23]);
>> tic; b = mean(a,3); toc
Elapsed time is 0.014454 seconds.

R:

> a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))
> start <- Sys.time (); b = apply(a, c(1,2,4,5), mean); Sys.time () - start
Time difference of 1.229083 mins

I know that apply function is slow because is something like a general purpose function, but I don't know how to deal with this problem because this difference of performances is really a big limit for me. I tried to search for a generalization of colMeans/rowMeans functions but I didn't succeed.

EDIT I'll show a little sample matrix:

> dim(a)
[1] 2 4 3
> dput(aa)
structure(c(7, 8, 5, 8, 10, 11, 9, 9, 6, 12, 9, 10, 12, 10, 14, 
12, 7, 9, 8, 10, 10, 9, 8, 6), .Dim = c(2L, 4L, 3L))
a_mean = apply(a, c(2,3), mean)
> a_mean
     [,1] [,2] [,3]
[1,]  7.5  9.0  8.0
[2,]  6.5  9.5  9.0
[3,] 10.5 11.0  9.5
[4,]  9.0 13.0  7.0

EDIT (2):

I discovered that applying sum function and then dividing by the size of the removed dimension is definitely faster:

> start <- Sys.time (); aaout = apply(aa, c(1,2,4,5), sum); Sys.time () - start
Time difference of 5.528063 secs
shoelzer
  • 10,648
  • 2
  • 28
  • 49
Matteo De Felice
  • 1,488
  • 9
  • 23

2 Answers2

23

In R, apply is not the right tool for the task. If you had a matrix and needed the row or column means, you would use the much much faster, vectorized rowMeans and colMeans. You can still use these for a multi-dimensional array but you need to be a little creative:

Assuming your array has n dimensions, and you want to compute means along dimension i:

  1. use aperm to move the dimension i to the last position n
  2. use rowMeans with dims = n - 1

Similarly, you could:

  1. use aperm to move the dimension i to the first position
  2. use colMeans with dims = 1

a <- array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

means.along <- function(a, i) {
  n <- length(dim(a))
  b <- aperm(a, c(seq_len(n)[-i], i))
  rowMeans(b, dims = n - 1)
}

system.time(z1 <- apply(a, c(1,2,4,5), mean))
#    user  system elapsed 
#  25.132   0.109  25.239 
system.time(z2 <- means.along(a, 3))
#    user  system elapsed 
#   0.283   0.007   0.289 

identical(z1, z2)
# [1] TRUE
flodel
  • 87,577
  • 21
  • 185
  • 223
  • Exactly. Always use vectorized funcs over loops or *apply whenever possible. – Carl Witthoft Sep 05 '13 at 11:31
  • This should *definitely* be the accepted answer. +1 for a great explanation of the generalised case. I was looking `aperm` but just could not get it right. Thanks! – Simon O'Hanlon Sep 05 '13 at 12:25
  • For completeness' sake, `rowMeans` does not use the same algorithm `mean` uses; the first is naive single-pass accumulate and divide; the latter has a second pass to improve numerical stability. – Ferdinand.kraft Sep 05 '13 at 19:00
  • Can you explain more why you need `aperm`? I just found that `all(colMeans(a, dims=1) == means.along(a,1))` is true. I.e. `colMeans()` by itself did what I needed (on a 3D array). Does `means.along()` handle cases that colMeans() or rowMeans() would not be able to by themselves? – Darren Cook Aug 21 '16 at 18:21
  • 1
    @DarrenCook. While doing their summation work, `colMeans` will remove the first `n` dimensions of an array whereas `rowMeans` will remove the last `n` dimensions of an array (where you have control over the value of `n` via the `dims` argument). So by themselves, you see how they are not capable of handling OP's case where he wants to remove the 3rd dimension of a 5d array. Hence the need for `aperm` to move that 3rd dimension either in 1st position (so that `colMeans` can be used) or in 5th position (so that `roMeans` can be used). – flodel Aug 21 '16 at 19:29
  • @flodel Ah - I see! I thought it might be something special about 3D arrays, but I see the fact that I've not needed aperm is because I've only needed to work on either the 1st or 3rd dimension! (It might be worth a note in your answer to say if you need to work on dimension 1 or dimension N, of a N-dimension array, that aperm can be skipped.) (I did just notice your edit, which also helps make it clearer.) – Darren Cook Aug 21 '16 at 19:34
5

mean is particularly slow because of S3 method dispatch. This is faster:

set.seed(42)
a = array(data = runif(144*73*6*23*10), dim = c(144,73,10,6,23))

system.time({b = apply(a, c(1,2,4,5), mean.default)})
# user  system elapsed 
#16.80    0.03   16.94

If you don't need to handle NAs you can use the internal function:

system.time({b1 = apply(a, c(1,2,4,5),  function(x) .Internal(mean(x)))})
# user  system elapsed 
# 6.80    0.04    6.86

For comparison:

system.time({b2 = apply(a, c(1,2,4,5),  function(x) sum(x)/length(x))})
# user  system elapsed 
# 9.05    0.01    9.08 

system.time({b3 = apply(a, c(1,2,4,5),  sum)
             b3 = b3/dim(a)[[3]]})
# user  system elapsed 
# 7.44    0.03    7.47

(Note that all timings are only approximate. Proper benchmarking would require running this repreatedly, e.g., using one of the bechmarking packages. But I'm not patient enough for that right now.)

It might be possible to speed this up with an Rcpp implementation.

Roland
  • 127,288
  • 10
  • 191
  • 288