8

This question can be considered related to this one, that helped me to improve the R performances in computing the mean on a big array. Unfortunately, in this case I'm trying to apply something more complex (like a quantile calculation).

I have a 4-D array with more than 40 millions of elements and I want to calculate the 66th percentile on a specific dimension. Here there is the MATLAB code:

> n = randn(100, 50, 100, 20);
> tic; q = quantile(n, 0.66, 4); toc
Elapsed time is 0.440824 seconds.

Let's do something similar in R.

> n = array(rnorm(100*50*100*20), dim = c(100,50,100,20))
> start = Sys.time(); q = apply(n, 1:3, quantile, .66); print(Sys.time() - start)
Time difference of 1.600693 mins

I was aware of the better performances of MATLAB wrt R but in this case I don't know what to do. Probably I just need to wait 2 minutes instead of one second... I hope someone can suggest me any way to improve running times, anyway, thank you in advance...

UPDATE I've applied some of the suggestions into the comments and I've reduced the running time:

> start = Sys.time(); q = apply(n, 1:3, quantile, .66, names = FALSE); print(Sys.time() - start)
Time difference of 33.42773 secs

We're still far from the MATLAB performances but at least I've learnt something.

UPDATE I put here some advancements related to `quantile' function discussed here. The running time of same code I've shown above has passed from 33 to 5 seconds...

Community
  • 1
  • 1
Matteo De Felice
  • 1,488
  • 9
  • 23
  • 2
    Adding ``names=FALSE`` to the ``apply`` call (so it is passed on to ``quantile``) is three times faster on my machine. – orizon Apr 02 '14 at 07:50
  • Also the MATLAB/R comparison is not really a fair one because the default definition of quantile is different (and more complicated) in R. – orizon Apr 02 '14 at 07:51
  • 1
    Substituting ``quantile.default`` for ``quantile`` saves a further ~15%. – orizon Apr 02 '14 at 08:01
  • I do love `tic` `toc`, but in R, we can use `system.time(q <- apply(n, 1:3, quantile, .66))`. :) – jbaums Apr 02 '14 at 09:07
  • Do these return the same output? It looks like the MATLAB code does quantiles on the 4th dimension, while the R code does quantiles on dimensions 1-3. To @orizon's point, there's also a `type` argument to R's `quantile` function that may be useful. – Richard Herron Apr 02 '14 at 09:28
  • I love you guys...thank you for your interesting comments, I'll give a try adding `names=FALSE` and using `quantile.default`. – Matteo De Felice Apr 02 '14 at 10:16
  • @RichardHerron `apply` has a different sintax, if you specificy as margins the first three dimensions it means that you want to "keep" the first three dimensions, it's totally the opposite than how the MATLAB syntax works. And, yes, both R and MATLAB provide the "same" results (I've checked the output dimensions at least). – Matteo De Felice Apr 02 '14 at 10:21
  • 1) how do you check the results since it is random data ? – Karl Forner Apr 04 '14 at 12:15
  • 1
    I didn't check the results thoroughly because I am sure that MATLAB and R will give me the same results. I've just checked the dimensions and same output statistics. – Matteo De Felice Apr 05 '14 at 18:20

1 Answers1

5

The RcppOctave package calls the GNU Octave API functions; if you don't already know about GNU Octave, it is very similar to Matlab and aims for complete compatiility.

This is nearly as fast as Matlab direct...

library(RcppOctave)

set.seed(1)
n = array(rnorm(100*50*100*20), dim = c(100,50,100,20))

system.time( s <- octave_style_quantile(n, .66, 4) )
##    user  system elapsed 
##   0.526   0.048   0.574

# *R* `quantile` argument `type=5` is the method that matches matlab.
system.time( r <- apply(n, 1:3, quantile, .66, names = FALSE, type=5) )
##    user  system elapsed 
##  23.308   0.029  23.327

dim(r)
## [1] 100  50 100

identical(r,s)
## [1] TRUE

A fairly straight forward translation of Octave's quantile.m to R.

octave_style_quantile <- function (x, p=NULL, dim=NULL) {
  if ( is.null(p) ) p <- c(0.00, 0.25, 0.50, 0.75, 1.00)

  if ( is.null(dim) ) {
    ## Find the first non-singleton dimension.
    dim <- which(dim(x) > 1)[1];
  }

  stopifnot( is.numeric(x)||is.logical(x),
             is.numeric(p),
             dim <= length(dim(x)) )

  ## Set the permutation vector.
  perm <- seq_along(dim(x))
  perm[1] <- dim
  perm[dim] <- 1

  ## Permute dim to the 1st index.
  x <- aperm(x, perm);

  ## Save the size of the permuted x N-d array.
  sx = dim(x);

  ## Reshape to a 2-d array.
  dim(x) <- c( sx[1], prod(sx[-1]) );

  ## Calculate the quantiles.
  q = .CallOctave("quantile",x,p)

  ## Return the shape to the original N-d array.
  dim(q) <- c( length(p), sx[-1] )

  ## Permute the 1st index back to dim.
  q = aperm(q, perm);
  if( any(dim(q)==1) ) dim(q) <- dim(q)[-which(dim(q)==1)]
  q
}
Thell
  • 5,883
  • 31
  • 55