5

Just wondering if anyone has ever encountered the problem where he/she needs to randomly draw from a very high dimensional multivariate normal distribution (say dimension = 10,000), as the rmvnorm function of the mvtnorm package is impractical for that.

I know this article has an Rcpp implementation for the dmvnorm function of the mvtnorm package, so I was wondering if something equivalent exists for rmvnorm?

jay.sf
  • 60,139
  • 8
  • 53
  • 110
user1701545
  • 5,706
  • 14
  • 49
  • 80

1 Answers1

9

Here's a quick comparison of mvtnorm::rmvnorm and an Rcpp implementation given here by Ahmadou Dicko. The times presented are for 100 draws from a multivariate normal distribution with dimension ranging from 500 to 2500. From the graph below you can probably infer the time required for dimension of 10000. Times include the overhead of generating the random mu vector and the diag matrix, but these are consistent across approaches and are trivial for the dimensions in question (e.g. 0.2 sec for diag(10000)).

library(Rcpp)
library(RcppArmadillo)
library(inline)
library(mvtnorm)

code <- '
using namespace Rcpp;
int n = as<int>(n_);
arma::vec mu = as<arma::vec>(mu_);
arma::mat sigma = as<arma::mat>(sigma_);
int ncols = sigma.n_cols;
arma::mat Y = arma::randn(n, ncols);
return wrap(arma::repmat(mu, 1, n).t() + Y * arma::chol(sigma));
'

rmvnorm.rcpp <- 
  cxxfunction(signature(n_="integer", mu_="numeric", sigma_="matrix"), code,
              plugin="RcppArmadillo", verbose=TRUE)

rcpp.time <- sapply(seq(500, 5000, 500), function(x) {
  system.time(rmvnorm.rcpp(100, rnorm(x), diag(x)))[3]  
})

mvtnorm.time <- sapply(seq(500, 2500, 500), function(x) {
  system.time(rmvnorm(100, rnorm(x), diag(x)))[3]  
})


plot(seq(500, 5000, 500), rcpp.time, type='o', xlim=c(0, 5000),
     ylim=c(0, max(mvtnorm.time)), xlab='dimension', ylab='time (s)')

points(seq(500, 2500, 500), mvtnorm.time, type='o', col=2)

legend('topleft', legend=c('rcpp', 'mvtnorm'), lty=1, col=1:2, bty='n')

enter image description here

jbaums
  • 27,115
  • 5
  • 79
  • 119
  • 1
    Thanks a lot jbaums. What would I need to add to this code in order for it to use openmp (multiple cores)? – user1701545 Mar 30 '14 at 18:14
  • I don't have experience with openmp, but the bottleneck here is the Cholesky decomposition, so [this](http://stackoverflow.com/questions/22479258/cholesky-decomposition-with-openmp) might be helpful. [This](http://stackoverflow.com/questions/22389117/parallelisation-in-armadillo/22584873#22584873) could also be useful. – jbaums Mar 30 '14 at 21:40
  • [jbaums](https://stackoverflow.com/users/489704/jbaums) I initially got even better performance than you display above. However, as I increased the number of samples taken from 100 to 10000 the Rcpp slowed enormously (while the mvtnorm effectively did not change). I cannot figure out what is going on. Memory management? Do you have any ideas? – josephwb Feb 07 '18 at 14:21