7

This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation.

I'm simulating dispersal of organisms amongst multiple populations. Individuals from population j disperse to population i with probability p[i, j]. Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6) to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom:

set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))

#      [,1]
# [1,]    0
# [2,]    3
# [3,]    7

We can extend this to consider n source populations:

set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)

Above, p is a matrix of probabilities of moving from one population (column) to another (row), and X is a vector of initial population sizes. The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with:

sapply(seq_len(ncol(p)), function(i) {
  rmultinom(1, X[i], p[, i])  
})

#      [,1] [,2] [,3]
# [1,]   19   42   11
# [2,]    8   18   43
# [3,]   68    6    8

where the value of the element at the ith row and jth column is the number of individuals moving from population j to population i. The rowSums of this matrix give the new population sizes.

I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. The following small example achieves this, but is inefficient with larger problems. The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances.

X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)

apply(sapply(apply(X, 2, function(x) {
  lapply(seq_len(ncol(p)), function(i) {
    rmultinom(1, x[i], p[, i])  
  })
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)

#      [,1] [,2] [,3] [,4] [,5]
# [1,]   79   67   45   28   74
# [2,]   92   99   40   19   52
# [3,]   51   45   16   21   35

Is there a way to better vectorise this problem?

jbaums
  • 27,115
  • 5
  • 79
  • 119
  • 1
    as usual, the speed answer is: have you tried with `Rcpp`? Try http://www.gnu.org/software/gsl/manual/html_node/The-Multinomial-Distribution.html – Gary Weissman Apr 16 '14 at 01:15
  • 1
    @GaryWeissman: Not yet - I'm curious as to whether pure R improvements are possible before venturing down that path. – jbaums Apr 16 '14 at 01:16
  • Thanks for the link, @GaryWeissman - do you have any idea whether that implementation would be faster/better than using `Environment stats("package:stats"); Function rmultinom = stats["rmultinom"];` within `Rcpp` as shown [here](http://lists.r-forge.r-project.org/pipermail/rcpp-devel/2011-June/002393.html)? – jbaums Apr 16 '14 at 02:01
  • I just tested the example you mentioned against the native `rmultinom` function and the latter is an order of magnitude faster. Perhaps the speedup is not so great unless Rcpp would loop through your different populations faster than your nested apply. – Gary Weissman Apr 16 '14 at 03:00
  • it seems that it is a iterative process, I don't think there will be a easy way to parallelize it. Of course, we can always optimize it using `Rcpp`. – Randy Lai Apr 16 '14 at 03:04
  • 1
    I gather you have enough computational answers. Might be worth keeping in mind though that if you have speed bottlenecks, it might just be easiest to go the analytic route. What you describe is a discrete markov chain, so it is possible to quickly get the equilibrium probabilities. Additionally, for large N, the multinomial is nearly multivariate normal and can be approximated that way. So if slow, I would go algorithm implementation over coding of algorithm implementation for this problem, as your question will likely be easier to answer. – evolvedmicrobe Apr 16 '14 at 05:39
  • @evolvedmicrobe, originally, I also thought that it is markov. But later I discovered that it is not. What jbaums wants are the random samples given different initial states. – Randy Lai Apr 16 '14 at 06:09
  • @RandyLai it is markovian, you are transitioning from one state (unsampled things), to another (sampled things). The first step of this process as a known pmf, additional steps can be analyzed analytically as well. Why not markovian? – evolvedmicrobe Apr 16 '14 at 06:57
  • I meant the columns of the program output are not markovian. They are just the random samples from different initial states. – Randy Lai Apr 16 '14 at 07:00
  • Thanks for your input, @evolvedmicrobe. As Randy suggested, I'm not interested in equilibria but rather in the individual random samples themselves. – jbaums Apr 16 '14 at 07:13

3 Answers3

7

This is a RcppGSL implementation of multi-multinomial. However, it requires you to install gsl independently....which may not be very practical.

// [[Rcpp::depends(RcppGSL)]]

#include <RcppGSL.h>
#include <gsl/gsl_rng.h>
#include <gsl/gsl_randist.h>
#include <unistd.h>            // getpid

Rcpp::IntegerVector rmn(unsigned int N, Rcpp::NumericVector p, gsl_rng* r){

    size_t K = p.size();

    Rcpp::IntegerVector x(K);
    gsl_ran_multinomial(r, K, N, p.begin(), (unsigned int *) x.begin());
    return x;             // return results vector
}

Rcpp::IntegerVector gsl_mmm_1(Rcpp::IntegerVector N, Rcpp::NumericMatrix P, gsl_rng* r){
    size_t K = N.size();
    int i;
    Rcpp::IntegerVector x(K);
    for(i=0; i<K; i++){
        x += rmn(N[i], P(Rcpp::_, i), r);
    }
    return x;
}

// [[Rcpp::export]]
Rcpp::IntegerMatrix gsl_mmm(Rcpp::IntegerMatrix X_, Rcpp::NumericMatrix P){
    int j;
    gsl_rng * r = gsl_rng_alloc (gsl_rng_mt19937);
    long seed = rand()/(((double)RAND_MAX + 1)/10000000) * getpid();
    gsl_rng_set (r, seed);
    Rcpp::IntegerMatrix X(X_.nrow(), X_.ncol());
    for(j=0; j<X.ncol(); j++){
        X(Rcpp::_, j) = gsl_mmm_1(X_(Rcpp::_,j), P, r);
    }
    gsl_rng_free (r);
    return X;
}

I also compare it with a pure R implementation and jbaums's version

library(Rcpp)
library(microbenchmark)
sourceCpp("gsl.cpp")

P = matrix(c(c(0.1,0.2,0.7),c(0.3,0.3,0.4),c(0.5,0.3,0.2)),nc=3)
X = matrix(c(c(30,40,30),c(20,40,40)), nc=2)

mmm = function(X, P){
    n = ncol(X)
    p = nrow(X)
    Reduce("+", lapply(1:p, function(j) {
        Y = matrix(0,p,n)
        for(i in 1:n) Y[,i] = rmultinom(1, X[j,i], P[,j])
        Y
    }))
}

jbaums = function(X,P){
    apply(sapply(apply(X, 2, function(x) {
      lapply(seq_len(ncol(P)), function(i) {
        rmultinom(1, x[i], P[, i])
      })
    }), function(x) do.call(cbind, x), simplify='array'), nrow(X), rowSums)
}
microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))

and this is the result

> microbenchmark(jbaums(X,P), mmm(X,P), gsl_mmm(X, P))
Unit: microseconds
          expr     min       lq  median       uq     max neval
  jbaums(X, P) 165.832 172.8420 179.185 187.2810 339.280   100
     mmm(X, P)  60.071  63.5955  67.437  71.5775  92.963   100
 gsl_mmm(X, P)  10.529  11.8800  13.671  14.6220  40.857   100

The gsl version is about 6x faster than my pure R version.

Randy Lai
  • 3,084
  • 2
  • 22
  • 23
  • 1
    I found a bug in the gsl version and it is fixed now. – Randy Lai Apr 16 '14 at 07:19
  • Are the paths to headers just relative to the directory that `gsl.cpp` is in? And are they just provided at the top of that file itself, or would I need to do something else to register them with `Rcpp`? Relative `Rcpp` rookie here. – jbaums Apr 16 '14 at 10:53
  • 1
    No, they should be in somewhere listed in the system `PATH` variable. When you install `gsl`, the corresponding location should be added to the `PATH`. If you can install `RcppGSL`, it means that `Rcpp` can find `gsl`. – Randy Lai Apr 16 '14 at 17:38
  • I've not yet been able to install gsl on Windows and test this, but your benchmark suggests it will be suitable. – jbaums Apr 21 '14 at 22:12
  • you can consider my R implementation if you cannot get gsl to work – Randy Lai Apr 21 '14 at 22:14
  • Thanks - yes, the `Reduce` approach will also be useful. It'll be interesting to see how each scales with the number of multinomial trials. The c++ version in my answer below is a slight improvement over your `Reduce`, but quickly becomes slower than my original R version as the number of trials increases. – jbaums Apr 21 '14 at 22:16
1

For example:

# make the example in Rcpp you mention:
library(Rcpp)
library(inline)
src <- 'Environment stats("package:stats");
Function rmultinom = stats["rmultinom"];
NumericVector some_p(1000, 1.0/1000);
return(rmultinom(1,1, some_p));'

fx <- rcpp(signature(), body=src)

# now compare the two
library(rbenchmark)
benchmark(fx(),rmultinom(1,1,c(1000,1/1000)),replications=10000)

#                            test replications elapsed relative user.self sys.self user.child sys.child
#    1                       fx()        10000   1.126   13.901     1.128        0          0         0
#    2 rmultinom(1, 1, c(1/1000))        10000   0.081    1.000     0.080        0          0         0
Gary Weissman
  • 3,557
  • 1
  • 18
  • 23
  • 1
    running R functions in C env usually is much slower than running R functions in R env. – Randy Lai Apr 16 '14 at 03:06
  • Thanks for this benchmark, and also for your tip, @RandyLai. Gary - how would the example you pointed to [here](http://www.gnu.org/software/gsl/manual/html_node/The-Multinomial-Distribution.html) be implemented with Rcpp? – jbaums Apr 16 '14 at 03:12
  • I can't seem to get the GSL libraries to link, but it would look something like this: `fx2 <- cxxfunction(signature(), body='return(gsl_ran_multinomial(1,1,1000,1.0/1000))', includes='#include ', plugin='RcppGSL')` – Gary Weissman Apr 16 '14 at 03:52
  • @jbaums, do you know how to install gsl? Since RcppGSL doesn't come with gsl, you have to install it by yourself. – Randy Lai Apr 16 '14 at 04:15
  • @RandyLai: Thanks for the heads up. I haven't tried yet... seems possible on Windows, but sounds like it needs to be run through Cygwin? If so, I imagine the R terminal (or preferably RStudio) would have to be run through Cygwin also... would this work? Although ideally I'm looking for a simple cross-platform solution. – jbaums Apr 16 '14 at 05:15
  • i think there is windows pre-compiled binary version so that you don't have to install Cygwin. What you need is to provide Rcpp the paths to the header files and the library files. Linux and mac users have easier life. – Randy Lai Apr 16 '14 at 06:18
  • @jbaums you may also want to check the pure R version in my answer. – Randy Lai Apr 16 '14 at 06:21
1

I've discovered that the BH package brings boost libraries to the table. This enables the following, which produces the same output as @RandyLai's gsl_mmm and as the code in my question above. (I believe enabling c++11 support should make random available without BH.)

// [[Rcpp::depends(BH)]]
#include <Rcpp.h>

#include <boost/random.hpp>
#include <boost/random/mersenne_twister.hpp>
#include <boost/random/discrete_distribution.hpp>

using namespace Rcpp;

typedef boost::mt19937 RNGType;
RNGType rng(123);


NumericVector rowSumsC(IntegerMatrix x) {
  int nrow = x.nrow(), ncol = x.ncol();
  IntegerVector out(nrow);

  for (int i = 0; i < nrow; i++) {
    double total = 0;
    for (int j = 0; j < ncol; j++) {
      total += x(i, j);
    }
    out[i] = total;
  }
  return wrap(out);
}

// [[Rcpp::export]]
IntegerMatrix rmm(IntegerMatrix X, NumericMatrix P) {
  int niter = X.ncol(), nx = X.nrow();
  IntegerMatrix out(nx, niter);
  for (int j = 0; j < niter; j++) {
    IntegerMatrix tmp(nx, nx);
    for (int i = 0; i < nx; i++) {
      for (int n = 0; n < X(i, j); n++) {
        boost::random::discrete_distribution<> dist(P(_, i));
        tmp(dist(rng), i)++;
      }
    }
    out(_, j) = rowSumsC(tmp);
  }
  return out;
}

rowSumsC provided by @hadley, here.

However, on my machine, this is considerably slower than Randy's gsl_mmm, and indeed slower than my R version when there are many trials. I suspect this is due to inefficient coding, but boost's discrete_distribution also performs each multinomial trial individually whereas this process appears vectorised when using gsl. I'm new to c++ so not sure whether this can be made more efficient.

P <- matrix(c(c(0.1, 0.2, 0.7), c(0.3, 0.3, 0.4), c(0.5, 0.3, 0.2)), nc=3)
X <- matrix(c(c(30, 40, 30), c(20, 40, 40)), nc=2)
library(BH)
microbenchmark(jbaums(X, P), rmm(X, P))

# Unit: microseconds
#          expr     min       lq  median       uq     max neval
#  jbaums(X, P) 124.988 129.5065 131.464 133.8735 348.763   100
#     rmm(X, P)  59.031  60.0850  62.043  62.6450 117.459   100
jbaums
  • 27,115
  • 5
  • 79
  • 119