This question pertains to efficient sampling from multinomial distributions with varying sample sizes and probabilities. Below I describe the approach I have used, but wonder whether it can be improved with some intelligent vectorisation.
I'm simulating dispersal of organisms amongst multiple populations. Individuals from population j
disperse to population i
with probability p[i, j]
. Given an initial abundance of 10 in population 1, and probabilities of dispersal c(0.1, 0.3, 0.6)
to populations 1, 2, and 3, respectively, we can simulate the dispersal process with rmultinom
:
set.seed(1)
rmultinom(1, 10, c(0.1, 0.3, 0.6))
# [,1]
# [1,] 0
# [2,] 3
# [3,] 7
We can extend this to consider n
source populations:
set.seed(1)
n <- 3
p <- replicate(n, diff(c(0, sort(runif(n-1)), 1)))
X <- sample(100, n)
Above, p
is a matrix of probabilities of moving from one population (column) to another (row), and X
is a vector of initial population sizes. The number of individuals dispersing between each pair of populations (and those remaining where they are) can now be simulated with:
sapply(seq_len(ncol(p)), function(i) {
rmultinom(1, X[i], p[, i])
})
# [,1] [,2] [,3]
# [1,] 19 42 11
# [2,] 8 18 43
# [3,] 68 6 8
where the value of the element at the i
th row and j
th column is the number of individuals moving from population j
to population i
. The rowSums
of this matrix give the new population sizes.
I'd like to repeat this many times, with constant probability matrix but with varying (pre-defined) initial abundances. The following small example achieves this, but is inefficient with larger problems. The resulting matrix gives the post-dispersal abundance in each of three populations for each of 5 simulations for which population had different initial abundances.
X <- matrix(sample(100, n*5, replace=TRUE), nrow=n)
apply(sapply(apply(X, 2, function(x) {
lapply(seq_len(ncol(p)), function(i) {
rmultinom(1, x[i], p[, i])
})
}), function(x) do.call(cbind, x), simplify='array'), 3, rowSums)
# [,1] [,2] [,3] [,4] [,5]
# [1,] 79 67 45 28 74
# [2,] 92 99 40 19 52
# [3,] 51 45 16 21 35
Is there a way to better vectorise this problem?