13

I am trying to find a way to generate correlated random numbers from several binomial distributions.

I know how to do it with normal distributions (using MASS::mvrnorm), but I did not find a function applicable to binomial responses.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453
Arnaud
  • 377
  • 1
  • 2
  • 11
  • You can use package `bindata`, as nicely demo'd here: https://stat.ethz.ch/pipermail/r-help/2007-July/135575.html . (That link was on the first page returned by a Google search for 'R simulate correlated binomial variable' ...) – Josh O'Brien May 10 '12 at 13:47
  • Thanks Josh, but I need binomial not binary data ! – Arnaud May 10 '12 at 14:07
  • 2
    @Arnaud - granted I've not had any sort of caffeine or stimulant this morning, but isn't a binomial distribution a discrete distribution where the only acceptable values are "yes/no", "pass/fail", "TRUE/FALSE", in other words binary? That's what [Wikipedia seems to think too.](http://en.wikipedia.org/wiki/Binomial_distribution) – Chase May 10 '12 at 14:09
  • @chase - I agree that binary and binomial are based on "yes/no", "1/0" ...etc values, but binary data can take only two values coded 0 and 1, binomial data is a count of n successes out of x trials (i.e. a proportion of success).But following this idea ... Do binomial variables calculated as the proportion of success in samples of correlated Bernoulli variables are correlated ? – Arnaud May 10 '12 at 14:36
  • @Arnaud - I think you may be caught up in the semantics or wording differences. In your binomial example, "n successes out of x trials" means you could be counting the number of red marbles out of a bag. So every red marble gets a value of 1, all other colors have a value of 0. More generally, you can convert your "success" to a value of 1, and failure as a value of 0...or vice versa if that makes more sense for whatever it is you are counting. – Chase May 10 '12 at 15:11
  • I'm not sure I follow the last questions there. Bernoulli trials are simply a single experiment with a binary outcome, "success or failure". When you take a series of *independent* Bernoulli trials with the same probability of success, that is typically referred to as a "binomial distribution". – Chase May 10 '12 at 15:11
  • I am searching for a way to do it using R ... so I wonder how I can use the rmvbin function to obtain what I want. I need 20 random numbers sampled from 20 different binomial distributions but with a defined correlation between those numbers. – Arnaud May 10 '12 at 15:46
  • Point taken. Apparently I *still* hear 'binary' when someone says 'binomial' -- a hard habit to break ;). The solution I just now posted below could probably be greatly speeded up, but it should give you a good start. – Josh O'Brien May 10 '12 at 17:50

3 Answers3

11

You can generate correlated uniforms using the copula package, then use the qbinom function to convert those to binomial variables. Here is one quick example:

library(copula)

tmp <- normalCopula( 0.75, dim=2 )
x <- rcopula(tmp, 1000)
x2 <- cbind( qbinom(x[,1], 10, 0.5), qbinom(x[,2], 15, 0.7) )

Now x2 is a matrix with the 2 columns representing 2 binomial variables that are correlated.

Greg Snow
  • 48,497
  • 6
  • 83
  • 110
  • Thanks again Greg ... after your help with optim on the R help, you save me again ! – Arnaud May 10 '12 at 18:50
  • 5
    This is an interesting idea, but it doesn't return variables with the desired correlation. (For instance, I calculated sample correlation coefficients for 100 replicates of the above code: the average correlation was 0.724, with just 5 of the correlation coefficients greater than 0.75). – Josh O'Brien May 10 '12 at 19:04
  • 3
    @JoshO'Brien, finding a general closed form solution to generating data (other than normal) with a specified correlation is not simple. Even working out an exact method for the stated problem would possibly constitute a masters thesis. But a simple approximation is to simulate like you did, then adjust the value given to the copula, simulate again, etc. until you find a value that is close enough (the original question said that they had to be correlated, not what the correlation coef should be). – Greg Snow May 10 '12 at 19:23
  • @GregSnow -- Thanks for your response (+1). It's great to get professional confirmation that this isn't a trivial problem, as it sure didn't feel like one, the more I pondered it. (I also like your suggested solution of adjusting the copula to get the desired rho.) – Josh O'Brien May 10 '12 at 19:36
  • @Josh O'Brien and Greg Snow - Thanks both for your help and the precision. I made some simulations, and it seems that the higher the number of trials defined in qbinom, the nearer the final correlation is from the correlation defined in normalCopula – Arnaud May 11 '12 at 14:16
  • @GregSnow I just tried out your example above and `rcopula()` seems no more to be working. What should we use instead? – jay.sf Oct 29 '17 at 20:18
  • 2
    @jaySf, it looks like the copula package has changed. Use `rCopula(1000, tmp)` instead. – Greg Snow Oct 30 '17 at 16:44
9

A binomial variable with n trials and probability p of success in each trial can be viewed as the sum of n Bernoulli trials each also having probability p of success.

Similarly, you can construct pairs of correlated binomial variates by summing up pairs of Bernoulli variates having the desired correlation r.

require(bindata)

# Parameters of joint distribution
size <- 20
p1 <- 0.5
p2 <- 0.3
rho<- 0.2

# Create one pair of correlated binomial values
trials <- rmvbin(size, c(p1,p2), bincorr=(1-rho)*diag(2)+rho)
colSums(trials)

# A function to create n correlated pairs
rmvBinomial <- function(n, size, p1, p2, rho) {
    X <- replicate(n, {
             colSums(rmvbin(size, c(p1,p2), bincorr=(1-rho)*diag(2)+rho))
         })
    t(X)
}
# Try it out, creating 1000 pairs
X <- rmvBinomial(1000, size=size, p1=p1, p2=p2, rho=rho)
#     cor(X[,1], X[,2])
# [1] 0.1935928  # (In ~8 trials, sample correlations ranged between 0.15 & 0.25)

It's important to note that there are many different joint distributions that share the desired correlation coefficient. The simulation method in rmvBinomial() produces one of them, but whether or not it's the appropriate one will depend on the process that's generating you data.

As noted in this R-help answer to a similar question (which then goes on to explain the idea in more detail) :

while a bivariate normal (given means and variances) is uniquely defined by the correlation coefficient, this is not the case for a bivariate binomial

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • Thanks a lot Josh. I modified your script to allow a larger number of binomial distributions. However, as indicated in http://stat.ethz.ch/pipermail/r-help/2007-July/135575.html, rho is bounded below and above by some function of the marginal probabilities (the function fail for rho = 0.8). The use of odd ratio seems to be the solution, but ... do you know how to generalize the function proposed that allows converting odds ratio to valid binary correlation for more than 2 distributions ? – Arnaud May 10 '12 at 18:33
  • @Josh I've asked a related question, perhaps you might want to take a look at it? https://stackoverflow.com/questions/47006881/how-to-generate-a-binomial-vector-of-n-correlated-items – jay.sf Nov 04 '17 at 08:46
0

A matrix with correlated binary data can also be iterated by a genetic algorithm, e.g. implemented in the R-package 'RepeatedHighDim' (https://github.com/jkruppa/RepeatedHighDim). The algorithm is described here https://www.sciencedirect.com/science/article/abs/pii/S0010482517303499

library(RepeatedHighDim)
X0 <- start_matrix(p = c(0.5, 0.3), k = 1000) # sample size k
R <- diag(2)
R[1,2] = 0.2
R[2,1] = 0.2
X1 <- iter_matrix(X0, R = R, T = 10000, e.min = 0.00001)$Xt
cor(X1)

The package implements also two other algorithms:

X2 = rmvbinary_EP(n = 1000, R = R, p = c(0.5, 0.3))
X3 = rmvbinary_QA(n = 1000, R = R, p = c(0.5, 0.3))
cor(X2)
cor(X3)