5

Supposing I want 2 vectors of binary data with specified phi coefficients, how could I simulate it with R?

For example, how can I create two vectors like x and y of specified vector length with the cor efficient of 0.79

> x = c(1,  1,  0,  0,  1,  0,  1,  1,  1)
> y = c(1,  1,  0,  0,  0,  0,  1,  1,  1)
> cor(x,y)
[1] 0.7905694
RNA
  • 146,987
  • 15
  • 52
  • 70
  • 1
    duplicate? http://stackoverflow.com/a/10540234/2105757 – ndoogan Apr 18 '13 at 17:25
  • @ndoogan -- Well, this is asking for binary data, not binomial, so it's slightly different. – Josh O'Brien Apr 18 '13 at 17:32
  • @JoshO'Brien What is the difference between a binomial model of, for example, a single coin flip, and a random binary model? – ndoogan Apr 18 '13 at 17:33
  • @ndoogan -- Only that certain simulation methods will work for the special case (binary) that won't work so well for the more general (binomial). I'll add an answer showing that. – Josh O'Brien Apr 18 '13 at 17:35

1 Answers1

14

The bindata package is nice for generating binary data with this and more complicated correlation structures. (Here's a link to a working paper (warning, pdf) that lays out the theory underlying the approach taken by the package authors.)

In your case, assuming that the independent probabilities of x and y are both 0.5:

library(bindata)

## Construct a binary correlation matrix
rho <- 0.7905694
m <- matrix(c(1,rho,rho,1), ncol=2)   

## Simulate 10000 x-y pairs, and check that they have the specified
## correlation structure
x <- rmvbin(1e5, margprob = c(0.5, 0.5), bincorr = m) 
cor(x)
#           [,1]      [,2]
# [1,] 1.0000000 0.7889613
# [2,] 0.7889613 1.0000000
Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • Seems like a fine answer. But I still don't see why a multivariate binomial model (with trials = 1) is any different. I'll grant that I should have just said Bernoulli model. But that's literally binomial with trials = 1. – ndoogan Apr 18 '13 at 17:44
  • @ndoogan. It's not any different. Did you notice, though, that the accepted answer in the question you linked to didn't actually generate binomial data with the specified correlation? I just wanted to highlight that for the special case of binary data (or binomial data with trials=1, if you prefer) there **are** nice existing tools. – Josh O'Brien Apr 18 '13 at 17:53
  • @JoshO'Brien: I noticed non-negligible deviation of the simulated correlation coefficients and the specified in many runs. Is this the best/closest we can do in simulation? – RNA Apr 18 '13 at 18:00
  • @RNA -- The simulated correlations don't appear to me to be systematically biased from 0.7905, so I think that's just expected sampling variation. – Josh O'Brien Apr 18 '13 at 18:44