How to calculate marginal probabilities for generating correlated binary variables

Question

Let's say I want to create 50 binary variables of length 100 that are each correlated with each other.

For I create a correlation matrix with the specified rho's:

cor.mat <- matrix(0.2,nrow=50, ncol=50)
diag(cor.mat) <- 1

next I use rmvbin:

library(bindata)
rmvbin<-rmvbin(100, margprob=x, bincorr=cor.mat)

However, I'm not sure how to calculate the margprob argument. Can someone help?

SHould it be a vector of the sum of probabilities in each row and column?

Check out `?rmvbin` and look in the examples. There is an example of using `margprob`. — nograpes, Sep 18 '13 at 13:31
yes I saw the example, my problem is I don't know what values to fill in for my matrix. — user1984076, Sep 18 '13 at 13:33
is there something wrong with putting `x=rep(.5,50)`? That would set each of the individual RVs to be a fair coin flip. — Seth, Sep 26 '13 at 18:16

score 4 · Accepted Answer · answered Sep 27 '13 at 17:27

margprob should simply be a repeated vector of the probability that any single binary variable is 1, independent of the rest; call this value p. Assuming identically distributed variables (which given your correlation matrix seems to be the case), margprob=rep(p,50).

It should NOT be a vector of the sum of probabilities in each row and column, as the correlation matrix cannot be used to determine the marginal probabilities. If you're having trouble figuring out what the marginal probabilities for your random variables are, you will have to give more context for the problem and it would be a question more appropriate for math.stackexchange.com.

IRTFM · Answer 2 · 2013-09-30T02:07:32.770

I think the problem has been that people saw the solution as being either too simple or not properly specified. You don't actually calculate the marginal probabilities ... you specify them. Then the rmvbin function uses the specification of the marginal probabilities and the joint correlations to do the needed sampling to (on average) give joint distributions that match those specs.

library(bindata)
rmvbin<-rmvbin(100, margprob=rep(.5,50), bincorr=cor.mat)

> str(rmvbin)
 num [1:100, 1:50] 0 0 0 1 0 0 0 1 0 0 ...
 - attr(*, "dimnames")=List of 2
  ..$ : NULL
  ..$ : NULL

So to look at the sampling characteristics of this result, you can see what correlation there is with the first column:

Hmisc::describe(apply(rmvbin[,-1], 2, function(col) cor(col, rmvbin[,1]) ) )
apply(rmvbin[, -1], 2, function(col) cor(col, rmvbin[, 1])) 
      n missing  unique    Mean     .05     .10     .25     .50     .75     .90 
     49       0      38  0.2009 0.05886 0.09874 0.13309 0.19372 0.25208 0.29723 
    .95 
0.33772 

lowest : 0.03508 0.04013 0.08696 0.09874 0.10889
highest: 0.29942 0.32450 0.34653 0.40902 0.46714

So the average correlation under sampling was pretty close to the nominal value 0.2. but it did vary fairly widely.

How to calculate marginal probabilities for generating correlated binary variables

2 Answers2