5

I want to generate 2 continuous random variables Q1, Q2 (quantitative traits, each are normal) and 2 binary random variables Z1, Z2 (binary traits) with given pairwise correlations between all possible pairs of them. Say

(Q1,Q2):0.23 
(Q1,Z1):0.55 
(Q1,Z2):0.45 
(Q2,Z1):0.4 
(Q2,Z2):0.5 
(Z1,Z2):0.47 

Please help me generate such data in R.

  • why is this getting up votes? 1) lacking key information; 2) vague; 3) no effort shown; 4) "please do something for me;" 5) SIGH – rawr May 24 '14 at 15:19
  • 2
    to me (1) and (2) don't seem to be true, but (3) and (4) are. I don't know of an easy/simple way to do this, but it does seem well posed. Presumably it's getting up-votes because people are interested in the answer. – Ben Bolker May 24 '14 at 15:28
  • You can easily specify the correlation independently of the mean for the qualitative traits (e.g. make them multivariate normal), but you will probably have to say something about the mean probability of the binary random variables. – Ben Bolker May 24 '14 at 15:39
  • what if op doesn't want a mvn? dont you think that is key information? – rawr May 24 '14 at 15:40
  • Actually, I could show no effort because I don't know how to do this. I can generate correlated normal variables and correlated Bernoulli variables, but this one is beyond my reach. Also, I'd appreciate if someone helps me with the code- or suggests possible some way of doing it. – user2510085 May 24 '14 at 15:44
  • Suppose I have generated Q1 and Q2 with given _rho_ (both are normal) How can I generate Z1 and Z2 ensuring the said condition? – user2510085 May 24 '14 at 15:51
  • http://r.789695.n4.nabble.com/generating-correlated-Bernoulli-random-variables-td828988.html – Ben Bolker May 24 '14 at 16:13
  • http://stackoverflow.com/questions/10535235/generate-correlated-random-numbers-from-binomial-distributions-in-r – Ben Bolker May 24 '14 at 16:14
  • @BenBolker: Yes... the Copula method looks interesting. Gotta try that. – user2510085 May 24 '14 at 16:23

1 Answers1

2

This is crude but might get you started in the right direction.

library(copula)

options(digits=3)
probs <- c(0.5,0.5)
corrs <- c(0.23,0.55,0.45,0.4,0.5,0.47)  ## lower triangle

Simulate correlated values (first two quantitative, last two transformed to binary)

sim <- function(n,probs,corrs) {
    tmp <- normalCopula( corrs, dim=4 , "un")
    getSigma(tmp) ## test
    x <- rCopula(1000, tmp)
    x2 <- x
    x2[,3:4] <- qbinom(x[,3:4],size=1,prob=rep(probs,each=nrow(x)))
    x2
}

Test SSQ distance between observed and target correlations:

objfun <- function(corrs,targetcorrs,probs,n=1000) {
    cc <- try(cor(sim(n,probs,corrs)),silent=TRUE)
    if (is(cc,"try-error")) return(NA)
    sum((cc[lower.tri(cc)]-targetcorrs)^2)
}

See how bad things are when input corrs=target:

cc0 <- cor(sim(1000,probs=probs,corrs=corrs))
cc0[lower.tri(cc0)]
corrs
objfun(corrs,corrs,probs=probs) ## 0.112

Now try to optimize.

opt1 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs,probs=c(0.5,0.5))
opt1$value     ## 0.0208

Stops after 501 iterations with "max iterations exceeded". This will never work really well because we're trying to use a deterministic hill-climbing algorithm on a stochastic objective function ...

cc1 <- cor(sim(1000,probs=c(0.5,0.5),corrs=opt1$par))
cc1[lower.tri(cc1)]
corrs

Maybe try simulated annealing?

opt2 <- optim(fn=objfun,
              par=corrs,
              targetcorrs=corrs,probs=c(0.5,0.5),
              method="SANN")

It doesn't seem to do much better than the previous value. Two possible problems (left as an exercise for the reader are) (1) we have specified a set of correlations that are not feasible with the marginal distributions we have chosen, or (2) the error in the objective function surface is getting in the way -- to do better we would have to average over more replicates (i.e. increase n).

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453