1

I have the written the following code in R with an aim to generate correlated random variables which follow the beta distribution

#objective is to generated correlated 
#beta distributed data 

library(MASS)
library(faux)

generate.beta.parameters <- function(x, v) {
  alpha = ((1 - x) / v - 1/ x) * x^2
  beta = alpha * (1 / x - 1)
  return (c(alpha, beta))
}

x1 <- 0.896 
v1 <- 0.001 
x2 <- 0.206 
v2 <- 0.004

b1 <- generate.beta.parameters(x1, v1)
b2 <- generate.beta.parameters(x2, v2)

alpha1 <- b1[1]
beta1 <- b1[2]
alpha2 <- b2[1]
beta2 <- b2[2]


#create mean vector 
mu = c(0, 0)

#create variance covariance matrix 
sigma <- rbind(c(1, 0.2), c(0.2, 1))

#generate 1000 random numbers 
df <- mvrnorm(n = 1000, mu = mu, Sigma = sigma)

df.beta <- matrix(nrow = nrow(df), ncol = ncol(df))
#normal to uniform 
df.beta[,1] = norm2beta(df[,1], alpha1, beta1)
df.beta[,2] = norm2beta(df[,2], alpha2, beta2)

df.beta <- as.data.frame(df.beta)
cor(df.beta)

However, on running this code, the correlation outputs are not the same as expected. Here is the output on my machine:

          V1        V2
V1 1.0000000 0.1549214
V2 0.1549214 1.0000000

Could you throw some light on this?

I do have this link, where random numbers have been generated, but it is in a different software i.e. SAS.

Indian
  • 977
  • 3
  • 12
  • 24

1 Answers1

3

You have two problems. The first is that this is a random sample, so you don't expect the correlation to come out exactly equal to the specified value. Let's try running 500 samples and see what distribution of correlations we get:

testcor <- function(rho = 0.2, n = 1000, seed = NULL) {
    if (!is.null(seed)) set.seed(seed)
    sigma <- rbind(c(1, rho), c(rho, 1))
    df <- mvrnorm(n = n, mu = mu, Sigma = sigma)
    df.beta <- matrix(nrow = nrow(df), ncol = ncol(df))
    df.beta[,1] = norm2beta(df[,1], alpha1, beta1)
    df.beta[,2] = norm2beta(df[,2], alpha2, beta2)
    cor(df.beta)[1,2]
}
set.seed(101)
corvec <- replicate(500, testcor())
par(las = 1)
hist(corvec, breaks = 50, main ="")
abline(v = 0.2, col = 2, lwd = 2)

enter image description here

The value of 0.155 that you got is toward the lower end of this distribution, but not extreme. Overall the correlation might be biased slightly low - the mean is 0.196 - but it looks OK. If you need the simulated correlation to be exactly 0.2 for some reason, that will be more challenging - you can set empirical=TRUE in mvrnorm to get means and variances that exactly match the specified values, but that leads us to the second problem.


Feeding the multivariate normal through a nonlinear transformation will change the correlation - not very much in this case (because the distribution of correlations is still centered close to 0.2), but some. This is a mathematical problem, not a programming problem. You've essentially reimplemented a Gaussian copula model; you can see in the linked example that the correlations aren't the same as those from the original multivariate normal distribution.

One method for simulating a bivariate distribution with marginal distributions that are Beta and a specified correlation is described in Minhajuddin et al 2004 (section 3), but it's significantly more challenging than the approach you've tried here.

The quick-and-dirty way to handle this would be tweak the correlation parameter of the MVN to get the desired correlation of the transformed bivariate Beta ...


Minhajuddin, Abu T. M., Ian R. Harris, and William R. Schucany. 2004. “Simulating Multivariate Distributions with Specific Correlations.” Journal of Statistical Computation and Simulation 74 (8): 599–607. https://doi.org/10.1080/00949650310001626161.

Ben Bolker
  • 211,554
  • 25
  • 370
  • 453