Generating data from correlation matrix: the case of bivariate distributions

Question

An apparently simple problem: I want to generate two simulated variables (x, y) from a bivariate distribution with a given correlation matrix. In other words, I want two variables/vectors with values of either 0 or 1 and a defined correlation between them.

The case of normal distribution is easy with the MASS package.

library(MASS)   # mvrnorm()
library(dplyr)  # %>% and mutate()

df_norm = mvrnorm(
  100, mu = c(x=0,y=0),
  Sigma = matrix(c(1,0.5,0.5,1), nrow = 2),
  empirical = TRUE) %>% 
  as.data.frame()

cor(df_norm)
    x   y
x 1.0 0.5
y 0.5 1.0

Yet, how could I generate binary data from the given correlation matrix?

This does not work:

df_bin = df_norm %>% 
 mutate(
   x = ifelse(x<0,0,1),
   y = ifelse(y<0,0,1))

    x y
1   0 1
2   0 1
3   1 1
4   0 1
5   1 0
6   0 0
7   1 1
8   1 1
9   0 0
10  1 0

This creates binary variables, but the correlation is not even close to 0.5:

cor(df_bin)
         x         y
x 1.0000000 0.2994996
y 0.2994996 1.0000000
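
A likely explanation, offered as a sketch rather than a definitive answer: for a standard bivariate normal pair with correlation rho, cutting both variables at 0 gives binary variables whose Pearson (phi) correlation is (2/pi) * asin(rho). With rho = 0.5 that is 1/3, which is in line with the 0.30 observed above. Inverting the relation suggests a latent correlation of sin(pi * r / 2) for a target binary correlation r. This inversion only holds for a cut at 0, i.e. when both binary variables have probability 0.5. The names target_r, latent_rho and df_bin2, the seed, and the sample size are illustrative choices, not part of the original code.

```r
library(MASS)  # provides mvrnorm()

set.seed(1)          # arbitrary seed, for reproducibility only
target_r   <- 0.5    # desired correlation between the binary variables
latent_rho <- sin(pi * target_r / 2)  # about 0.707; valid for a cut at 0 only

# Latent bivariate normal sample with the adjusted correlation
z <- mvrnorm(
  1e5, mu = c(x = 0, y = 0),
  Sigma = matrix(c(1, latent_rho, latent_rho, 1), nrow = 2))

# Dichotomise at 0, as in the attempt above
df_bin2 <- data.frame(
  x = as.integer(z[, 1] >= 0),
  y = as.integer(z[, 2] >= 0))

cor(df_bin2)  # off-diagonal entries should be close to target_r
```

With only 100 rows the sample correlation will scatter noticeably around the target. Note that empirical = TRUE would pin down the correlation of the latent normal sample only, not that of the dichotomised variables.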

Ideally I would like to be able to specify the type of distribution as an argument in the function (as in the lm() function).

Any idea?

score 0 · Answer 1 · answered Oct 18 '21 at 06:46

I guessed that you weren't looking for binary, as in values of either zero or one. If that is what you're looking for, this isn't going to help.

I think what you want to look at is the construction of a binary pair-copula. You said you wanted to specify the distribution, and the package VineCopula would be a good start: after selecting the distribution, you can use the correlation to simulate the data. You mentioned lm(); Gaussian (the normal distribution) is one of the options. You can read about this approach in Lin and Chaganty (2021). The package isn't based on their work, but that's where I started when I looked for your answer.

Here I used a correlation of 0.5 with the Gaussian copula to create 100 pairs of points:

# vine-copula
library(VineCopula)

set.seed(246543)
# family = 1 selects the Gaussian copula; par is its correlation parameter
df <- BiCopSim(100, family = 1, par = .5)

head(df)
#            [,1]       [,2]
# [1,] 0.07585682 0.38413426
# [2,] 0.44705686 0.76155029
# [3,] 0.91419758 0.56181837
# [4,] 0.65891869 0.41187594
# [5,] 0.49187672 0.20168128
# [6,] 0.05422541 0.05756005

Indeed I do want two variables with values of 0 or 1 and a defined correlation between them. I made it clearer in the question now. — Rtist, Oct 18 '21 at 07:06
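
Given that clarification, one direct route is to skip the latent distribution and sample the four possible (x, y) outcomes from their joint probability table. This is a sketch under stated assumptions, not something taken from either post above. For success probabilities px and py and a target correlation r, the probability of (1, 1) is px * py + r * sqrt(px * (1 - px) * py * (1 - py)), and the other three cells follow from the margins. The function name rbin_pair, its arguments, and the tolerance are hypothetical and chosen for illustration only. Not every r is attainable for every pair of margins, so the sketch stops if a cell probability would be negative.

```r
# Hypothetical helper: draws n pairs of 0/1 values with success
# probabilities px, py and population (phi) correlation r.
rbin_pair <- function(n, r, px = 0.5, py = 0.5) {
  # Joint probability of (1, 1) implied by the margins and the correlation
  p11 <- px * py + r * sqrt(px * (1 - px) * py * (1 - py))
  # Cell probabilities in the order (0,0), (1,0), (0,1), (1,1)
  probs <- c(1 - px - py + p11, px - p11, py - p11, p11)
  if (any(probs < -1e-12)) {
    stop("this correlation is not attainable with these margins")
  }
  probs <- pmax(probs, 0)  # clear tiny negative rounding errors
  # Draw a cell index 0..3 per row, then decode it into x and y
  cell <- sample(0:3, n, replace = TRUE, prob = probs)
  data.frame(x = cell %% 2L, y = cell %/% 2L)
}

set.seed(42)  # arbitrary seed, for reproducibility only
df_bin3 <- rbin_pair(1e5, r = 0.5)
cor(df_bin3)  # off-diagonal entries should be close to 0.5
```

The correlation is matched in the population, so small samples will still vary around the target. Unequal margins can be requested through px and py, for example rbin_pair(1e5, r = 0.2, px = 0.3, py = 0.6).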
