9

I would like to simulate a mixture data, say 3 dimensional data. I would like to have 2 different components between each two variables.

That is, simulate mixture data (V1 and V2) where the dependencies between them is two different normal components. Then, between V2 and V3 another two normal components. So, I will have 3d data, the dependence between first and a second variable are a mixture of two normals. And the dependence between the second and third variable are mixture of another two different components.

Another way to explain my question:

Suppose I would like to generate a mixture data as follows:

1- 0.3 normal(0.5,1) + 0.7 normal(2,4) # hence here I will get a bivariate mixture data generated from two different normal (two components of the mixture model), the sum of mixuter weight is 1.

Then, I would like to get another variable as follows:

2- 0.5 normal(2,4) # this is the second variable on the first simulate + 0.5 normal(2,6)

so here, I get 3d simulated mixture data, where V1 and V2 are generated by two different mixture components, and the V2 and V3 are generated by another different mixture components.

This is how to generate the data in r: ( I belive it is not generate a bivariate data)

N <- 100000                 

#Sample N random uniforms U
U <- runif(N)

#Variable to store the samples from the mixture distribution                                             
rand.samples <- rep(NA,N)

#Sampling from the mixture
for(i in 1:N) {
    if(U[i]<.3) {
        rand.samples[i] <- rnorm(1,1,3)
    } else {
        rand.samples[i] <- rnorm(1,2,5)
    }
}

so if we generate mixture bivariate data (two variables) then how can extend this to have 4 or 5 variables, where V1 and V2 are generated from two different normals (the dependencies structures between them is a mixture of two normals) and then V3 will generated from another another differetn normal and then compine with V2. That is when we plot the V2 ~ V3 we will find that the dependencies structures between them is a mixutre of two normals and so on.

  • Specify which model generates V1-3. – Roman Luštrik Apr 29 '17 at 09:10
  • If I understand your comments, all are normal with different mean and standard deviation. If not what you asking about, could you please make it clears for me? Thanks –  Apr 29 '17 at 09:12
  • It's not clear what you mean by "components". I have assumed it means model, i.e. V1 is generated by `y ~ x1 + x2^2 + x3`... – Roman Luštrik Apr 29 '17 at 09:17
  • 2
    In a mixture, we call the distributions that generated this mixture data, a component of the mixture model. That is, normal distribution with mean 1 and standard deviation 2 is one component and another normal with different mean is another component and so on. –  Apr 29 '17 at 09:21

1 Answers1

5

I am not really sure I have correctly understood the question but I will give it a try. You have 3 distributions D1, D2 and D3. From these three distributions you would like to create variables that use 2 out of those 3 but not the same ones.

Since I do not know how the distributions should be combined I used the flags using the binomial distribution (its a vector of length equal to 200 with 0s and 1s) to determine from which distribution each value will be picked (You can change that if that is not how you want it done).

D1 = rnorm(200,2,1)
D2 = rnorm(200,3,1)
D3= rnorm(200,1.5,2)

In order to created the mixed distribution we can use the rbinom function to create a vector of 1s and 0s according to a selected probability. This is a way to have some values from both distributions.

var_1_flag <- rbinom(200, size=1, prob = 0.3)
var_1 <- var_1_flag*D1 + (1 - var_1_flag)*D2

var_2_flag <- rbinom(200, size=1, prob = 0.7)
var_2 <- var_2_flag*D2 + (1 - var_2_flag)*D3

var_3_flag <- rbinom(200, size=1, prob = 0.6)
var_3 <- var_3_flag*D1 + (1 - var_3_flag)*D3

In order to see which values come from which distribution you can do the following:

var_1[var_1_flag] #This gives you the values in the mixed distribution that come from the first distribution (D1)

var1[!var_1_flag] #This gives you the values in the mixed distribution that come from the second distribution (D2)

Since I found this a bit manual and I am guessing you might want to change the variables, you might want to use the function below to get the same results

create_distr <- function(observations, mean1, sd1, mean2, sd2, flag_prob) {

    flag <- rbinom(observations, size=1, prob = flag_prob)
    my_distribution <- flag * rnorm(observations, mean1, sd1) + (1 - flag) * rnorm(observations, mean2, sd2)
}

var_1 <- create_distr(200, 2, 1, 3, 1, 0.5)
var_2 <- create_distr(200, 3, 1, 1.5, 2, 0.7)
var_3 <- create_distr(200, 2, 1, 1.5, 2, 0.6)

If you would like to have more than two variables (distributions) to the mix you could extend the code you have provided as follows:

N <- 100000                 

#Sample N random uniforms U
U <- runif(N)

#Variable to store the samples from the mixture distribution                                             
rand.samples <- rep(NA,N)

for(i in 1:N) {
  if(U[i] < 0.3) {
    rand.samples[i] <- rnorm(1,1,3)
  } else if (U[i] < 0.5){
    rand.samples[i] <- rnorm(1,2,5)
  } else if (U[i] < 0.8) {
    rand.samples[i] <- rnorm(1,5,2)
  } else {
    rand.samples[i] <- rt(1, 2)
  }
}

This way every element is taken one at a time from each distribution. If you want to have the same result but without taking each element one at a time you can do the following:

N <- 100000                 

#Sample N random uniforms U
U <- runif(N)

#Variable to store the samples from the mixture distribution                                             
rand.samples <- rep(NA,N)

D1 = rnorm(N,1,3)
D2 = rnorm(N,2,5)
D3= rnorm(N,5,2)
D4 = rt(N, 2)

rand.samples <- c(D1[U < 0.3], D2[U >= 0.3 & U < 0.5], D3[U >= 0.5 & U < 0.8], D4[U >= 0.8])

Which corresponds to 0.3*normal(1,3) + 0.2*normal(2,5) + 0.3*normal(5,2) + 0.2*student(2 degrees of freedom)

If you want to create two mixtures, but in the second keep the same values from the normal distribution you can do the following:

mixture_1 <- c(D1[U < 0.3], D2[U >= 0.3 ])
mixture_2 <- c(D1[U < 0.3], D3[U >= 0.3])

This will use the exact same elements from normal(1,3) in both mixtures. The trick is to not recalculate the rnorm(N,1,3) every time you use it. And in both cases the samples are composed from 30% roughly coming from the first normal (D1) and 70% roughly from the second distribution. For example:

    set.seed(1)
    N <- 100000   
    U <- runif(N)
    > prop.table(table(U < 0.3))

 FALSE   TRUE 
0.6985 0.3015 

30% of the values in the U vector is below 0.3.

User2321
  • 2,952
  • 23
  • 46
  • Thank you for your answer. My question is how can I simulate a multivariate mixture data where between each two variables I have different dependencies. That, is: between V1 and V2, I have mixture of Gaussian and t-student, between V2 and V3, I have a mixture of bionomial and Gusssian and so on. –  Jun 02 '17 at 21:41
  • Hi! First of all, if I have not really answered your question do not flag it as an accepted answer, it might discourage somebody else from trying to answer it! :) In your updated question, V1 corresponds to 0.3 normal(0.5,1) + 0.7 normal(2,4), V2 corresponds to 0.5 normal(2,4) + 0.5 normal(2,6) and V3 is what? – User2321 Jun 03 '17 at 12:05
  • Hi, oh! I think 0.3 normal(0.5,1) + 0.7 normal(2,4) will give me 2 variables where the dependience structure between them is a mixture of two normals normal(0.5,1) + normal(2,4) –  Jun 03 '17 at 16:17
  • normal(0.5,1) + 0.7 normal(2,4) means that our observations are belongs to two normals (30%) of the observation belongs to the first normal and other (70%) belongs to another one. Hence, I think we have two variables and we would like to know from where each observation was drawn. –  Jun 04 '17 at 08:27
  • 0.3*normal(1,2) doen't mean that we multiple by 0.3. It means that 30% of our observation belongs to normal distribution with mean=1 and SD= 2 and this is the way how we express the formula of mixture (in algebra). Please see my edit. –  Jun 05 '17 at 10:28
  • Thank you so much for your help. I really would like to accept your answer. However, I understand this and tried it long time a go and does not give me what I want. So, suppose I have this 0.7*normal(1,3) + 0.3*normal(2,5) as bivariate data. Then, I repeat the code again to have another bivariate data where the first Variable here is the 0.3*normal(2,5). Then the new bivariate data will be like this 0.3*normal(2,5)+0.7*normal(5,2). so, I will get two different bivariate data. Then combine them since V2 is the common variable between these two data. –  Jun 05 '17 at 12:24
  • Yes. That is my question. Then, say I will generate 4 variables, where the dpendence between V1 and V2 is mixture of normals. V2 (0.3normal(2,5) and V3 are mixture of normal and t (for example). –  Jun 05 '17 at 12:48
  • just for your information. The sum of mixture weight (e.g. 0.7 and 0.3) must sum to one. –  Jun 05 '17 at 13:04
  • Yes yes I know that and in the answer it is always true. – User2321 Jun 05 '17 at 13:16