I am fitting Gaussian mixture models. I have run k-means on the dataset and want to use the resulting means, variances and cluster sizes as the initial parameters for the EM algorithm in R. I found that the parameters argument is a list of 3 elements, so I tried to build one myself, but it gives me the following error:

Error in array(x, c(length(x), 1L), if (!is.null(names(x))) list(names(x), : 'data' must be of a vector type, was 'NULL'

My code

l <- kmeans(iris[,-5],centers=3)
pi <- l$size/length(iris[,1])
my <- t(l$centers)
sig <- vector("list", 3)
new <- as.data.frame(cbind(iris[,-5],l$cluster))
for (i in 1:3) {
  subdata<-subset(new[,1:4],new[,5]==i); 
  sig[[i]]<-cov(subdata)
}

par <- vector("list",3)
par[[1]] <- pi; par[[2]] <- my; par[[3]] <- sig

kk <- em(modelName = msEst$modelName, data = iris[,-5],parameters = par)

Can someone please tell me how I should assign the k-means results as initial parameters?

TylerH
Birgit
  • em is not part of base R. What package are you using? Also, what is msEst? – G5W Apr 14 '18 at 18:47
  • I used the mclust package, and msEst is just the model type (e.g. "EEE"). I couldn't find any information about how to supply initial parameters for a GMM any other way – Birgit Apr 15 '18 at 11:40

1 Answer

The following is a quick example of what you seem to be after. The main thing you have to do is get the parameters argument into the correct form: a list with named elements pro, mean and variance. The tricky bit is the variance list. The mclustVariance function helps here, since it returns a template of that list for a given model.

library(mclust)

g <- 3                          # number of mixture components
dat <- iris[, -5]
p <- ncol(dat)
n <- nrow(dat)
k_fit <- kmeans(dat, centers = g)

# The parameters are looked up by name: pro, mean, variance
par <- list()
par$pro <- k_fit$size / n       # mixing proportions
par$mean <- t(k_fit$centers)    # p x g matrix, one column per component

# p x p covariance matrix of each k-means cluster
sigma <- array(NA, c(p, p, g))
for (i in 1:g) {
  sigma[, , i] <- cov(dat[k_fit$cluster == i, ])
}

# Template of the variance list for the chosen model
variance <- mclustVariance("EEE", d = p, G = g)
par$variance <- variance
par$variance$sigma <- sigma

kk <- em(modelName = "EEE", data = dat, parameters = par)
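
An alternative that avoids building the variance list by hand is to let mclust derive the parameters from the k-means labels with a single M-step. The sketch below assumes the "EEE" model and three components, as in the question. The seed is only there to make the k-means start reproducible and is not part of the original post.

```r
library(mclust)

set.seed(1)                       # assumption: fixed seed for a reproducible k-means start
dat   <- iris[, -5]
k_fit <- kmeans(dat, centers = 3)

# unmap() turns the hard cluster labels into an n x G indicator matrix
z <- unmap(k_fit$cluster)

# One M-step converts that assignment into pro, mean and variance
ms <- mstep(modelName = "EEE", data = dat, z = z)
str(ms$parameters, max.level = 1)

# Start EM from the k-means-based parameters
kk <- em(modelName = ms$modelName, data = dat, parameters = ms$parameters)
kk$loglik
```

Because mstep() produces the variance component in the form the chosen model expects, the same code should work for other model names by changing only the modelName argument.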
kangaroo_cliff
  • Thank you! This is exactly what I wanted. I was just wondering: if I run mclustBIC on the original data, choose the model based on that, and then assign the k-means results as initial parameters, is that correct? – Birgit Apr 17 '18 at 19:22
  • @Birgit Not quite sure what you meant. There is no reason to get the BIC of initial models. Run `em` on as many (different) initial models as you like, and then compare only the final models you get. – kangaroo_cliff Apr 18 '18 at 00:11
  • I meant that for variance <- mclustVariance("EEE", d = p, G = g) I have to supply the model name ("EEE"). But how do I know which model I have? – Birgit Apr 18 '18 at 13:08
  • If the structure of the covariance matrix is not known, first fit mixture models with all interesting covariance structures, and then select the best model using something like the BIC. – kangaroo_cliff Apr 19 '18 at 01:25
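
The model selection described in the last comment could be sketched as follows, assuming three components and mclust's default set of covariance structures. Note that mclustBIC() and Mclust() initialize EM from hierarchical clustering by default, not from k-means, so this only helps choose the model name.

```r
library(mclust)

dat <- iris[, -5]

# BIC for every default covariance structure with G = 3 components
bic <- mclustBIC(dat, G = 3)
summary(bic)                  # lists the top models by BIC

# Mclust() runs the same search and keeps the best-scoring model
fit <- Mclust(dat, G = 3)
fit$modelName                 # name to pass on as modelName
```

The selected name can then be used wherever "EEE" is hard-coded, for example in mclustVariance() and em().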