5

I am using mclust to see various clusters in my data set using various numbers of input (X,Y,Z,R, and S in the script below):

e.g.

elements<-cbind(X,Y,Z,R,S)
dataclust<-Mclust(elements)

I just find out that the order of the input parameters matters and affect the results; in other words elements <- cbind(X,Y,Z,R,S) gives a different clusters than say elements-<cbind(Y,Z,X,R,S). My understanding is that all the input parameters have the same weight and importance in the clustering analysis. am I wrong or is it a bug?

I have seen that in R 2.15.3 and 2 other R versions.

Any comment on or explanation of the above is appreciated.

rcs
  • 67,191
  • 22
  • 172
  • 153
  • Are these (X,Y,Z,R,S) whole instances used in the algorithm? How many clusters are appear in "dataclust" after algorithm run? – Gökhan Çoban Dec 05 '13 at 22:40
  • I cannot reproduce your problem. Please provide an example, as described [here](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – TWL Dec 06 '13 at 03:09
  • This still appears to be an issue that can come up. I realize this is an old thread, but I saw a 'good' example was never provided, so I'm posting one in hopes of getting clarification on how/why this issue arises. This link is to an an example. Mclust is run on the variables X1, X3, X5, X7, X9, and X11 in two different orders, producing two different solutions. https://gist.github.com/codydh/bb027d23705b5c2a61ec00afe07fe03e – Cody Jun 19 '17 at 20:35
  • Could not reproduce the two different solutions provided by @Cody. I downloaded a fresh install of `mclust` and ran the commands (see [my gist](https://gist.github.com/codydh/bb027d23705b5c2a61ec00afe07fe03e) ) but both provided exact clustering solutions. Initially I thought the comment by @Anony-Mousse made sense because of the **random** nature of the Gaussian Model, but from the [documentation](https://cran.r-project.org/web/packages/mclust/mclust.pdf), `Mclust` computes the most optimal model over various ones, hence it must provide same results (`Mclust` tries 9 different models) – hongsy Jun 24 '17 at 01:20
  • By "9 different models", what I mean is it tries to cluster from `G=1` cluster to `G=9` clusters. – hongsy Jun 24 '17 at 01:24
  • relevant `seesionInfo()`: `R version 3.3.2 (2016-10-31) Platform: x86_64-apple-darwin13.4.0 (64-bit) Running under: macOS Sierra 10.12.5`, `mclust_5.3` – hongsy Jun 24 '17 at 01:47
  • @Cody, could you please provide sessionInfo for your gist? – m-dz Jun 24 '17 at 12:45

4 Answers4

2

Unfortunately, I am unable to comment or edit my previous comment, so I'm posting an answer. @m-dz set me on a path that I think has revealed a possible answer. Specifically:

> library(mclust)
    __  ___________    __  _____________
   /  |/  / ____/ /   / / / / ___/_  __/
  / /|_/ / /   / /   / / / /\__ \ / /   
 / /  / / /___/ /___/ /_/ /___/ // /    
/_/  /_/\____/_____/\____//____//_/    version 5.2.2
Type 'citation("mclust")' for citing this R package in publications.

> testDataA <- read.table("http://fimi.ua.ac.be/data/chess.dat")

> summary(Mclust(subset(testDataA, select = c(V1, V3, V5, V7, V9, V11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> summary(Mclust(subset(testDataA, select = c(V11, V9, V1, V3, V5, V7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EII (spherical, equal volume) model with 9 components:

 log.likelihood    n df      BIC       ICL
      -3597.466 3196 63 -7703.32 -7735.137

Clustering table:
  1   2   3   4   5   6   7   8   9 
774 150 752 486 227 224 238 178 167 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] mclust_5.2.2

loaded via a namespace (and not attached):
[1] tools_3.3.2

As you can see, this produces two solutions that match @m-dz's! However, what I was previously doing was loading the psych package. I'm seeing now this is masking sim from mclust. I'm guessing this then causes the incorrect solutions:

> library(psych)

Attaching package: ‘psych’

The following object is masked from ‘package:mclust’:

    sim

> testDataB <- read.file(f = "http://fimi.ua.ac.be/data/chess.dat")
Data from the .data file http://fimi.ua.ac.be/data/chess.dat has been loaded.

> summary(Mclust(subset(testDataB, select = c(X1, X3, X5, X7, X9, X11))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 2 components:

 log.likelihood    n df      BIC      ICL
       3547.068 3195 49 6698.738 6692.126

Clustering table:
   1    2 
2759  436 

> summary(Mclust(subset(testDataB, select = c(X11, X9, X1, X3, X5, X7))))
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm 
----------------------------------------------------

Mclust EEV (ellipsoidal, equal volume and shape) model with 6 components:

 log.likelihood    n  df      BIC      ICL
       18473.94 3195 137 35842.37 35834.51

Clustering table:
  1   2   3   4   5   6 
431 932 210 881 524 217 

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS Sierra 10.12.5

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] psych_1.6.9  mclust_5.2.2

loaded via a namespace (and not attached):
[1] parallel_3.3.2 tools_3.3.2    foreign_0.8-67 mnormt_1.5-5  
Cody
  • 273
  • 1
  • 2
  • 8
  • It's not the `psych` package itself, looks like it is the `read.file` function, which is reading in the first row as a header and messing with the rest of the script - see my edit. – m-dz Jun 25 '17 at 17:56
  • Small update, looks like columns 1:6 give different results. Need to investigate further. – m-dz Jun 25 '17 at 18:25
1

Often, Gaussian Mixture Model clustering is initialized randomly, as it will only find a local maximum.

Don't expect it to return the same result all the time.

Has QUIT--Anony-Mousse
  • 76,138
  • 12
  • 138
  • 194
  • The package may have been updated over the years. Using the main `Mclust` function in the package gives "The **optimal model** according to BIC for EM initialized by **hierarchical clustering**..." Hierarchical clustering **should** in theory give same outputs over repeated runs. – hongsy Jun 24 '17 at 01:25
  • Hmmmm. The Mclust package apparently follows an unusual approach here. Hierarchical clustering is very expensive to use for initialization. I'm mostly using the ELKI EM implementation, and it uses the same initialization as k-means. – Has QUIT--Anony-Mousse Jun 24 '17 at 10:24
  • But even hclust may yield different results, although I'd expect this to be rare (on discrete or low resolution fatal only), and to happen on permutation of rows not columns. – Has QUIT--Anony-Mousse Jun 24 '17 at 10:27
1

Edit:

My previous edit re. read.file treating the first row as a header is correct, but this is not the case. Apparently columns 1 to 6, regardless whether called V1, V2, V3, V4, V5, V6 or X1, X3, X5, X7, X9, X11, do give different results. I will investigate further slightly later.

library(mclust)
library(psych)
library(magrittr)
# sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# Matrix products: default
# 
# locale:
#   [1] LC_COLLATE=English_United Kingdom.1252 
# [2] LC_CTYPE=English_United Kingdom.1252   
# [3] LC_MONETARY=English_United Kingdom.1252
# [4] LC_NUMERIC=C                           
# [5] LC_TIME=English_United Kingdom.1252    
# 
# attached base packages:
#   [1] stats     graphics  grDevices utils     datasets  methods  
# [7] base     
# 
# other attached packages:
#   [1] magrittr_1.5 psych_1.7.5  mclust_5.3  
# 
# loaded via a namespace (and not attached):
#   [1] compiler_3.4.0    parallel_3.4.0    tools_3.4.0      
# [4] foreign_0.8-68    rstudioapi_0.6    mdaddins_0.0.0001
# [7] nlme_3.1-131      mnormt_1.5-5      grid_3.4.0       
# [10] lattice_0.20-35  

testData_rt <- read.table("http://fimi.ua.ac.be/data/chess.dat")
testData_rf <- read.file("http://fimi.ua.ac.be/data/chess.dat", header = FALSE)  # Without this read.file is skipping first row
testData_rf_head <- read.file("http://fimi.ua.ac.be/data/chess.dat")
testData_rf_head %<>%set_names(names(testData_rf))
testData_rf_head_V2 <- read.file("http://fimi.ua.ac.be/data/chess.dat")

testData_rt %>% str()
testData_rf %>% str()
testData_rf_head %>% str()

# Same res.:
summary(Mclust(subset(testData_rt, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rt, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf, select = c(V11, V9, V1, V3, V5, V7))))

# Same res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V3, V5, V7, V9, V11))))
summary(Mclust(subset(testData_rf_head, select = c(V11, V9, V1, V3, V5, V7))))

# Different res.:
summary(Mclust(subset(testData_rf_head_V2, select = c(X1, X3, X5, X7, X9, X11))))
summary(Mclust(subset(testData_rf_head_V2, select = c(X11, X9, X1, X3, X5, X7))))

# Different res.:
summary(Mclust(subset(testData_rf_head, select = c(V1, V2, V3, V4, V5, V6))))
summary(Mclust(subset(testData_rf_head, select = c(V6, V5, V1, V2, V3, V4))))

Old answer:

Have done my best to investigate the issue:

  • Current R (3.4.0) and mclust (5.3) tested: order and seed had no effect;
  • mclust 4.2 (current on Dec 5 '13 when the question was asked), the same, no effect;
  • R 2.25.3 mentioned by @user3068797: could not compile mclust 4.2, gave up as it would take too long to debug this;
  • @Cody did not provide a sessionInfo(), so do not know where to dig more.

To the code:

library(mclust)
sessionInfo()
# R version 3.4.0 (2017-04-21)
# Platform: x86_64-w64-mingw32/x64 (64-bit)
# Running under: Windows >= 8 x64 (build 9200)
# 
# other attached packages:
# [1] mclust_5.3

testData <- read.table("http://fimi.ua.ac.be/data/chess.dat")

## Seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
#   Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 



## Question asked asked Dec 5 '13
## mclust 4.2 modified on 2013-07-19, 4.3 introduced on 2014-03-31
devtools::install_version(package = 'mclust', version = 4.2)

## Fix mclust:::unchol
# mclust:::unchol
unchol <- function(x, upper = NULL)
{
  if(is.null(upper)) {
    upper <- any(x[row(x) < col(x)])
    lower <- any(x[row(x) > col(x)])
    if(upper && lower)
      stop("not a triangular matrix")
    if(!(upper || lower)) {
      x <- diag(x)
      return(diag(x * x))
    }
  }
  dimx <- dim(x)
  storage.mode(x) <- "double"
  .Fortran("uncholf",
           as.logical(upper),
           x,
           as.integer(nrow(x)),
           as.integer(ncol(x)),
           integer(1),
           PACKAGE = "mclust")[[2]]
}
assignInNamespace("unchol", unchol, ns = "mclust")
# fixInNamespace(unchol, pos = "package:mclust")
mclust:::unchol

## Again, seed and order have no effect:
# set.seed(1)
set.seed(2)
summary(Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
#  log.likelihood    n df      BIC       ICL
#       -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V1, V3, V5, V7, V9, V11))) :
#   optimal number of clusters occurs at max choice

set.seed(1)
# set.seed(2)
summary(Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))))
# ----------------------------------------------------
# Gaussian finite mixture model fitted by EM algorithm 
# ----------------------------------------------------
#   
# Mclust EII (spherical, equal volume) model with 9 components:
#   
# log.likelihood    n df      BIC       ICL
#      -3597.466 3196 63 -7703.32 -7735.137
# 
# Clustering table:
#   1   2   3   4   5   6   7   8   9 
# 774 150 752 486 227 224 238 178 167 
# 
# Warning messages:
#   1: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
#   best model occurs at the min or max # of components considered
# 2: In Mclust(subset(testData, select = c(V11, V9, V1, V3, V5, V7))) :
#   optimal number of clusters occurs at max choice



## Check R 2.15.3 from https://cran.r-project.org/bin/windows/base/old/2.15.3/
## Trued with fixing con <- gzcon(url("http://cran.rstudio.com/src/contrib/Meta/archive.rds", 'rb')), but compile...
devtools::install_version(package = 'mclust', version = 4.2)

Edit:

Fortran functions unchol (mclust 4.2) and uncholf (mclust 5.3) do not differ: uncholf 5.3, unchol 4.3

Mclust does differ, but provide same results, so I guess changes were simply fixing errors etc.: Mclust 5.3 , Mclust 4.3

m-dz
  • 2,342
  • 17
  • 29
  • 1
    Thanks for this! I've given Cody the Bounty seeing as he doesn't have enough points to effectively operate his SO account. Hope this is okay! – P i Jun 26 '17 at 15:02
  • I was able to replicate this same issue. The interesting part here is really the difference between your third "Same result" example and your second "Different result" example, given that they're using the same data frame. Using the 1st, 3rd, 5th, 7th, 9th, and 11th variables in two orders produces the same solution, but the 1st, 2nd, 3rd, 4th, 5th, and 6th variables in two orders does not. – Cody Jun 26 '17 at 15:40
  • @Pi, yeap, it is fine. I should shed a bit of light to the question in the next 5-6 hours. – m-dz Jun 26 '17 at 19:09
  • @Cody, it is the same data frame, but different columns. The header row really does not make the difference. To quickly sum up what I have found - it is all about `mclust::me()` function with model type 'EEV' (off the top of my head). Will see later! – m-dz Jun 26 '17 at 19:12
  • @m-dz: Looking forward to what you've found. I have discovered that randomizing the row order of my data.frame and then passing it to Mclust also produces different solutions. Perhaps what you've found will shed some light on that. – Cody Jun 27 '17 at 12:41
  • Apologies, I got really busy at work and just could not fully wrap my head around this and transfer my findings into a neat piece of code... If you check the output of `Mclust()` there is an object called `BIC` with all BIC scores and they differed only for one of the models (`EEV` if I remember right). The function calculating them is `mclust::me()`, would be worth to dig a bit more into it. – m-dz Jun 29 '17 at 10:14
1

I notice this is a very old thread but I think it's still worth to post an (official) answer. The issue was addressed in mclust 5: Clustering, Classification and Density Estimation Using Gaussian Finite Mixture Models from the R Journal with suggested solutions ([https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf][1]), Pages 305-307. In short, "A problem with the MBHAC approach may arise in the presence of coarse data, resulting from the discrete nature of the data or from continuous data that are rounded when measured. In this case, ties must be broken by choosing the pair of entities that will be merged. This is often done at random, but regardless of which method is adopted for breaking ties, this choice can have important consequences because it changes the clustering of the remaining observations. Moreover, the final EM solution may depend on the ordering of the variables."

Weiyu Qiu
  • 11
  • 1