
I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.

The data looks like this: [scatter plot of the points, showing roughly 14 well-separated clusters]

Here's how I cluster the data using R. It gives me 14 nicely separated clusters and is as easy as falling down stairs:

data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")

And here's what I do when I try it with Python:

from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot

os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM(n_components=14, n_iter=5000, covariance_type='full')
gmm.fit(data)

classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()

This assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with the number of clusters. What am I doing wrong? Are there additional parameters I need to consider?

Is there a difference in the models used by Mclust and by sklearn.mixture?

But more important: what is the best way in sklearn to cluster my data?

David DeWert

1 Answer


The trick is to set GMM's min_covar. So in this case I get good results from:

mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=1e-7)

The default value of min_covar (1e-3) is large relative to the scale of this data: it floors every covariance estimate, so the components can't tighten around the individual clusters and the fit collapses to a single cluster.
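For what it's worth, GMM was deprecated in scikit-learn 0.18 and removed in 0.20; the replacement is GaussianMixture, where reg_covar plays the role of min_covar (default 1e-6). A sketch of the same idea on synthetic data, also mirroring Mclust's G=1:20 search by picking n_components with BIC (the blob positions and parameter values here are illustrative, not taken from the question's foo.csv):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the data: three tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(100, 2))
    for center in [(0, 0), (1, 0), (0, 1)]
])

# Sweep n_components and keep the model with the lowest BIC,
# analogous to Mclust(data, G=1:20).
best_k, best_bic, best_model = None, np.inf, None
for k in range(1, 7):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type='full',
        reg_covar=1e-7,   # small floor, like min_covar in the old GMM API
        n_init=5,
        random_state=0,
    )
    gmm.fit(data)
    bic = gmm.bic(data)
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

labels = best_model.predict(data)
print(best_k, len(set(labels)))
```

Note that bic() decreases as the fit improves, so unlike the AIC behaviour described in the question, a well-scaled covariance floor lets the criterion bottom out at the true number of clusters rather than at 1.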

David DeWert