
I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.

The data looks like this: [scatter plot of the points, showing roughly 14 well-separated clusters]

Here's how I cluster the data using R. It gives me 14 nicely separated clusters and is as easy as falling down stairs:

data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")

And here's what I do when I try it with Python:

from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot

os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM(n_components=14, n_iter=5000, covariance_type='full')
gmm.fit(data)

classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()

This assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with the number of clusters. What am I doing wrong? Are there additional parameters I need to consider?

Is there a difference in the models used by Mclust and by sklearn.mixture?

But more important: what is the best way in sklearn to cluster my data?

David DeWert

1 Answer


The trick is to set GMM's min_covar. So in this case I get good results from:

mixture.GMM(n_components=14, n_iter=5000, covariance_type='full', min_covar=1e-7)

The default value of min_covar (1e-3) is large relative to the scale of this data: it floors every covariance estimate, so the components can't tighten around the individual clusters and the fit collapses to a single cluster.
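For what it's worth, GMM was deprecated in scikit-learn 0.18 and removed in 0.20; the replacement is GaussianMixture, where reg_covar plays the role of min_covar (default 1e-6). A sketch of the same idea on synthetic data, also mirroring Mclust's G=1:20 search by picking n_components with BIC (the blob positions and parameter values here are illustrative, not taken from the question's foo.csv):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for the data: three tight, well-separated 2-D blobs.
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=center, scale=0.05, size=(100, 2))
    for center in [(0, 0), (1, 0), (0, 1)]
])

# Sweep n_components and keep the model with the lowest BIC,
# analogous to Mclust(data, G=1:20).
best_k, best_bic, best_model = None, np.inf, None
for k in range(1, 7):
    gmm = GaussianMixture(
        n_components=k,
        covariance_type='full',
        reg_covar=1e-7,   # small floor, like min_covar in the old GMM API
        n_init=5,
        random_state=0,
    )
    gmm.fit(data)
    bic = gmm.bic(data)
    if bic < best_bic:
        best_k, best_bic, best_model = k, bic, gmm

labels = best_model.predict(data)
print(best_k, len(set(labels)))
```

Note that bic() decreases as the fit improves, so unlike the AIC behaviour described in the question, a well-scaled covariance floor lets the criterion bottom out at the true number of clusters rather than at 1.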

David DeWert