Expectation Maximization Issue - How to find the optimum number of gaussians within the data

Question

Plot of 2 - Dimensional data

Is there any algorithm or trick of how to determine the number of gaussians which should be identified within a set of data before applying the expectation maximization algorithm?

For example, in the above illustrated plot of 2 - Dimensional data, when I apply the Expectation Maximization algorithm, I try to fit 4 gaussians to the data and I would obtain the following result.

enter image description here

But what if I wouldn't knew the number of gaussians within the data? Is there any algorithm or trick which I could apply so that I could find out this detail?

you should read this wikipedia page: http://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set, also research "clustering validation" techniques — Amro, Jun 29 '11 at 19:22
also this question might be better suited on: http://stats.stackexchange.com/ — Amro, Jun 29 '11 at 19:27
Look at the [`kmeans`](http://www.mathworks.com/help/toolbox/stats/kmeans.html) function, if you have the stats toolbox. — abcd, Jun 29 '11 at 19:33
Consider using `DBSCAN` or `OPTICS` instead of old k-means with the difficult to set `k` parameter. — Has QUIT--Anony-Mousse, Jan 08 '12 at 11:29

score 8 · Answer 1 · answered Jun 29 '11 at 22:16

This might be a bit of a retread, since others already linked the wiki article of the actual cluster number determination, but I found that article a lil overly dense, so I thought I'd provide a brief, intuitive answer:

Basically, there isn't a universally 'correct' answer for the number of clusters in a data set -- the fewer clusters, the smaller the description length but the higher the variance, and in all non-trivial datasets the variance won't completely go away unless you have a Gaussian for each point, which renders the clustering useless (this is a case of the more general phenomena known as the 'futility of bias free learning': A learner that makes no a priori assumptions regarding the identity of the target concept has no rational basis for classifying any unseen instances).

So you basically have to pick some feature of your dataset to maximize via the number of clusters (see the wiki article on inductive bias for some example features)

In other sad news, in all such cases finding the number of clusters is known to be NP-hard, so the best you can expect is a good heuristic approach.

score 1 · Accepted Answer · answered Jun 29 '11 at 19:28

1

Wikipedia has an article on this subject. I am not too familiar with the subject, but I've been told that clustering algorithms that don't require specifying the number of clusters instead need some density information about the clusters or some minimum distance between clusters.

answered Jun 29 '11 at 19:28

David Brown

13,336
4
38
55

1

Use e.g. `DBSCAN` or `OPTICS`. See Wikipedia for details. – Has QUIT--Anony-Mousse Jan 08 '12 at 11:28

score 1 · Answer 3 · answered Aug 14 '11 at 16:49

1

Non parametric bayesian clustering is now getting lot of attention. You dont need to specify clusters.
Autoclass is algorithm that automatically identify number of clusters from mixture.

answered Aug 14 '11 at 16:49

Ejaz

11
1

Expectation Maximization Issue - How to find the optimum number of gaussians within the data

3 Answers3