Well, there are two practical solutions in common use for the problem of intelligently
selecting the number of centroids (k).
The first is to run PCA on your data. The output from PCA--the
principal components (eigenvectors) and their cumulative contribution to the variance
observed in the data--suggests a reasonable number of centroids.
(E.g., if 95% of the variability in your data is explained by the first three principal
components, then k=3 is a wise choice for k-means.)
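For instance, here is a minimal sketch of that heuristic in NumPy (the function name, the 95% default threshold, and the eigendecomposition-of-covariance approach are my own illustrative choices, not from a particular library):

```python
import numpy as np

def choose_k_by_pca(X, variance_threshold=0.95):
    """Return the number of principal components needed to explain
    `variance_threshold` of the total variance in X (rows = samples)."""
    Xc = X - X.mean(axis=0)                   # center the data
    cov = np.cov(Xc, rowvar=False)            # covariance matrix
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigenvalues, descending
    cum = np.cumsum(eigvals) / eigvals.sum()  # cumulative variance ratio
    # Smallest number of components whose cumulative ratio meets the threshold:
    return int(np.searchsorted(cum, variance_threshold) + 1)
```

You would then pass the returned value as k to your k-means routine.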
The second commonly used practical solution for intelligently estimating k
is a revised implementation of the k-means algorithm, called k-means++. In essence,
k-means++ differs from the original k-means by the addition of a pre-processing
step. During this step, the number and initial positions of the centroids are estimated.
The algorithm that k-means++ relies on to do this is straightforward to understand
and to implement in code. A good source for both is a 2007 post on the LingPipe Blog,
which offers an excellent explanation of k-means++ and includes a citation to the
original paper that first introduced the technique.
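The heart of that pre-processing step is "D^2 sampling": after choosing the first centroid uniformly at random, each subsequent centroid is drawn from the data points with probability proportional to the squared distance from the point to its nearest already-chosen centroid. A minimal NumPy sketch (the function name and RNG handling are my own):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """k-means++ seeding: choose k initial centroids from X via D^2 sampling."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # First centroid: a data point chosen uniformly at random.
    centroids = [X[rng.integers(n)]]
    for _ in range(k - 1):
        # Squared distance from every point to its nearest chosen centroid.
        diffs = X[:, None, :] - np.array(centroids)[None, :, :]
        d2 = np.min((diffs ** 2).sum(axis=-1), axis=1)
        # Sample the next centroid with probability proportional to D^2,
        # which strongly favors points far from all current centroids.
        centroids.append(X[rng.choice(n, p=d2 / d2.sum())])
    return np.array(centroids)
```

The resulting array is then used as the starting centroids for the usual k-means iterations.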
Aside from providing an optimal choice for k, k-means++ is apparently superior to
the original k-means in both speed (roughly half the processing time of k-means
in one published comparison) and accuracy (a three-orders-of-magnitude improvement
in error in the same comparison).