Why is k-means clustering ignoring a significant patch of data?

Question

I'm working with a set of co-ordinates, and want to dynamically (I have many sets that need to go through this process) understand how many distinct groups there are within the data. My approach was to apply k-means to investigate whether it would find the centroids and I could go from there.

When plotting some data with 6 distinct clusters (visually) the k-means algorithm continues to ignore two significant clusters while putting many centroids into another.

See image below:

Clusters and Centroids

Red are the co-ordinate data points and blue are centroids that k-means has provided. In this specific case I've gone for 15 (arbitrary), but it still doesn't recognise those patches of data on the right hand side, rather putting a mid point between them while putting in 8 in the cluster in the top right.

Admittedly there are slightly more data points in the top right, but not by much.

I'm using the standard k-means algorithm in R and just feeding in x and y co-ordinates. I've tried standardising the data, but this doesn't make any difference.

Any thoughts on why this is, or other potential methodologies that could be applied to try and dynamically understand the number of distinct clusters there are in the data?

You could try standardizing the data, and increasing `iter.max` and `nstart` (maybe `iter.max = 500` and `nstart = 25`)? Any chance you could post your data? — bouncyball, Aug 17 '17 at 14:25
It's easier to help if you provide a [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input data so we can run and test the code to see what's going on. — MrFlick, Aug 17 '17 at 14:44

score 0 · Answer 1 · answered Aug 17 '17 at 14:30

You could try with Self-organizing map:

this is a clustering algorithm based on Neural Networks which create a discretized representation of the input space of the training samples, called a map, and is, therefore, a method to do dimensionality reduction (SOM).

This algorithm is very good for clustering also because does not require a priori selection of the number of clusters (in k-mean you need to choose k, here no). In your case, it hopefully finds automatically the optimal number of cluster, and you can actually visualize it.

You can find a very nice python package called somoclu which has got this algorithm implemented and an easy way to visualize the result. Else you can go with R. Here you can find a blog post with a tutorial, and Cran package manual for SOM.

score 0 · Answer 2 · answered Aug 18 '17 at 06:08

0

K-means is a randomized algorithm and it will get stuck in local minima.

Because of these problems, it is common to run k-means several times, and keep the result with least squares, I.e., the best of the local minima found.

answered Aug 18 '17 at 06:08

Has QUIT--Anony-Mousse

76,138
12
138
194

Why is k-means clustering ignoring a significant patch of data?

2 Answers2