
I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters (k) to select.

So far, I have tried to do this using the Elbow Method. I plotted the sum of squared error vs. the number of clusters (k), but I got a graph like the one below, which makes it confusing to identify the elbow point.

[Plot: the sum of squared error vs. the number of clusters]

I need to know why I get a graph like this and how to identify the optimal number of clusters.
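For context, a minimal sketch of how such a curve is typically produced, assuming scikit-learn; the synthetic data stands in for the question's (unshown) feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

ks = range(1, 11)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ = sum of squared distances to the closest center

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared error (SSE)")
plt.show()
```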

– Ayesh Weerasinghe

2 Answers


Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.

The theory behind the Elbow Method is that we want to minimize some error function (e.g. the sum of squared errors) while simultaneously picking a low value of k.

The Elbow Method thus suggests that a good value of k lies at a point on the plot that resembles an elbow: the error is small, but does not decrease drastically as k increases beyond it.

In your plot you could argue that both k=3 and k=6 resemble elbows. By picking k=3 you'd have picked a small k, and we see that k=4 and k=5 don't do much better at minimizing the error. The same goes for k=6.
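One way to make "doesn't decrease drastically" concrete is to look at how much the SSE drops with each additional cluster. A small sketch with hypothetical SSE values (substitute your own curve):

```python
import numpy as np

# Hypothetical SSE values for k = 1..8, shaped like a curve with elbows at k=3 and k=6.
sse = np.array([5000.0, 3200.0, 1900.0, 1700.0, 1550.0, 1100.0, 1050.0, 1020.0])
drops = -np.diff(sse)  # SSE reduction gained by each extra cluster
for k, d in zip(range(2, 9), drops):
    print(f"k={k}: SSE drops by {d:.0f}")
# Elbow candidates are where a large drop is followed by small ones (here k=3 and k=6).
```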

– Bjarke Kingo
  • Thank you for your answer, but what I need to know is why the graph behaves like this. If I select k=3, the error is much higher, and if I select k=6, the number of clusters is higher. I don't know whether this happens because of a problem with my feature set or not. – Ayesh Weerasinghe Aug 01 '19 at 11:23
  • As mentioned, there is no real answer to the best k. Also, you can't expect the plot to look like a smooth elbow. Your data may contain 3 large feasible clusters, each of which could be further divided into 2 subclusters, making 6 clusters a feasible pick as well. You could try plotting your data on its principal components to see whether the number of clusters seems feasible when compared to the elbow plot. – Bjarke Kingo Aug 01 '19 at 11:27
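A minimal sketch of the principal-component plot suggested in the comment above, assuming scikit-learn; the synthetic data is a stand-in for the questioner's feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=0)

# Project onto the first two principal components and eyeball the cluster structure.
pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```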

K-means is not suitable for outlier detection. This keeps popping up here all the time.

  1. K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the clusters, varying only by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements from the real clusters and should be explained, not removed.
  2. K-means itself is known not to work well on noisy data, where some points do not belong to any cluster.
  3. It tends to split large real clusters in two, and then points right in the middle of the real cluster end up with a large distance to the k-means centers.
  4. It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.

Instead, use an actual outlier detection algorithm, such as Local Outlier Factor (LOF), kNN outlier detection, or LoOP, which were conceptualized with noisy data in mind.
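As one example, a hedged sketch using scikit-learn's LocalOutlierFactor; the synthetic data and the n_neighbors value are illustrative assumptions, not tuned choices:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: three blobs plus a few uniform-noise outliers.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = np.vstack([X, np.random.default_rng(0).uniform(-15, 15, size=(10, 2))])

lof = LocalOutlierFactor(n_neighbors=20)  # n_neighbors is a guess, not a tuned value
labels = lof.fit_predict(X)               # 1 = inlier, -1 = outlier
outliers = X[labels == -1]
print(f"{len(outliers)} points flagged as outliers")
```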

– Has QUIT--Anony-Mousse
  • My approach is to build a k-means model using pure data and then identify outliers that fall outside a threshold squared-error value chosen using percentiles (see the sketch after this thread). Won't that work for identifying outliers? – Ayesh Weerasinghe Aug 07 '19 at 04:43
  • That is a semi-supervised approach. It may work on toy data where k-means works perfectly *and* where all clusters have the same variance. For example point 3 above still applies! In the more general semi-supervised setting, and with real data, I'd assume that one-class SVMs work *much* better. – Has QUIT--Anony-Mousse Aug 07 '19 at 12:55
  • I have made a few modifications to the above k-means clustering model and tested its conventional accuracy on a labeled dataset, and did the same with Local Outlier Factor (LOF). I got lower accuracy for LOF, as it produces a lot of false positives (detects normal data as outliers). What can be the reason for this? – Ayesh Weerasinghe Aug 21 '19 at 11:36
  • Overfitting, and bad labels. Maybe your labels don't label all anomalies? – Has QUIT--Anony-Mousse Aug 21 '19 at 11:43
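For reference, a minimal sketch of the percentile-threshold idea from the first comment in this thread; the choice of k=3 and the 95th-percentile cutoff are illustrative assumptions, and points 3 and 4 of the answer above still apply to it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # stand-in data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # k=3 is an assumption
# Squared distance of each point to its nearest cluster center.
sq_dist = np.min(km.transform(X) ** 2, axis=1)
threshold = np.percentile(sq_dist, 95)  # arbitrary percentile cutoff
outliers = X[sq_dist > threshold]
# Caveat: split clusters inflate these distances for normal points (point 3),
# and outliers that captured their own center score near 0 (point 4).
```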