
I am going to build a K-means clustering model for outlier detection. For that, I need to identify the best number of clusters (k) to select.

So far, I have tried to do this using the Elbow Method. I plotted the sum of squared error vs. the number of clusters (k), but I got a graph like the one below, which makes it confusing to identify the elbow point.

[Plot: the sum of squared error vs. the number of clusters]

I need to know why I get a graph like this and how to identify the optimal number of clusters.
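For context, a minimal sketch of how such a curve is typically produced, assuming scikit-learn; the synthetic data stands in for the question's (unshown) feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)

ks = range(1, 11)
sse = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)  # inertia_ = sum of squared distances to the closest center

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters (k)")
plt.ylabel("Sum of squared error (SSE)")
plt.show()
```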

– Ayesh Weerasinghe

2 Answers


Remember that the Elbow Method doesn't just 'give' the best value of k, since the best value of k is up to interpretation.

The theory behind the Elbow Method is that we want to minimize some error function (e.g. the sum of squared errors) while simultaneously picking a low value of k.

The Elbow Method thus suggests that a good value of k lies at a point on the plot that resembles an elbow: the error is small, but does not decrease drastically as k increases beyond it.

In your plot you could argue that both k=3 and k=6 resemble elbows. By picking k=3 you'd have picked a small k, and we see that k=4 and k=5 don't do much better at minimizing the error. The same goes for k=6.
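One way to make "doesn't decrease drastically" concrete is to look at how much the SSE drops with each additional cluster. A small sketch with hypothetical SSE values (substitute your own curve):

```python
import numpy as np

# Hypothetical SSE values for k = 1..8, shaped like a curve with elbows at k=3 and k=6.
sse = np.array([5000.0, 3200.0, 1900.0, 1700.0, 1550.0, 1100.0, 1050.0, 1020.0])
drops = -np.diff(sse)  # SSE reduction gained by each extra cluster
for k, d in zip(range(2, 9), drops):
    print(f"k={k}: SSE drops by {d:.0f}")
# Elbow candidates are where a large drop is followed by small ones (here k=3 and k=6).
```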

– Bjarke Kingo
  • Thank you for your answer, but what I need to know is why the graph behaves like this. If I select k=3, the error is much higher, and if I select k=6, the number of clusters is higher. I don't know whether this happens because of a problem with my feature set or not. – Ayesh Weerasinghe Aug 01 '19 at 11:23
  • As mentioned, there is no real answer to the best k. Also, you can't expect the plot to look like a smooth elbow. Your data may contain 3 large feasible clusters, each of which could be further divided into 2 subclusters, making 6 clusters a feasible pick as well. You could try plotting your data on its principal components to see whether the number of clusters seems feasible when compared to the elbow plot. – Bjarke Kingo Aug 01 '19 at 11:27
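A minimal sketch of the principal-component plot suggested in the comment above, assuming scikit-learn; the synthetic data is a stand-in for the questioner's feature matrix:

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA

# Synthetic stand-in for the real feature matrix.
X, _ = make_blobs(n_samples=500, centers=3, n_features=5, random_state=0)

# Project onto the first two principal components and eyeball the cluster structure.
pcs = PCA(n_components=2).fit_transform(X)
plt.scatter(pcs[:, 0], pcs[:, 1], s=10)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```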

K-means is not suitable for outlier detection. This keeps popping up here all the time.

  1. K-means is conceptualized for "pure" data, with no false points. All measurements are supposed to come from the clusters, varying only by some Gaussian measurement error. Occasionally this may yield some more extreme values, but even these are real measurements from the real clusters and should be explained, not removed.
  2. K-means itself is known not to work well on noisy data, where some points do not belong to any cluster.
  3. It tends to split large real clusters in two, and then points right in the middle of the real cluster end up with a large distance to the k-means centers.
  4. It tends to put outliers into their own clusters (because that reduces SSQ), and then the actual outliers will have a small distance, even 0.

Instead, use an actual outlier detection algorithm, such as Local Outlier Factor (LOF), kNN outlier detection, or LoOP, which were conceptualized with noisy data in mind.
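As one example, a hedged sketch using scikit-learn's LocalOutlierFactor; the synthetic data and the n_neighbors value are illustrative assumptions, not tuned choices:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: three blobs plus a few uniform-noise outliers.
X, _ = make_blobs(n_samples=500, centers=3, random_state=0)
X = np.vstack([X, np.random.default_rng(0).uniform(-15, 15, size=(10, 2))])

lof = LocalOutlierFactor(n_neighbors=20)  # n_neighbors is a guess, not a tuned value
labels = lof.fit_predict(X)               # 1 = inlier, -1 = outlier
outliers = X[labels == -1]
print(f"{len(outliers)} points flagged as outliers")
```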

– Has QUIT--Anony-Mousse
  • My approach is to build a k-means model using pure data and then identify outliers that fall outside a threshold squared-error value chosen using percentiles (see the sketch after this thread). Won't that work for identifying outliers? – Ayesh Weerasinghe Aug 07 '19 at 04:43
  • That is a semi-supervised approach. It may work on toy data where k-means works perfectly *and* where all clusters have the same variance. For example point 3 above still applies! In the more general semi-supervised setting, and with real data, I'd assume that one-class SVMs work *much* better. – Has QUIT--Anony-Mousse Aug 07 '19 at 12:55
  • I have made a few modifications to the above k-means clustering model and tested its conventional accuracy on a labeled dataset, and did the same with Local Outlier Factor (LOF). I got lower accuracy for LOF, as it produces a lot of false positives (detects normal data as outliers). What can be the reason for this? – Ayesh Weerasinghe Aug 21 '19 at 11:36
  • Overfitting, and bad labels. Maybe your labels don't label all anomalies? – Has QUIT--Anony-Mousse Aug 21 '19 at 11:43
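For reference, a minimal sketch of the percentile-threshold idea from the first comment in this thread; the choice of k=3 and the 95th-percentile cutoff are illustrative assumptions, and points 3 and 4 of the answer above still apply to it:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=3, random_state=0)  # stand-in data

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)  # k=3 is an assumption
# Squared distance of each point to its nearest cluster center.
sq_dist = np.min(km.transform(X) ** 2, axis=1)
threshold = np.percentile(sq_dist, 95)  # arbitrary percentile cutoff
outliers = X[sq_dist > threshold]
# Caveat: split clusters inflate these distances for normal points (point 3),
# and outliers that captured their own center score near 0 (point 4).
```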