0

I am looking for the optimal number of clusters to be used for a cluster analysis. Therefore I am using a scree plot created by the following code:

WSS = sapply(1:20, FUN=function(k) {
  kmeans(df, centers=k, nstart=5)$tot.withinss
})
plot(1:20, WSS, type="l")
abline(v=5, col="red", lty=2)

Now, I know that the bend/knee in the plot indicates the optimal number of clusters. Is there any way of retrieving this value without graphical inspection?

Data looks as follows

df = structure(list(Birthrate = c(18.2, 8.5, 54.1, 1.4, 2.1, 83.6, 
17, 1, 0.8, 61.7, 4.9, 7.9, 2, 14.2, 48.2, 17.1, 10.4, 37.5, 
1.6, 49.5, 10.8, 6.2, 7.1, 7.8, 3, 3.7, 4.2, 8.7), GDP = c(1.22, 
0.06, 0, 0.54, 2.34, 0.74, 1.03, 1.21, 0, 0.2, 1.41, 0.79, 2.75, 
0.03, 11.13, 0.05, 2.99, 0.71, 0, 0.9, 1.15, 0, 1.15, 1.44, 0, 
0.71, 1.21, 1.45), Income = c(11.56, 146.75, 167.23, 7, 7, 7, 
10.07, 7, 7, 7, 47.43, 20.42, 7.52, 7, 7, 15.98, 15.15, 20.42, 
7, 22.6, 7, 7, 18.55, 7, 7.7, 7, 7, 7), Population = c(54, 94, 
37, 95, 98, 31, 78, 97, 95, 74, 74, 81, 95, 16, 44, 63, 95, 20, 
95, 83, 98, 98, 84, 62, 98, 98, 97, 98)), .Names = c("Birthrate", 
"GDP", "Income", "Population"), class = "data.frame", row.names = c(NA, 
-28L))
Jonathan Rhein
  • 1,616
  • 3
  • 23
  • 47
  • related question: http://stackoverflow.com/questions/15376075/cluster-analysis-in-r-determine-the-optimal-number-of-clusters – C8H10N4O2 May 03 '16 at 19:19

1 Answers1

2

You could find the index of the value where the second difference of successive values switches from positive to negative. This would be the discrete equivalent of where the second derivative switches from positive to negative:

To get the index of the knee value (i.e., the value chosen as the optimal number of clusters):

min(which(diff(diff(WSS)) < 0)) - 3 

To get the value of WSS at the knee:

WSS[min(which(diff(diff(WSS)) < 0)) - 3]

The - 3 is because taking the second difference shortens the resulting vector by 2, and then we subtract one more to get the index of the first of two points across which the second difference goes negative.

I'm not sure how robust this is, and I'd be surprised if someone hasn't come up with a better algorithm, but perhaps this will get you started.

eipi10
  • 91,525
  • 24
  • 209
  • 285