1

I have the following vector:

wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287, 
216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227, 
118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563, 
116.448588269066, 88.1287299776581, 83.9345098736843)

And if we with the following plot code

plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

we can get this:

enter image description here

By eye we can see at x-axis point 4 the value change begin to change drastically plateaued.

My question is given the vector wss how can we automatically detect the index 4 without looking at the plot.

littleworth
  • 4,781
  • 6
  • 42
  • 76

1 Answers1

1

Edit: This works better:

#change relative to the maximum change
threshold <- 0.1

d1 <- diff(wss)

# this assumes that the first value is the highest
## you could use max(d1) instead of d1[1]

which.max((d1 / d1[1]) < threshold)  #results in 3

d1 <- diff(wss2)
which.max(d1 / d1[1] < threshold) #results in 5

Second Edit: This is somewhat subjective, but here's how my three methods compare for your two data sets. While it's easy to visualize what a plateau is, you need to be able to describe in math terminology what a plateau is in order to automate it.

Plateau Finding Methods

Original: If you know that the second derivative will flip from positive to negative, you can do this:

sec_der <- diff(wss, differences = 2)
inflection_pt <- which.min(sign(sec_der))

inflection_pt

For this data set, the result is 5 which corresponds to the original datasets result of 7 (i.e., 151.991).

Instead of looking at inflection points, you could instead look at some relative percent threshold.

thrshold <- 0.06

which.min(sign(abs(diff(wss)) / wss[1:(length(wss)-1)] - thrshold))

This results in 5 as well using the first derivative approach.

Regardless, using the diff() function would be a key part of figuring this out in base R. Also see:

Finding the elbow/knee in a curve

Code to create graphs:

wss <- c(23265.2302840678, 4917.06943551649, 1330.49917983449, 288.050702912287, 
         216.182464712486, 203.769578557051, 151.991297068931, 139.635571841227, 
         118.285305833194, 117.164567420633, 105.397722980407, 95.4682187817563, 
         116.448588269066, 88.1287299776581, 83.9345098736843)

wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)

data_list <- list(wss, wss2)

# Potential_methods -------------------------------------------------------

plateau_method = list(thresh_to_max = function(x) which.max(diff(x) / diff(x)[1] < threshold)
                      , inflection_pt = function(x) which.min(sign(diff(x, differences = 2)))
                      , deriv_to_raw = function(x) which.min(sign(abs(diff(x)) / x[1:(length(x)-1)] - threshold))
)

threshold <- 0.1

results <- t(sapply(plateau_method, mapply, data_list))

# graphing ----------------------------------------------------------------

par(mfrow = c(3,2))

apply(results, 1, function (x) {
  for (i in seq_along(x)) {
    plot(data_list[[i]],ylab="Within groups sum of squares", type = 'b', xlab = 'Number of Clusters')
    abline(v = x[i])
  }
} )

lapply(seq_along(names(plateau_method))
       , function (i) {
         mtext(paste(names(plateau_method)[i]
                     , "- \n"
                     , substring(plateau_method[i], 15))
               , side = 3, line = -18*(i)+15, outer = TRUE)
         })

mtext('Threshold = 0.1', side = 3, line = -53, outer = T)
Cole
  • 11,130
  • 1
  • 9
  • 24
  • I tried this `wss2 <- c(1970.08410513303, 936.826421218935, 463.151086710784, 310.219800983285, 227.747583214178, 191.601552329558, 159.703151798393, 146.881710048563, 138.699803963718, 134.534334658148)` Your approach give index 1 where actuallly it's 4th indices. – littleworth Jun 10 '19 at 04:40
  • 1
    what is the purpose of `diff(diff(...))`???? It seems you are using `diff` for the first time. Do `diff(wss,difference=2)` – Onyambu Jun 10 '19 at 04:47
  • 1
    Tough crowd. See edit for a different algorithm - my answers are now 3 and 5 for your two vectors. Onyambu - you're right, this is the first time I've used it to try to calculate second derivatives. Thanks for pointing out the ```differences``` argument. – Cole Jun 10 '19 at 05:03
  • @Onyambu @Cole still with `wss2` it didn't get the desired value. – littleworth Jun 10 '19 at 05:19
  • @scamander you need to tune it. When I plotted wss2, I thought 5 was a fine answer. – Cole Jun 10 '19 at 10:22
  • See second edit and graph. Basically, you'll need to better define what a plateau is in order to automate it. – Cole Jun 10 '19 at 17:20