
I am trying to solve a clustering problem in R on data that contains only binary independent variables. I have only a basic understanding of R. Using R code that executes the steps given below, I observed that the silhouette coefficient for a few of the initial iterations (values of k) exceeds its permissible range. Attached is a snapshot of the same.

Steps followed:

  1. Calculate the distance matrix containing the Jaccard dissimilarities between each pair of records. Function: vegdist() from package vegan.
  2. Use the distance matrix for k-means and run k-means for k from 1 to 12. Function: kmeansruns() from package fpc.
  3. Capture the average silhouette width (asw) for each k and identify the best k, i.e. the one that gives the maximum average silhouette width.
  4. Cross-validate this 'k' (found in step 3) to judge the stability of the clusters, using 100 runs on bootstrapped samples.

I find that the plot of k (X axis) versus asw (Y axis) [k_versus_asw.jpeg] shows inconsistent average silhouette values, some outside the permissible range.
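
For reference, the raw criterion values returned by kmeansruns() (clustering.asw in the attached code below) can be inspected directly; genuine average silhouette widths must lie within [-1, 1]:

range(clustering.asw$crit)   # raw asw values per k, before any rescaling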

Can someone please help with what could be going wrong here? Or is there another clustering algorithm that should be used?

Attaching the code and sample data for this analysis:

Code:

###############################################

library(vegan)      # vegdist()
library(fpc)        # kmeansruns(), clusterboot()
library(reshape2)   # melt()
library(ggplot2)

# Step 1: Jaccard dissimilarities between each pair of records
dist <- vegdist(mydat2, method = "jaccard")

# Step 2: k-means for k = 1..12, judged by average silhouette width
clustering.asw <- kmeansruns(dist, krange = 1:12, criterion = "asw")
clustering.asw$bestk

# Step 3: collect the criterion values and plot them against k
critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
critframe <- melt(critframe, id.vars = c("k"),
                  variable.name = "measure", value.name = "score")

ggplot(critframe, aes(x = k, y = score, color = measure)) +
  geom_point(aes(shape = measure)) +
  geom_line(aes(linetype = measure)) +
  scale_x_continuous(breaks = 1:12, labels = 1:12)

summary(clustering.asw)

# Step 4: bootstrap the chosen k to judge cluster stability
kbest.p <- 2

cboot <- clusterboot(dist, clustermethod = kmeansCBI, runs = 100,
                     iter.max = 100, krange = kbest.p, seed = 12345)
groups <- cboot$result$partition

print(cboot$result$partition, kbest.p)

cboot$bootmean
cboot$bootbrd

####################################################

Sample Data:

ID V1 V2 V3 V4 V5
 1  0  1  0  1  0
 2  0  1  0  0  1
 3  0  0  0  0  0
 4  1  0  0  1  0
 5  1  0  1  1  0
 6  0  1  0  0  0
 7  0  0  0  0  0
 8  0  0  0  0  1
 9  0  0  1  0  0
10  0  1  0  1  0
11  0  0  0  0  0
12  1  0  0  0  1
13  1  0  0  0  0
14  1  1  0  0  0
15  0  0  0  0  0
16  0  0  0  0  0
17  0  0  0  0  0
18  0  0  1  1  0
19  0  0  0  1  1
20  0  1  0  1  0

There are 40 such binary columns and 350+ observations.

AJosh
  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – shekeine Oct 06 '15 at 09:51

2 Answers


k-means cannot use a distance matrix. It only works with squared Euclidean distance (and equivalent distances that are Euclidean in some kernel space, where the kernel preserves the mean).

It computes point-to-mean distances, not point-to-point. Therefore, a distance matrix is useless.

Nevertheless, the silhouette should be in [-1, +1], so there is something incorrect in the code that you are using - please look at the code, don't treat it as a black box.
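
If you need a method that works directly on a Jaccard dissimilarity matrix, k-medoids (PAM) is one option. A minimal sketch, assuming the data frame mydat2 from the question and using pamk() from fpc:

library(vegan)   # vegdist()
library(fpc)     # pamk()

# PAM (k-medoids) accepts a precomputed dissimilarity matrix, unlike k-means,
# and the silhouette is computed on that same matrix.
d  <- vegdist(mydat2, method = "jaccard")        # pairwise Jaccard dissimilarities
pk <- pamk(d, krange = 2:12, criterion = "asw")  # silhouette needs k >= 2

pk$nc                      # best k by average silhouette width
pk$crit                    # criterion (asw) values per k, within [-1, 1]
pk$pamobject$clustering    # cluster assignment for each record

Hierarchical clustering (hclust()) likewise accepts a dist object directly, so it is another option for this kind of data.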

Has QUIT--Anony-Mousse
  • Agreed that k-means shouldn't use case-to-case dissimilarities. In that case, would k-medoids help? I know it picks representative observations from the data itself rather than using randomly assigned cluster centers. Or should hierarchical clustering work? – AJosh Oct 06 '15 at 12:47

The error is in:

critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))  

When you standardize the silhouette values with scale(), they are no longer bounded by [-1, 1] - and they also become much harder to interpret.
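
A minimal sketch of the fix, reusing the objects from the code in the question: plot the raw average silhouette widths instead of the standardized ones.

# Keep the raw asw values; do not pass them through scale()
critframe <- data.frame(k = 1:12, asw = clustering.asw$crit)

library(ggplot2)
ggplot(critframe, aes(x = k, y = asw)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:12, labels = 1:12)

The values plotted this way stay within [-1, 1], and the k with the highest value is the one kmeansruns() reports as bestk.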

André Costa