
I am trying to solve a clustering problem in R on data that contains only binary independent variables. I have only a basic understanding of R. Using R code that executes the steps given below, I observed that the silhouette coefficient for a few of the initial iterations (values of k) exceeds its permissible range. Attached is a snapshot of the same.

Steps followed:

  1. Calculate the distance matrix containing the Jaccard dissimilarities between each pair of records. Function: vegdist() from package vegan.
  2. Use the distance matrix for k-means and run k-means for k from 1 to 12. Function: kmeansruns() from package fpc.
  3. Capture the average silhouette width (asw) for each k and identify the best k, i.e. the one that gives the maximum average silhouette width.
  4. Cross-validate this 'k' (found in step 3) to judge the stability of the clusters, using 100 runs on bootstrapped samples.

I find that the plot of k (X axis) versus asw (Y axis) [k_versus_asw.jpeg] shows inconsistent average silhouette values, some outside the permissible range.
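
For reference, the raw criterion values returned by kmeansruns() (clustering.asw in the attached code below) can be inspected directly; genuine average silhouette widths must lie within [-1, 1]:

range(clustering.asw$crit)   # raw asw values per k, before any rescaling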

Can someone please help with what could be going wrong here? Or is there another clustering algorithm that should be used?

Attaching the code and sample data for this analysis:

Code:

###############################################

library(vegan)      # vegdist()
library(fpc)        # kmeansruns(), clusterboot()
library(reshape2)   # melt()
library(ggplot2)

# Step 1: Jaccard dissimilarities between each pair of records
dist <- vegdist(mydat2, method = "jaccard")

# Step 2: k-means for k = 1..12, judged by average silhouette width
clustering.asw <- kmeansruns(dist, krange = 1:12, criterion = "asw")
clustering.asw$bestk

# Step 3: collect the criterion values and plot them against k
critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
critframe <- melt(critframe, id.vars = c("k"),
                  variable.name = "measure", value.name = "score")

ggplot(critframe, aes(x = k, y = score, color = measure)) +
  geom_point(aes(shape = measure)) +
  geom_line(aes(linetype = measure)) +
  scale_x_continuous(breaks = 1:12, labels = 1:12)

summary(clustering.asw)

# Step 4: bootstrap the chosen k to judge cluster stability
kbest.p <- 2

cboot <- clusterboot(dist, clustermethod = kmeansCBI, runs = 100,
                     iter.max = 100, krange = kbest.p, seed = 12345)
groups <- cboot$result$partition

print(cboot$result$partition, kbest.p)

cboot$bootmean
cboot$bootbrd

####################################################

Sample Data:

ID V1 V2 V3 V4 V5
 1  0  1  0  1  0
 2  0  1  0  0  1
 3  0  0  0  0  0
 4  1  0  0  1  0
 5  1  0  1  1  0
 6  0  1  0  0  0
 7  0  0  0  0  0
 8  0  0  0  0  1
 9  0  0  1  0  0
10  0  1  0  1  0
11  0  0  0  0  0
12  1  0  0  0  1
13  1  0  0  0  0
14  1  1  0  0  0
15  0  0  0  0  0
16  0  0  0  0  0
17  0  0  0  0  0
18  0  0  1  1  0
19  0  0  0  1  1
20  0  1  0  1  0

There are 40 such binary columns and 350+ observations.

AJosh
  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). – shekeine Oct 06 '15 at 09:51

2 Answers


k-means cannot use a distance matrix. It only works with squared Euclidean distance (and equivalent distances that are Euclidean in some kernel space, where the kernel preserves the mean).

It computes point-to-mean distances, not point-to-point. Therefore, a distance matrix is useless.

Nevertheless, the silhouette should be in [-1, +1], so there is something incorrect in the code that you are using - please look at the code, don't treat it as a black box.
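
If you need a method that works directly on a Jaccard dissimilarity matrix, k-medoids (PAM) is one option. A minimal sketch, assuming the data frame mydat2 from the question and using pamk() from fpc:

library(vegan)   # vegdist()
library(fpc)     # pamk()

# PAM (k-medoids) accepts a precomputed dissimilarity matrix, unlike k-means,
# and the silhouette is computed on that same matrix.
d  <- vegdist(mydat2, method = "jaccard")        # pairwise Jaccard dissimilarities
pk <- pamk(d, krange = 2:12, criterion = "asw")  # silhouette needs k >= 2

pk$nc                      # best k by average silhouette width
pk$crit                    # criterion (asw) values per k, within [-1, 1]
pk$pamobject$clustering    # cluster assignment for each record

Hierarchical clustering (hclust()) likewise accepts a dist object directly, so it is another option for this kind of data.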

Has QUIT--Anony-Mousse
  • Agreed that k-means shouldn't use case-to-case dissimilarities. In that case, would k-medoids help? I know it picks representative observations from the data itself rather than using randomly assigned cluster centers. Or should hierarchical clustering work? – AJosh Oct 06 '15 at 12:47

The error is in:

critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))  

When you standardize the silhouette values with scale(), they are no longer bounded by [-1, 1] - and they also become much harder to interpret.
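
A minimal sketch of the fix, reusing the objects from the code in the question: plot the raw average silhouette widths instead of the standardized ones.

# Keep the raw asw values; do not pass them through scale()
critframe <- data.frame(k = 1:12, asw = clustering.asw$crit)

library(ggplot2)
ggplot(critframe, aes(x = k, y = asw)) +
  geom_point() + geom_line() +
  scale_x_continuous(breaks = 1:12, labels = 1:12)

The values plotted this way stay within [-1, 1], and the k with the highest value is the one kmeansruns() reports as bestk.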

André Costa