I am trying to solve a clustering problem in R in which all of the independent variables are binary. I only have a basic understanding of R. Using R code that carries out the steps given below, I observed that the silhouette coefficient for the first few iterations falls outside its permissible range of [-1, 1]. A snapshot of this is attached.
Steps followed:
- Calculate the distance matrix containing the Jaccard dissimilarities between each pair of records. Function: vegdist() from package vegan.
- Use the distance matrix for k-means and run k-means for each k from 1 to 12. Function: kmeansruns() from package fpc.
- Capture the average silhouette width (asw) for each iteration and identify the best iteration which gives the maximum silhouette.
- Do cross validation for this 'k' (found from step 3), to judge stability of clusters using 100 iterations and bootstrapped samples only.
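For reference, the Jaccard dissimilarity used in the first step can be sketched in base R as below. This is a hand-rolled illustration, not the `vegdist` implementation: the helper name `jaccard_dissim` is hypothetical, and returning `NA` for two all-zero rows is my own assumption about how to treat the undefined 0/0 case. The three vectors are rows 1, 2 and 10 of the sample data further down.

```r
# Hypothetical helper: Jaccard dissimilarity between two 0/1 vectors.
jaccard_dissim <- function(a, b) {
  a <- as.logical(a)
  b <- as.logical(b)
  n_union <- sum(a | b)               # positions where at least one vector has a 1
  if (n_union == 0) return(NA_real_)  # assumption: undefined when both rows are all zero
  1 - sum(a & b) / n_union            # 1 - |intersection| / |union|
}

# Rows 1, 2 and 10 of the sample data (columns V1..V5)
r1  <- c(0, 1, 0, 1, 0)
r2  <- c(0, 1, 0, 0, 1)
r10 <- c(0, 1, 0, 1, 0)

d_1_2  <- jaccard_dissim(r1, r2)    # one shared 1 out of three positions with any 1
d_1_10 <- jaccard_dissim(r1, r10)   # identical rows
```

Rows that consist only of zeros (several appear in the sample data) have no defined Jaccard dissimilarity to each other, which may be worth keeping in mind when reading the distance matrix.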
I find that the plot of k (X axis) versus asw (Y axis) [k_versus_asw.jpeg] shows inconsistent average silhouette values.
Can someone please help with what could be going wrong here? Or is there another clustering algorithm that I should use instead?
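One sanity check that may help narrow this down, as a minimal sketch only: an average silhouette width is bounded by [-1, 1] by definition, whereas base R's `scale()` centres its input and divides by the standard deviation, so standardised scores need not stay inside that interval. The `asw_raw` values below are made up purely for illustration and are not results from my data.

```r
# Hypothetical ASW values for k = 1..6 (illustration only, not real output)
asw_raw <- c(0, 0.42, 0.31, 0.28, 0.25, 0.22)

# scale() returns a one-column matrix of z-scores; flatten it to a plain vector
asw_scaled <- as.numeric(scale(asw_raw))

raw_in_range    <- all(asw_raw >= -1 & asw_raw <= 1)        # raw values respect the bound
scaled_in_range <- all(asw_scaled >= -1 & asw_scaled <= 1)  # z-scores are not bound by it

range(asw_raw)
range(asw_scaled)
```

Comparing `range(clustering.asw$crit)` with the range of the scaled values in the same way would show whether the out-of-range numbers come from the clustering itself or only from the standardisation applied before plotting.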
Attaching the code and sample data for this analysis:
Code:
> ###############################################
>
> library(vegan)
> library(fpc)
> library(reshape2)
> library(ggplot2)
>
> dist <- vegdist(mydat2, method = "jaccard")
> clustering.asw <- kmeansruns(dist, krange = 1:12, criterion = "asw")
> clustering.asw$bestk
>
> critframe <- data.frame(k = 1:12, asw = scale(clustering.asw$crit))
>
> critframe <- melt(critframe, id.vars = c("k"),
>                   variable.name = "measure", value.name = "score")
>
> ggplot(critframe, aes(x=k, y=score, color=measure)) +
>   geom_point(aes(shape=measure)) +
>   geom_line(aes(linetype=measure)) +
>   scale_x_continuous(breaks=1:12, labels=1:12)
>
> summary(clustering.asw)
>
> kbest.p <- 2
>
> cboot <- clusterboot(dist, clustermethod = kmeansCBI, runs = 100,
>                      iter.max = 100, krange=kbest.p, seed = 12345)
> groups <- cboot$result$partition
>
> print(cboot$result$partition, kbest.p)
>
> cboot$bootmean
>
> cboot$bootbrd
>
> ####################################################
Sample Data:
ID  V1  V2  V3  V4  V5
 1   0   1   0   1   0
 2   0   1   0   0   1
 3   0   0   0   0   0
 4   1   0   0   1   0
 5   1   0   1   1   0
 6   0   1   0   0   0
 7   0   0   0   0   0
 8   0   0   0   0   1
 9   0   0   1   0   0
10   0   1   0   1   0
11   0   0   0   0   0
12   1   0   0   0   1
13   1   0   0   0   0
14   1   1   0   0   0
15   0   0   0   0   0
16   0   0   0   0   0
17   0   0   0   0   0
18   0   0   1   1   0
19   0   0   0   1   1
20   0   1   0   1   0
The full data set has 40 such binary columns and a little over 350 observations.