
I have a set of data containing: item, associated cluster, silhouette coefficient. I can further augment this data set with more information if necessary.

I would like to generate a silhouette plot in R. The examples I have come across all run the built-in kmeans (or a related) clustering function and plot its result. I want to skip that step and produce the plot for my own clustering algorithm, but I can't work out the correct arguments to pass to the plot function.

Thank you.

EDIT

Data set example https://pastebin.mozilla.org/8853427

What I've tried is loading the dataset and passing it to the plot function using various arguments based on https://stat.ethz.ch/R-manual/R-devel/library/cluster/html/silhouette.html

andrei
  • Please provide some of your data and the code you tried – etienne Nov 30 '15 at 13:03
  • Here's how to create a [reproducible example in R](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). It makes it easier for others to help you. – Heroka Nov 30 '15 at 13:04

1 Answer


The `silhouette` function in the `cluster` package can do the plots for you. It just needs a vector of cluster memberships (produced by whatever algorithm you choose) and a dissimilarity matrix (preferably the same one used to produce the clusters). For example:

library(cluster)
library(vegan)
data(varespec)
dis = vegdist(varespec)
res = pam(dis, 3) # or whatever your choice of clustering algorithm is
sil = silhouette(res$clustering, dis) # or use your cluster vector
dev.new() # open a separate graphics device (portable replacement for windows()); RStudio sometimes does not display silhouette plots correctly
plot(sil)
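To illustrate that the cluster vector can come from any algorithm, here is a minimal sketch that uses hierarchical clustering instead of `pam`. The choice of the built-in `USArrests` data set, `hclust` with average linkage, and three clusters are assumptions made purely for illustration; substitute your own membership vector and dissimilarity matrix.

```r
library(cluster)

# Illustrative data only: a built-in data set, scaled so that no single
# variable dominates the Euclidean distances.
x   <- scale(USArrests)
dis <- dist(x)

# Any algorithm works as long as it yields one integer cluster label per row.
hc      <- hclust(dis, method = "average")
members <- cutree(hc, k = 3)

# silhouette() takes the label vector and the dissimilarities.
sil <- silhouette(members, dis)
plot(sil)
```

The only requirement is that the label vector and the dissimilarity matrix refer to the same observations in the same order.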

EDIT: For k-means (which uses squared Euclidean distance):

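library(vegan)
library(cluster)
data(varespec)
dis = dist(varespec)^2 # squared Euclidean distances, matching the k-means criterion
res = kmeans(varespec, 3)
sil = silhouette(res$cluster, dis)
dev.new() # open a separate graphics device (portable replacement for windows())
plot(sil)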
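If the silhouette coefficients have already been computed elsewhere, as the question describes, one possible workaround is to draw the plot directly with base graphics rather than through `silhouette`. The sketch below assumes a data frame with hypothetical columns `item`, `cluster` and `sil_width`; the values are made up for illustration, and this only approximates the layout of `plot(sil)` rather than reproducing it.

```r
# Hypothetical data: one row per item with its cluster and a precomputed
# silhouette coefficient (made-up values for illustration only).
df <- data.frame(
  item      = paste0("item", 1:8),
  cluster   = c(1, 1, 1, 2, 2, 2, 3, 3),
  sil_width = c(0.62, 0.48, 0.15, 0.71, 0.55, -0.08, 0.40, 0.33)
)

# Sort by cluster, then by decreasing silhouette width within each cluster,
# which is the usual layout of a silhouette plot.
df <- df[order(df$cluster, -df$sil_width), ]

# barplot() with horiz = TRUE draws the first bar at the bottom, so reverse
# the rows to put cluster 1 at the top.
rev_df <- df[rev(seq_len(nrow(df))), ]

# Leave a gap before the first bar of each new cluster (reading bottom-up).
gap <- c(0, ifelse(diff(rev_df$cluster) != 0, 0.5, 0))

barplot(
  rev_df$sil_width,
  names.arg = rev_df$item,
  horiz = TRUE, las = 1, space = gap,
  col = rev_df$cluster + 1,
  xlim = c(min(0, rev_df$sil_width), 1),
  xlab = "Silhouette width",
  main = "Silhouette plot (precomputed values)"
)
abline(v = mean(df$sil_width), lty = 2)  # average silhouette width
```

This avoids recomputing anything, at the cost of losing the per-cluster summary text that `plot(sil)` prints in the margin.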
Philip Perrin
  • Can you go into a bit more detail about the code? What will `dis` contain, and what will `res` contain? – andrei Dec 08 '15 at 09:17
  • `dis` will be a distance/dissimilarity matrix of class `dist`. See `?vegdist` for details. `res` in this case is the results object of `pam` (partitioning around medoids); within this `clustering` is a vector containing the identities of the clusters to which each sample has been assigned. Whatever algorithm you are using, you need to extract the cluster membership vector from the results. Which method do you hope to use? – Philip Perrin Dec 08 '15 at 21:02
  • The data is already clustered using k-means, and the silhouette coefficient has been computed. I'll accept your answer as soon as I get a chance to test this out and see that it works. – andrei Dec 09 '15 at 08:26