I have a dissimilarity matrix and I want to run hierarchical clustering using that matrix as the only input, since I don't have the source data itself. For background, I aim to cluster elements using their mutual correlation as the distance. Following the methodology indicated here, I'm using the correlation matrix to compute the dissimilarity matrix that is passed to hclust. This is working fine.
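For reference, a minimal sketch of this setup. The toy correlation matrix `cor_mat` and the transformation `1 - r` are assumptions made for illustration only; the methodology referred to above may use a different transformation (for example `1 - abs(r)`), and the linkage method is likewise an arbitrary choice.

```r
# Hypothetical toy correlation matrix: two blocks of three strongly
# correlated items, weakly correlated across blocks.
cor_mat <- matrix(0.1, nrow = 6, ncol = 6)
cor_mat[1:3, 1:3] <- 0.8
cor_mat[4:6, 4:6] <- 0.8
diag(cor_mat) <- 1
rownames(cor_mat) <- colnames(cor_mat) <- paste0("v", 1:6)

# Assumed transformation: high correlation means small dissimilarity.
d <- as.dist(1 - cor_mat)

# hclust() only needs the "dist" object, not the source data.
hc <- hclust(d, method = "average")
print(hc)
```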

My question is: how do I find the optimal number of clusters? Is there an index that can be computed by only knowing the dissimilarity matrix? The indices in NbClust require the source data to run - it is not enough to know the dissimilarity matrix. Is there any other method I can use in R?

Andrea
  • How do you define optimal or best number of clusters? – LauriK Jan 27 '15 at 11:37
  • @LauriK I would choose the number of clusters using any of the many indices that have been developed for this purpose, like the ones available in [NbClust](http://cran.r-project.org/web/packages/NbClust/NbClust.pdf). My problem is that I need an index that requires only the dissimilarity matrix, not the original data set. – Andrea Jan 27 '15 at 12:05
  • What does hierarchical clustering have to do with your question? You don't need to set a number of clusters for HC – Lev Kuznetsov Jan 27 '15 at 12:16

1 Answer

From a quick look at the NbClust documentation, it appears possible to supply only the dissimilarity matrix and omit the original data:

NbClust(data = NULL, diss = XYZ, distance = NULL, ...)

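Because the dissimilarity matrix is supplied (referred to here as XYZ), data and distance must both be set to NULL, as stated in the function's Usage section. NbClust should then be able to produce a partition index, although without the data matrix the choice may be limited to indices that need only the dissimilarities (silhouette and Dunn, for example).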
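If the index you want turns out to need the raw data, one that depends only on dissimilarities can also be computed by hand. The following is a rough sketch rather than NbClust's own procedure: it cuts the hclust tree at several values of k and compares the average silhouette width using `silhouette()` from the cluster package. The toy matrix `cor_mat`, the `1 - r` transformation, the linkage method and the range of k are all assumptions chosen for illustration.

```r
library(cluster)  # provides silhouette()

# Hypothetical toy correlation matrix: two blocks of three items.
cor_mat <- matrix(0.1, nrow = 6, ncol = 6)
cor_mat[1:3, 1:3] <- 0.8
cor_mat[4:6, 4:6] <- 0.8
diag(cor_mat) <- 1

# Assumed transformation from correlation to dissimilarity.
d <- as.dist(1 - cor_mat)
hc <- hclust(d, method = "average")

# Average silhouette width for each candidate number of clusters.
# silhouette() accepts cluster labels plus a dissimilarity object,
# so the source data is never needed.
ks <- 2:5
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks

# Pick the k with the largest average silhouette width.
best_k <- ks[which.max(avg_sil)]
print(avg_sil)
print(best_k)
```

The silhouette is only one possible criterion; other dissimilarity-only indices may suggest a different number of clusters, so it is worth comparing a few before settling on one.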

  • The NbClust package provides 30 indices for determining the optimal number of clusters in a data set and proposes to the user the best clustering scheme from the different results obtained by varying all combinations of number of clusters, distance measures, and clustering methods. – Ana Maria Mendes-Pereira Jun 02 '16 at 20:22
  • For the actual publication and a detailed explanation of the **Relevant Number of Clusters**, please refer to [NbClust: An R Package for Determining the Relevant Number of Clusters in a Data Set](https://www.jstatsoft.org/article/view/v061i06); the authors even provide simulated data in the supplementary material. – Ana Maria Mendes-Pereira Jun 02 '16 at 20:25