I'm doing hierarchical clustering with an R package called pvclust, which builds on hclust by incorporating bootstrapping to calculate significance levels for the clusters obtained.
Consider the following data set with 3 dimensions and 10 observations:
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
"D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
"G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
"J"=c(984,98,49)))
When I use hclust alone, the clustering runs fine for both Euclidean and correlation-based distance measures:
# euclidean-based distance
dist1 <- dist(t(mat),method="euclidean")
mat.cl1 <- hclust(dist1,method="average")
# correlation-based distance
dist2 <- as.dist(1 - cor(mat))
mat.cl2 <- hclust(dist2, method="average")
However, when I run each setup with pvclust, as follows:
library(pvclust)
# euclidean-based distance
mat.pcl1 <- pvclust(mat, method.hclust="average", method.dist="euclidean", nboot=1000)
# correlation-based distance
mat.pcl2 <- pvclust(mat, method.hclust="average", method.dist="correlation", nboot=1000)
... I get the following errors:
- Euclidean: Error in hclust(distance, method = method.hclust) : must have n >= 2 objects to cluster
- Correlation: Error in cor(x, method = "pearson", use = use.cor) : supply both 'x' and 'y' or a matrix-like 'x'
Note that the distance is calculated by pvclust, so there is no need for a distance calculation beforehand. Also note that the hclust method (average, median, etc.) does not affect the problem.
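For what it's worth, both messages can be reproduced in base R, without pvclust, by passing in a single row of mat, which R silently drops to a plain vector. This is only a sketch: I have not confirmed that pvclust ever ends up with a single-row sample internally, and one_row and err_msg are names I made up for illustration.

```r
# Same data as above, redefined so the snippet runs on its own.
mat <- as.matrix(data.frame("A"=c(9000,2,238),"B"=c(10000,6,224),"C"=c(1001,3,259),
                            "D"=c(9580,94,51),"E"=c(9328,5,248),"F"=c(10000,100,50),
                            "G"=c(1020,2,240),"H"=c(1012,3,260),"I"=c(1012,3,260),
                            "J"=c(984,98,49)))

# Indexing a single row (default drop = TRUE) yields a numeric vector,
# not a 1 x 10 matrix. `one_row` is a hypothetical name for illustration.
one_row <- mat[1, ]
print(is.matrix(one_row))

# Hypothetical helper: evaluate an expression and return its error message,
# or NA if it ran without error.
err_msg <- function(expr) {
  tryCatch({ force(expr); NA_character_ },
           error = function(e) conditionMessage(e))
}

# Euclidean route: t() of a vector is a 1-row matrix, so dist() sees a
# single object and hclust() refuses to cluster it.
msg_euclid <- err_msg(hclust(dist(t(one_row), method = "euclidean"),
                             method = "average"))

# Correlation route: cor() needs a matrix-like x when y is not supplied.
msg_cor <- err_msg(cor(one_row, method = "pearson"))

print(msg_euclid)
print(msg_cor)
```

If the two printed messages match the ones above, that would at least be consistent with pvclust handing a vector rather than a matrix to its distance step, though it does not show where that would happen.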
When I increase the dimensionality of the data set to 4, pvclust runs fine. Why am I getting these errors for pvclust at 3 dimensions and below but not for hclust? Furthermore, why do the errors disappear when I use a data set with 4 or more dimensions?