I'm using R to perform hierarchical clustering. As a first approach I used hclust and performed the following steps:

  1. I imported the distance matrix
  2. I used the as.dist function to convert it to a dist object
  3. I ran hclust on the dist object

Here's the R code:

distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)
# "ward" was renamed "ward.D" in R 3.1.0
hclust(d, method = "ward.D")

At this point I would like to do something similar with the function pvclust; however, I cannot, because it is not possible to pass it a precomputed dist object. How can I proceed, given that the distance I'm using is not among those provided by R's dist function?

Argalatyr
rlar

3 Answers

I tested Vincent's suggestion. You can do the following (my data set is a dissimilarity matrix):

# Import your data
distm <- read.csv("distMatrix.csv")
d <- as.dist(distm)

# Compute the eigenvalues (all of them are returned, whatever k is)
x <- cmdscale(d, k = 1, eig = TRUE)

# Plot the eigenvalues and choose the number of dimensions:
# keep the leading ones and drop those whose eigenvalues are close to 0
plot(x$eig, 
   type="h", lwd=5, las=1, 
   xlab="Number of dimensions", 
   ylab="Eigenvalues")

# Recover coordinates that reproduce the distance matrix with that number of dimensions
# (nb_dimensions is the value you chose from the plot above)
x <- cmdscale(d, k = nb_dimensions)

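# As mentioned by Stéphane, pvclust() clusters columns, so pass the transpose.
# method.dist defaults to "correlation"; use "euclidean" so the coordinates reproduce d.
library(pvclust)
pvclust(t(x), method.dist = "euclidean")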
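
One way to pick nb_dimensions without reading it off the plot is to count the eigenvalues that are clearly above zero, then check that the embedding reproduces the distances. The sketch below uses only base R; the simulated points and the tolerance `tol` are illustrative assumptions, not part of the original data.

```r
# Hypothetical example data: Euclidean distances between 20 random points in 3-D.
set.seed(1)
pts <- matrix(rnorm(20 * 3), ncol = 3)
d <- dist(pts)

# With eig = TRUE, cmdscale() returns all n eigenvalues regardless of k.
mds <- cmdscale(d, k = 1, eig = TRUE)

# Keep the dimensions whose eigenvalue is non-negligible relative to the largest.
tol <- 1e-8  # assumed relative tolerance; adjust for your data
nb_dimensions <- sum(mds$eig > tol * max(mds$eig))

# Embed with that many dimensions and measure how well distances are reproduced.
x <- cmdscale(d, k = nb_dimensions)
max_err <- max(abs(dist(x) - d))
```

If `max_err` is not close to zero, the dissimilarity is probably not exactly Euclidean and the coordinates are only an approximation of it.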
Gautier Drusch

If the dataset is not too large, you can embed your n points in a space of dimension n-1, with the same distance matrix.

# Sample distance matrix
n <- 100
k <- 1000
d <- dist(matrix(rnorm(k * n), ncol = k), method = "manhattan")

# Recover some coordinates that give the same distance matrix
x <- cmdscale(d, n-1)
stopifnot( sum(abs(dist(x) - d)) < 1e-6 )

# You can then use either x or d
r1 <- hclust(d)
r2 <- hclust(dist(x)) # identical to r1
library(pvclust)
# pvclust() clusters columns and defaults to correlation distance
r3 <- pvclust(t(x), method.dist = "euclidean")

If the dataset is large, you may have to check how pvclust is implemented.

Vincent Zoonekynd
  • In retrospect (i.e., after having replied myself), I believe the OP really wants to pass a distance matrix to `pvclust`, whereas `pvclust` expects a data.frame or matrix object. – chl Jan 19 '12 at 14:53
  • Be careful: pvclust() clusters columns, not rows, so the correct call is pvclust(t(x)), not pvclust(x) – Stéphane Laurent Jun 12 '12 at 13:51

It's not clear to me whether you only have a distance matrix, or whether you computed it yourself beforehand. In the former case, as already suggested by @Vincent, it would not be too difficult to tweak the R code of pvclust itself (using fix() or whatever; I provided some hints on another question on CrossValidated). In the latter case, the authors of pvclust provide an example of how to use a custom distance function, although that means you will have to install their "unofficial version".
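
For the case where the distance can be recomputed from the raw data, here is a minimal sketch of a custom distance function. It assumes a pvclust release in which method.dist accepts a function that takes the data matrix and returns a dist object over its columns (documented for the 2.x series on CRAN, as far as I know). The name `cosine_dist` and the simulated data are illustrative.

```r
# Custom distance between the COLUMNS of a data matrix (pvclust clusters columns).
cosine_dist <- function(x) {
  x <- as.matrix(x)
  cross <- crossprod(x)                     # column-by-column inner products
  norms <- sqrt(diag(cross))                # Euclidean norm of each column
  res <- as.dist(1 - cross / outer(norms, norms))
  attr(res, "method") <- "cosine"
  res
}

# Hypothetical data: 50 observations (rows) of 6 variables (columns).
set.seed(42)
dat <- matrix(rnorm(50 * 6), ncol = 6, dimnames = list(NULL, paste0("v", 1:6)))
d_custom <- cosine_dist(dat)

# Only run pvclust if a version that accepts a function is installed (assumption: >= 2.0.0).
if (requireNamespace("pvclust", quietly = TRUE) &&
    utils::packageVersion("pvclust") >= "2.0.0") {
  fit <- pvclust::pvclust(dat, method.hclust = "average",
                          method.dist = cosine_dist, nboot = 100)
}
```

This only helps when you still have the data the distance was computed from, because the function is called again on every bootstrap sample.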

Community
chl
  • I've seen the unofficial version; however, I would prefer to avoid using it... After posting on stackoverflow I contacted the author of the pvclust function. This is his answer: "Since pvclust uses a bootstrap-based algorithm, using precomputed dist object is impossible in principle. I'm sorry I cannot be of help." – rlar Jan 20 '12 at 09:13
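
To illustrate why a bootstrap-based method needs the raw data, here is a rough base-R sketch of a single bootstrap replicate: rows are resampled with replacement and the distances between columns are recomputed from the resampled data. This is a simplified illustration on simulated data, not pvclust's actual implementation.

```r
# Hypothetical data: 30 observations (rows) of 5 variables (columns).
set.seed(7)
dat <- matrix(rnorm(30 * 5), ncol = 5, dimnames = list(NULL, paste0("v", 1:5)))

# Distances between columns on the original data.
orig_d <- dist(t(dat))

# One bootstrap replicate: resample rows, then recompute the column distances.
idx <- sample(nrow(dat), replace = TRUE)
boot_d <- dist(t(dat[idx, , drop = FALSE]))
boot_tree <- hclust(boot_d, method = "average")
```

A precomputed dist object carries no information about the individual rows, so `boot_d` cannot be derived from `orig_d` alone.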