Possible Duplicate:
How to apply a hierarchical or k-means cluster analysis using R?

Consider these four matrices, which have the same number of columns but different numbers of rows:

library(gtools)  # provides combinations(), used below

m1 <- matrix(sample(c(-1, 0, 1), 15, replace = TRUE), 3)
m2 <- matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5)
m3 <- matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5)
m4 <- matrix(sample(c(-1, 0, 1), 30, replace = TRUE), 6)
rownames(m1) <- 1:3
rownames(m2) <- 4:8
rownames(m3) <- 9:13
rownames(m4) <- 14:19

I want to apply hclust() to these four matrices when arranged in the following format:

mat <- list(m1, m2, m3, m4)

unite <- rbind(m1, m2, m3, m4)
rownames(unite) <- 1:19
distUnite <- as.matrix(dist(unite, method = "manhattan"))

## index of the source matrix for each row of unite
grp <- rep(seq_along(mat), sapply(mat, nrow))

## empty matrix for storing the distance between each pair of matrices
dist4m <- matrix(0, nrow = 4, ncol = 4)
indices <- combinations(4, 2)
for (k in seq_len(nrow(indices))) {
    s1 <- indices[k, 1]
    s2 <- indices[k, 2]
    ## mean Manhattan distance between the rows of matrix s1 and matrix s2
    pairmean <- mean(distUnite[grp == s1, grp == s2])
    dist4m[s1, s2] <- pairmean
    dist4m[s2, s1] <- pairmean
}

print(dist4m)
## then use hclust(), and plot()

The above script should work, but I am wondering whether there is a more efficient and reliable way to do this.

Thank you for your advice.

Matt
  • You'll find some help with the clustering part of this answered [here](http://stackoverflow.com/q/5648383/429846) – Gavin Simpson Oct 17 '12 at 09:07
  • Thanks for your info, but these are different questions. – Matt Oct 17 '12 at 10:12
  • I appreciate the dissimilarity bit is different. The clustering bit is not. – Gavin Simpson Oct 17 '12 at 11:07
  • So is the question how to perform cluster analysis on the matrix `dist4m`? – Gavin Simpson Oct 17 '12 at 14:36
  • hclust(dist4m) is not a problem. My question is how to get dist4m efficiently? – Matt Oct 17 '12 at 14:48
  • Well, you haven't really stated that clearly yet. So that I don't have to guess or decipher what you are using to determine the dissimilarity between matrices: is there a name for the method you are using to compare them? – Gavin Simpson Oct 17 '12 at 14:55
  • Compute the similarity of two matrices, this is what I knew. – Matt Oct 17 '12 at 15:00
  • Now Jim, you are being obtuse. How are you trying to compute the similarity between the matrices? What is the metric or method by which you are defining the similarity of two matrices? For a single matrix I might state that I define (dis)similarity as the Euclidean distance between vectors. You are doing all sorts of sampling. If I don't know what you are trying to achieve (and I can't tell from the code!), you need to tell us how you define similarity between matrices so we can see if there are more efficient ways. – Gavin Simpson Oct 17 '12 at 15:09
  • Thanks for your suggestions. What I am looking for is related to: http://users.mccammon.ucsd.edu/~bgrant/bio3d/html/dist.xyz.html I will try this method here. – Matt Oct 17 '12 at 15:33

1 Answer

5

Grouping them (I'm assuming you want to cbind and fill):

m.list <- list(m1, m2, m3, m4)
n <- max(sapply(m.list, nrow))
## pad each matrix with NA rows up to n rows, then bind them column-wise
m.all <- do.call(cbind, lapply(m.list, function(x)
    rbind(x, matrix(NA, n - nrow(x), ncol(x)))))

m.dist <- dist(m.all)
m.hclust <- hclust(m.dist)
plot(m.hclust)
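
Note that `dist()` compares rows, so the call above yields distances between the 6 padded rows of `m.all`, not between the 4 matrices. A minimal sketch to check this, assuming matrices built as in the question (the seed value is arbitrary and only there for reproducibility):

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility
m1 <- matrix(sample(c(-1, 0, 1), 15, replace = TRUE), 3)
m2 <- matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5)
m3 <- matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5)
m4 <- matrix(sample(c(-1, 0, 1), 30, replace = TRUE), 6)

m.list <- list(m1, m2, m3, m4)
n <- max(sapply(m.list, nrow))
m.all <- do.call(cbind, lapply(m.list, function(x)
    rbind(x, matrix(NA, n - nrow(x), ncol(x)))))

dim(m.all)             # rows = longest matrix, columns = 4 * 5
m.dist <- dist(m.all)  # NA entries are skipped and the sum is rescaled
attr(m.dist, "Size")   # number of objects compared: one per padded row
```

Because `dist()` drops columns with missing values pair by pair and rescales the sum, the NA padding does not produce errors, but rows from shorter matrices are compared on fewer columns.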

Individually:

m1 <- matrix(sample(c(-1, 0, 1), 15, replace=T), 3) 
m2 <- matrix(sample(c(-1, 0, 1), 25, replace=T), 5)
m3 <- matrix(sample(c(-1, 0, 1), 25, replace=T), 5)
m4 <- matrix(sample(c(-1, 0, 1), 30, replace=T), 6)

m1.dist <- dist(m1)
m2.dist <- dist(m2)
m3.dist <- dist(m3)
m4.dist <- dist(m4)

m1.hclust <- hclust(m1.dist)
m2.hclust <- hclust(m2.dist)
m3.hclust <- hclust(m3.dist)
m4.hclust <- hclust(m4.dist)

plot(m1.hclust)
plot(m2.hclust)
plot(m3.hclust)
plot(m4.hclust)
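
The repeated calls above could also be written with `lapply()` over a named list. This is only a sketch of the same per-matrix clustering; the list names and the 2 x 2 plot layout are assumptions for illustration:

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility
m.list <- list(
    m1 = matrix(sample(c(-1, 0, 1), 15, replace = TRUE), 3),
    m2 = matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5),
    m3 = matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5),
    m4 = matrix(sample(c(-1, 0, 1), 30, replace = TRUE), 6)
)

## one hclust object per matrix, clustering the rows of each matrix
hc.list <- lapply(m.list, function(m) hclust(dist(m)))

## draw the four dendrograms on one page
op <- par(mfrow = c(2, 2))
for (nm in names(hc.list)) plot(hc.list[[nm]], main = nm)
par(op)
```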

Brandon Bertelsen
  • Thanks Brandon! The first method is very good; it makes more sense to me. Is there any other method to calculate the distance matrix of (m1 and m2, m3, m4; m2 and m3, m4; m3 and m4) without introducing NA? – Matt Oct 17 '12 at 07:28
  • You could convert NA to zero: `m.all[is.na(m.all)] <- 0` – Brandon Bertelsen Oct 17 '12 at 13:41
  • I added one algorithm here: http://stackoverflow.com/questions/12936050/hierarchical-clustering-based-on-the-similarity-of-different-matrices It is kind of complicated, so I want to know whether there is a better way to get the distance matrix. – Matt Oct 17 '12 at 14:00
  • Sorry, I'm not really clear on what you would like beyond what I've already provided. You can't calculate a distance matrix between two or more matrices without combining them somehow. – Brandon Bertelsen Oct 17 '12 at 14:02
  • You are right. But I am not sure whether I can still use your method when two matrices differ a lot in size, e.g. dim(A) = (4, 5), dim(B) = (15, 5). – Matt Oct 17 '12 at 14:47
  • Hi Brandon, in the first method (grouping the matrices) you computed the pairwise distances of the 6 rows of the combined matrix, so it is not really the pairwise distance of the 4 matrices. – Matt Oct 17 '12 at 15:08
  • It's starting to sound like your problem isn't about clustering. It's about how to merge data. If so, you should start a new question that explains the shape the matrix needs to be in order to calculate dist() correctly. – Brandon Bertelsen Oct 17 '12 at 16:16
  • Thanks, it is a good idea to start a new question. – Matt Oct 17 '12 at 21:47
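
For reference, the 4 x 4 matrix of mean between-matrix Manhattan distances discussed in this thread can probably be computed without an explicit loop over pairs. The following is a sketch only, assuming the definition of `dist4m` from the question (mean of all row-to-row distances between two matrices, zero on the diagonal); `grp`, `block.sum`, and `sizes` are names introduced here for illustration:

```r
set.seed(42)  # arbitrary seed, an assumption for reproducibility
m.list <- list(
    matrix(sample(c(-1, 0, 1), 15, replace = TRUE), 3),
    matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5),
    matrix(sample(c(-1, 0, 1), 25, replace = TRUE), 5),
    matrix(sample(c(-1, 0, 1), 30, replace = TRUE), 6)
)

unite <- do.call(rbind, m.list)
grp <- rep(seq_along(m.list), sapply(m.list, nrow))  # source matrix of each row
distUnite <- as.matrix(dist(unite, method = "manhattan"))

## sum the row-to-row distances within each (matrix i, matrix j) block
block.sum <- rowsum(t(rowsum(distUnite, grp)), grp)

## divide each block sum by the number of row pairs in that block
sizes <- tabulate(grp)
dist4m <- block.sum / outer(sizes, sizes)
diag(dist4m) <- 0  # the question leaves the diagonal at zero

## hclust() expects a "dist" object, hence as.dist()
hc <- hclust(as.dist(dist4m))
plot(hc)
```

This does not pad with NA, so matrices with very different numbers of rows are handled the same way as similar-sized ones; whether a mean of row distances is the right notion of similarity between matrices is a separate question.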