2

I have the following simple problem. I have a distance matrix for a number of nodes, and I would like to get a list of subsets of this nodes such that within each subset, every two nodes are at a minimum distance dmin. That is, initially each two nodes are connected by an edge that has an associated value. I want to remove every edge for which the value is smaller than dmin, and list all the resulting disconnected graphs.

Essentially I want to get clusters of data points that are really close to each other, not with a clustering algorithm, but instead by using a threshold value for the distance.

My question is, naturally, how can I accomplish it in R. Consider the following matrix m:

    a   b   c   d
a 1.0 0.9 0.2 0.3
b 0.9 1.0 0.4 0.1
c 0.2 0.4 1.0 0.7
d 0.3 0.1 0.7 1.0

There are four nodes (a, b, c, d). I search a function or package that given that matrix (which is in fact 1 - distance matrix) and a threshold dmin, for example dmin <- 0.5, would produce two sets: {a,b} and {c,d}. A very inefficient way of accomplishing it is as follows:

clusters <- list()
nodes <- colnames( m )
dmin <- 0.5

# loop over nodes
for( n in nodes ) {

  found <- FALSE
  # check whether a node can be associated to one of the existing
  # clusters
  for( c in names( clusters ) ) {
    if( any( m[ n, clusters[[c]] ] > 0.5 ) ) {
      clusters[[c]] <- c( clusters[[c]], n )
      found <- TRUE
      next
    }
  }

  # no luck? create a new cluster for that node
  if( ! found )
    clusters[[n]] <- c( n )
} 

The result would be

> clusters
$a
[1] "a" "b"

$c
[1] "c" "d"
January
  • 16,320
  • 6
  • 52
  • 74
  • And your question is? Perhaps making a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) would be a good idea. – Thomas Aug 07 '13 at 13:15
  • Your question is unclear: in the first paragraph, you ask for subsets of points that are distant from one another ("at a minimum distance `dmin`" -- it is a [graph colouring problem](https://en.wikipedia.org/wiki/Graph_coloring), for the graph whose edges have length at most `dmin`), but in the second, you ask for "clusters of points that are close to each other". – Vincent Zoonekynd Aug 07 '13 at 13:36
  • I'm sorry for that confusion. @VincentZoonekynd: yep, graph colouring problem it is, I think. The second formulation is not exact. The confusion is due to the fact that the elements of matrix m are 1 - distances (actually, `1 - cor( y )^2` where y contains one row of measurements per node). I want to find groups of nodes which are highly correlated -- by cutting the graph. Yes, there are other clustering approaches, I am using them, but I wanted to try this as well. – January Aug 07 '13 at 13:47
  • @Thomas I did as you have suggested. – January Aug 07 '13 at 13:48

1 Answers1

2

From your similarity matrix m, you can build the adjacency matrix as m > .5, construct the corresponding graph with the igraph package and extract its connected components.

m <- matrix(c(10,9,2,3, 9,10,4,1, 2,4,10,7, 3,1,7,10), 4, 4)/10
colnames(m) <- rownames(m) <- letters[1:4]
library(igraph)
g <- graph.adjacency( m > .5 )
plot(g)
clusters(g)$membership
# [1] 1 1 2 2
tapply(colnames(m), clusters(g)$membership, c)
# $`1`
# [1] "a" "b"
# $`2`
# [1] "c" "d"
Vincent Zoonekynd
  • 31,893
  • 5
  • 69
  • 78
  • Yes! that's it. Thanks for being patient. As a biologist, I often find myself lacking the right terminology (otherwise I would be able to google it...) – January Aug 07 '13 at 13:51