
I am looking for a way to find clusters of size 2 (pairs). Is there a simple way to do that?

Imagine I have some kind of data where I want to match on x and y, like

library(cluster)
set.seed(1)

df = data.frame(id = 1:10, x_coord = sample(10,10), y_coord = sample(10,10))

I want to find the closest pairs, based on the distances computed from x_coord and y_coord:

# c(1,2) selects the id and x_coord columns; c(2,3) would select x_coord and y_coord
d = stats::dist(df[,c(1,2)], diag = TRUE)
h = hclust(d)
plot(h)

I get a dendrogram like the one below. I would like the pairs (9,10), (1,3), (6,7) and (4,5) to be grouped together, and cases 8 and 2 to be left unpaired and removed.

Maybe there is a more effective alternative for doing this than clustering.

Ultimately I would like to remove the unmatched ids, keep the pairs, and end up with a dataset like this one:

  id x_coord y_coord pair_id
   1       9       3       1
   3       7       5       1
   4       1       8       2
   5       2       2       2
   6       5       6       3
   7       3      10       3
   9       6       4       4
  10       8       7       4

[Image: dendrogram produced by plot(h)]

giac

1 Answer


You could use the element h$merge. Any row of this two-column matrix in which both entries are negative represents a merge of two singletons. Therefore you can do:

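# Rows of h$merge with two negative entries join two single observations
pairs   <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
# Label each id with the row of `pairs` it appears in (assumes id equals row number)
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]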

df
#>    id x_coord y_coord pair
#> 1   1       9       3    4
#> 3   3       7       5    4
#> 4   4       1       8    1
#> 5   5       2       2    1
#> 6   6       5       6    2
#> 7   7       3      10    2
#> 9   9       6       4    3
#> 10 10       8       7    3
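To see why filtering h$merge on negative values picks out the pairs, it may help to inspect the matrix on a tiny example. This is a minimal sketch: the `toy` data and the names `h_toy`, `both_singletons` and `toy_pairs` are made up for illustration and are not part of the question's data.

```r
# Toy data (hypothetical): four points on a line; 1 and 2 are close, 3 and 4 are close
toy   <- data.frame(v = c(0, 1, 10, 12))
h_toy <- hclust(dist(toy))

# Each row of `merge` is one merge step. Negative entries are single observations;
# positive entries refer to the cluster formed at that earlier step.
h_toy$merge

# Rows in which both entries are negative join two singletons
both_singletons <- apply(h_toy$merge, 1, function(x) all(x < 0))
toy_pairs <- -h_toy$merge[both_singletons, , drop = FALSE]
toy_pairs
```

Here the first two merge steps join observations 1 with 2 and 3 with 4, and the last step joins those two clusters, so only the first two rows are kept.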

Note that the pair numbers follow the order in which the pairs were merged, that is, increasing "height" on the dendrogram. If you want them numbered in order of first appearance in the data frame, you can add the line

df$pair <- as.numeric(factor(df$pair, levels = unique(df$pair)))
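As a small worked example of that renumbering, here it is applied to the pair column shown in the output above. The vector names `pair` and `relabelled` are only for illustration.

```r
# Pair labels from the output above, numbered by merge order
pair <- c(4, 4, 1, 1, 2, 2, 3, 3)

# Using the order of first appearance as the factor levels renumbers the labels
relabelled <- as.numeric(factor(pair, levels = unique(pair)))
relabelled
#> [1] 1 1 2 2 3 3 4 4
```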

Anyway, if we repeat your plotting code on our newly modified df, we can see there are no unpaired singletons left:

d = stats::dist(df[,c(1,2)], diag = TRUE)
h = hclust(d)
plot(h)

[Image: dendrogram of the filtered df, with no unpaired singletons]

And we can see the method scales nicely:

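# Same steps on 50 points
df = data.frame(id = 1:50, x_coord = sample(50), y_coord = sample(50))
d = stats::dist(df[,c(1,2)], diag = TRUE)
h = hclust(d)
# Keep only merges of two singletons, then label and filter as before
pairs   <- -h$merge[apply(h$merge, 1, function(x) all(x < 0)), , drop = FALSE]
df$pair <- (match(df$id, c(pairs)) - 1) %% nrow(pairs) + 1
df <- df[!is.na(df$pair),]
# Re-cluster the remaining rows to check the result
d = stats::dist(df[,c(1,2)], diag = TRUE)
h = hclust(d)
plot(h)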

[Image: dendrogram of the filtered 50-point df]
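The steps above could be wrapped into one function. The following is a rough sketch rather than a tested utility: `pair_singletons` and its arguments are hypothetical names, and it labels rows by position instead of matching on id, so it does not rely on id being equal to the row number. Note that the example call clusters on x_coord and y_coord, which differs from the column selection used earlier, so the pairs it finds need not match the output above.

```r
# Hypothetical helper (not part of the original answer):
# keep only rows that hclust merges as singleton pairs, labelled by merge order
pair_singletons <- function(data, cols) {
  h <- hclust(dist(data[, cols, drop = FALSE]))

  # Merge steps that join two single observations
  is_pair <- apply(h$merge, 1, function(x) all(x < 0))
  pairs   <- -h$merge[is_pair, , drop = FALSE]

  # Negative entries of h$merge are row positions, so index by position
  pair <- rep(NA_integer_, nrow(data))
  pair[pairs[, 1]] <- seq_len(nrow(pairs))
  pair[pairs[, 2]] <- seq_len(nrow(pairs))

  data$pair <- pair
  data[!is.na(data$pair), , drop = FALSE]
}

set.seed(1)
demo_df <- data.frame(id = 1:10, x_coord = sample(10), y_coord = sample(10))
paired  <- pair_singletons(demo_df, c("x_coord", "y_coord"))
paired
```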

Allan Cameron
  • Fantastic answer. Thank you! – Werner Hertzog Sep 12 '20 at 21:00
  • Very nice, thank you. Someone should write a general function or package for finding groups with a fixed number of units using clustering. Can you generalise from your code? – giac Sep 14 '20 at 06:43
  • @giac I think I can see a way of doing it by generalizing this method, but it would be more complex. It would involve "walking up" the tree from the leaf nodes. There is enough information in the `merge` object to allow this. I think a single moderately long function would suffice rather than a full package, but it's not trivial. – Allan Cameron Sep 14 '20 at 07:27
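
One possible reading of the "walking up the tree" idea is sketched below. It is an assumption about what such a generalisation might look like, not the commenter's own implementation: the function name `groups_of_size` is hypothetical, and it simply collects the members of every tree node and keeps the nodes that contain exactly k observations. It stores the full member list for every merge step, so it is meant for illustration rather than large data.

```r
# Hypothetical sketch: return the tree nodes of an hclust object
# that contain exactly k observations
groups_of_size <- function(h, k) {
  n_steps <- nrow(h$merge)
  members <- vector("list", n_steps)

  # Merge steps are ordered bottom-up, so the children of step i are already known
  for (i in seq_len(n_steps)) {
    parts <- lapply(h$merge[i, ], function(j) {
      if (j < 0) -j else members[[j]]  # negative: a leaf; positive: an earlier step
    })
    members[[i]] <- unlist(parts)
  }

  members[lengths(members) == k]
}

# Made-up data: five points on a line
h_demo <- hclust(dist(c(0, 1, 10, 12, 30)))
groups_of_size(h_demo, 2)
groups_of_size(h_demo, 4)
```

For k = 2 this picks out the same singleton pairs as the h$merge filter in the answer. Two nodes with the same number of leaves cannot be nested, so the returned groups never overlap.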