I'm trying to link together pairs of unique IDs using R. In the example below, each row holds two IDs (ID1 and ID2) that are linked, and I want to group all rows that are connected, directly or indirectly. Here A is linked to B, which is linked to D, which is linked to E; because these are all connected, they should form one group. X is linked to both Y and Z, so those rows should form a second group. How can I tackle this using R?

Thanks!

Example data:

ID1 ID2
A   B
B   D
D   E
X   Y
X   Z

dput() representation:

structure(list(id1 = structure(c(1L, 2L, 3L, 4L, 4L), .Label = c("A", "B", "D", "X"), class = "factor"), id2 = structure(1:5,.Label = c("B", "D", "E", "Y", "Z"), class = "factor")), .Names = c("id1", "id2"), row.names = c(NA, -5L), class = "data.frame")

Output needed:

ID1 ID2 GROUP
A   B   1
B   D   1
D   E   1
X   Y   2
X   Z   2
Floris
  • 3
    Use the igraph package and its tools for identifying connected components. https://en.wikipedia.org/wiki/Connected_component_(graph_theory) – Frank Jul 29 '16 at 16:08
  • 1
    @akrun The code of that answer does not seem to match what we have here (taking data through an erdos renyi game, whatever that is) and the question is asked on a much higher level (in terms of knowledge of graph theory) than the OP here or others who run into this problem, can be expected to have. I suspect there is a dupe but am not a fan of that one. Gonna revert for now. – Frank Jul 29 '16 at 16:57
  • 1
    @Frank/@Steven your answer works perfectly in my situation. My actual data is tens of thousands of pairs and several hundreds of groups. – Floris Jul 29 '16 at 18:15
  • What if there are some NA values in the dataset? How would the membership change? – akrun Jul 29 '16 at 19:04
  • In my data there is no NA so I wouldn't know how that would affect the results. The linkage data frame only shows rows where id1 and id2 are linked. – Floris Jul 29 '16 at 19:26
  • @rui-barradas why would you close the older question in favor of the newer question? Not important, but just asking. Seems like the second is a duplicate of the first and not the other way around? – Floris Apr 29 '20 at 21:01
  • @8245406 does this work? – Floris Apr 29 '20 at 21:03
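On the NA question raised in the comments above, here is a minimal sketch of one way to handle it. It assumes that an NA means "no link", so incomplete rows are dropped before the graph is built; the row with the hypothetical ID "Q" and the `group` column name are made up for illustration. It also uses `graph_from_data_frame()` and `components()` from the igraph package.

```r
library(igraph)

# Hypothetical variant of the example data: "Q" has no partner
df <- data.frame(
  id1 = c("A", "B", "Q", "X", "X"),
  id2 = c("B", "D", NA, "Y", "Z"),
  stringsAsFactors = FALSE
)

# Keep only rows where both IDs are present (assumption: NA means "no link")
linked <- df[complete.cases(df[, c("id1", "id2")]), ]

# Build an undirected graph from the complete rows and label its components
g <- graph_from_data_frame(linked, directed = FALSE)
membership <- components(g)$membership

# Look up each row's group by name; IDs absent from the graph get NA
df$group <- unname(membership[as.character(df$id1)])
df
```

With this approach an unlinked ID ends up with an NA group rather than being attached to any other row.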

1 Answer

As @Frank mentioned in the comments, you can use igraph:

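library(igraph)
# Build a graph from the edge list: each row of df is an edge from id1 to id2
idf <- graph.data.frame(df)
# Label every vertex with the number of its connected component
clusters(idf)$membership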

Which gives:

A B D X E Y Z 
1 1 1 2 1 2 2 

Should you want to assign the result back to rows of df:

merge(df, stack(clusters(idf)$membership), by.x = "id1", by.y = "ind", all.x = TRUE)
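For reference, a self-contained sketch of the same approach on the question's data. It assumes a recent igraph, where `graph_from_data_frame()` and `components()` are the current names for `graph.data.frame()` and `clusters()`; the `group` column name is just a choice made here. It looks each row's group up by name instead of calling `merge()`, which keeps the original row order.

```r
library(igraph)

# Example data from the question
df <- data.frame(
  id1 = c("A", "B", "D", "X", "X"),
  id2 = c("B", "D", "E", "Y", "Z"),
  stringsAsFactors = FALSE
)

# Treat each row as an undirected edge between id1 and id2
g <- graph_from_data_frame(df, directed = FALSE)

# Named vector mapping each vertex name to its component number
membership <- components(g)$membership

# Index by name; as.character() guards against id1 being a factor,
# in which case indexing would silently use the integer codes instead
df$group <- unname(membership[as.character(df$id1)])
df
```

Because both IDs in a row always belong to the same component, looking up `id1` alone is enough to label the row.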
Steven Beaupré
  • OP apparently wants that assigned back to rows of `df`, maybe `merge(df, stack(clusters(idf)$membership), by.x="id1", by.y="ind", all.x=TRUE)` – Frank Jul 29 '16 at 16:17
  • This looks like a great solution! Thanks. Do you know if igraph and dplyr are compatible? I'm getting compatibility warnings when I load igraph: The following objects are masked from ‘package:tidyr’: %>%, crossing. The following objects are masked from ‘package:dplyr’: %>%, as_data_frame, groups, union – Floris Jul 29 '16 at 16:24
  • @Floris It is simply telling you that there are functions in the `igraph` package that are "masking" functions of the same name in the `dplyr` and `tidyr` packages. Functions in a package being loaded with the same names as functions in a package already loaded will mask those previously loaded functions. – Steven Beaupré Jul 29 '16 at 16:27
  • 2
    Yes I know this and was worried igraph functions would break dplyr functions (I use dplyr for everything, really) but it looks like everything is fine. And I suppose I can just load dplyr after igraph and not have to worry at all. – Floris Jul 29 '16 at 16:36
  • If useful for anyone, you can also add the number of group members using the following: left_join(df, data.frame(values=1:length(clusters(idf)$csize), csize=clusters(idf)$csize)) – Floris Jul 29 '16 at 18:35
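The last two points in this thread (masking and group sizes) can be sketched together. This is only an illustration under a few assumptions: it calls igraph through `igraph::` without attaching the package, so nothing from dplyr or tidyr gets masked, and the `group` and `group_size` column names are made up here.

```r
# Example data from the question
df <- data.frame(
  id1 = c("A", "B", "D", "X", "X"),
  id2 = c("B", "D", "E", "Y", "Z"),
  stringsAsFactors = FALSE
)

# Namespace-qualified calls: igraph is never attached via library()
g <- igraph::graph_from_data_frame(df, directed = FALSE)
comp <- igraph::components(g)

# Component number of each row, looked up through its id1 value
df$group <- unname(comp$membership[as.character(df$id1)])

# csize[k] is the number of distinct IDs (vertices) in component k
df$group_size <- comp$csize[df$group]
df
```

Note that `csize` counts IDs, not rows: the A-B-D-E group has three rows but four members.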