1

I have a set of pairs. Each pair is represented as [i,1:2]. That is, the ith pair are the numbers in the first and second column in the ith row.

I need to sort these pairs into distinct groups, s.t. there is no element in any pair in the jth group that is in any group not j. For example:

EXAMPLE 1: DATA

> col1 <- c(3, 4, 6, 7, 10, 8)
> col2 <- c(6, 7, 3, 4, 3,  1)
> 
> dat <- cbind(col1, col2)
> rownames(dat) <- 1:nrow(dat)
> 
> dat
  col1 col2
1    3    6
2    4    7
3    6    3
4    7    4
5   10    3
6    8    1

For all pairs, it doesn't matter if the number is in column 1 or column 2, the pairs should be sorted into groups s.t. every number in every pair in every group exists only in one group. So the solved example would look like this.

  col1 col2 groups
1    3    6      1
2    4    7      2
3    6    3      1
4    7    4      2
5   10    3      1
6    8    1      3

Rows 1, 3, and 5 are grouped together because 1 and 3 contain the same numbers and 5 shares the number 3, so it must be grouped with them. 2 and 4 share the same distinct numbers so they are grouped together and 6 has unique numbers so it is left alone.

If we change the data slightly, note the following.

EXAMPLE 2: NEW DATA

Note what happens when we add a row that shares an element with row 6 and row 5.

  col1 col2 groups
1    3    6      1
2    4    7      2
3    6    3      1
4    7    4      2
5   10    3      1
6    8    1      1
7    1   10      1

The 10 in the 7th row connects it to the first group because it shares an elements with the 5th row. It also shares an element with the 6th row (the number 1), so the 6th row would instead be in group 1.

PROBLEM

Is there a simple way to form the groups? A vector operation? A sorting algorithm? It gets very nasty very quickly if you try to do it with a loop, since each subsequent row can change the membership of earlier rows, as demonstrated in the example.

  • Is this helpful? [Identify groups of linked episodes which chain together](http://stackoverflow.com/questions/12135971/identify-groups-of-linked-episodes-which-chain-together) – thelatemail Aug 07 '14 at 04:00
  • 1
    @MrFlick - I wouldn't call it a duplicate necessarily. It is very similar, but slightly different. – thelatemail Aug 07 '14 at 04:06
  • It looked like exact same igraph code would work. I might have missed something. – MrFlick Aug 07 '14 at 04:11
  • 1
    @MrFlick - it nearly does, though this question wants the results assigned to pairs of cases rather than each individual case. Thus requiring a bit of merge-ing back to get the intended result. The downvote is probably a bit harsh too (not sure if it was you). I know from experience that this is a rather difficult topic to search for if you don't know about graph structures in the first place. – thelatemail Aug 07 '14 at 04:32
  • @thelatemail Alright, if you feel it's substantially different enough, i've removed my close vote. I did not down vote; i didn't think it was a bad question, just thought it already had an answer. – MrFlick Aug 07 '14 at 04:36

2 Answers2

3

To draw on the old answer at: identify groups of linked episodes which chain together , which assigns a group to each individual value, you could try this to assign a group to each linked pair:

library(igraph)
g <- graph_from_data_frame(dat)
links <- data.frame(col1=V(g)$name,group=components(g)$membership)
merge(dat,links,by="col1",all.x=TRUE,sort=FALSE)

#  col1 col2 group
#1    3    6     1
#2    4    7     2
#3    6    3     1
#4    7    4     2
#5   10    3     1
#6    8    1     3
thelatemail
  • 91,185
  • 12
  • 128
  • 188
2

Your elements can be regarded as vertices in an undirected graph, and your pairs can be regarded as edges, and then (assuming that you want to find groups of minimal size -- if you don't, then e.g. the entire set of pairs could be labelled "Group 1") the groups you're looking for are the connected components in this graph. They can all be found in linear time with a depth-first or breadth-first search.

j_random_hacker
  • 50,331
  • 10
  • 105
  • 169