
I would like to generate indices to group observations based on two columns. The groups should be made of observations that share at least one value in common, either directly or through a chain of other observations.

In the data below, I want to check if values in 'G1' and 'G2' are connected directly (appear on the same row), or indirectly via other intermediate values. The desired grouping variable is shown in 'g'.

For example, A is directly linked to Z (row 1) and X (row 2). A is indirectly linked to 'B' via X (A -> X -> B), and further linked to Y via X and B (A -> X -> B -> Y).

dt <- data.frame(id = 1:10,
                 G1 = c("A","A","B","B","C","C","C","D","E","F"),
                 G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
                 g = c(1,1,1,1,2,2,2,3,4,4))

dt
#    id G1 G2 g
# 1   1  A  Z 1
# 2   2  A  X 1
# 3   3  B  X 1
# 4   4  B  Y 1
# 5   5  C  W 2
# 6   6  C  V 2
# 7   7  C  U 2
# 8   8  D  s 3
# 9   9  E  T 4
# 10 10  F  T 4

I tried with group_indices from dplyr, but haven't managed it.

Maël
Malta

1 Answer


Using igraph, get the component membership of each value, then map it back onto the rows by name:

library(igraph)

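# df1 is assumed to be the question's data without the desired column g
df1 <- dt[, c("id", "G1", "G2")]

# convert to graph (G1 and G2 are the edge endpoints, id becomes an edge
# attribute), and get clusters membership ids
g <- graph_from_data_frame(df1[, c(2, 3, 1)])
myGroups <- components(g)$membership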

myGroups 
# A B C D E F Z X Y W V U s T 
# 1 1 2 3 4 4 1 1 1 2 2 2 3 4 

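# then map group ids onto the rows via the vertex names
# (as.character() guards against G1 being stored as a factor)
df1$group <- myGroups[as.character(df1$G1)]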


df1
#    id G1 G2 group
# 1   1  A  Z     1
# 2   2  A  X     1
# 3   3  B  X     1
# 4   4  B  Y     1
# 5   5  C  W     2
# 6   6  C  V     2
# 7   7  C  U     2
# 8   8  D  s     3
# 9   9  E  T     4
# 10 10  F  T     4
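
If adding igraph as a dependency is not an option, the same grouping can probably be reproduced in base R. The sketch below is not the igraph method shown above but a simple label-propagation loop: every row starts in its own group, and rows that share a G1 value or a G2 value repeatedly take the smallest label among them until nothing changes. The helper name `group_linked` is hypothetical. Note one assumption that differs from the graph approach: G1 and G2 are treated as separate sets of names, so a string that appears in both columns would not link rows here.

```r
# Question's data, without the desired answer column g
dt <- data.frame(id = 1:10,
                 G1 = c("A","A","B","B","C","C","C","D","E","F"),
                 G2 = c("Z","X","X","Y","W","V","U","s","T","T"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: label each row with its connected component
group_linked <- function(a, b) {
  a <- as.character(a)
  b <- as.character(b)
  label <- seq_along(a)              # start with one group per row
  repeat {
    old <- label
    # rows sharing a value in `a` take the smallest label among them
    label <- ave(label, a, FUN = min)
    # same for rows sharing a value in `b`
    label <- ave(label, b, FUN = min)
    if (identical(label, old)) break # stop once the labels are stable
  }
  # renumber groups 1, 2, 3, ... in order of first appearance
  match(label, unique(label))
}

dt$group <- group_linked(dt$G1, dt$G2)
dt$group
# [1] 1 1 1 1 2 2 2 3 4 4
```

Each pass scans all rows, and a long chain of links may need several passes, so for large data the igraph version is likely the better choice.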
zx8754