Based on an almost identical question, I am trying to create a unique ID based on several columns, where rows should be grouped into the same ID if "there exists a path through any combination of the columns". The difference is that I have NAs that should not be used to link rows.
The goal is for R to create id3 based on id1 and id2 (minimal example below).
For example, id1=1 is related to a and b of id2. But id1=2 is also related to a, so both belong to one group (id3=group1). But since id1=2 and id1=3 share id2=c, id1=3 also belongs to that group (id3=group1). The values of the tuple ((1,2),('a','b','c')) appear nowhere else, so no other row belongs to that group (which is labeled group1 generically).
library(igraph)
df = data.frame(id1 = c(1,1,2,2,3,3,4,4,5,5,6,6,NA,NA),
                id2 = c('a',NA,'a','c','c','d','x',NA,'y','z','x','z',NA,NA),
                id3 = c(rep('group1',6), rep('group2',6),NA,NA))
My solution fails with NA values.
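# Build an undirected graph from the (id1, id2) pairs; id3 becomes an edge attribute
g <- graph_from_data_frame(df, FALSE)
# Component membership per vertex (components() replaces the deprecated clusters())
cg <- components(g)$membership
# Look up each row's component by vertex name rather than by position
df$id4 <- cg[as.character(df$id1)]
df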
Observations (rows) 2 and 8 are linked because both have NA for id2, but this should be ignored. Is there a way t
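One possible direction, as a rough sketch rather than a confirmed solution: build the graph only from rows where both id1 and id2 are present, so that NA never becomes a vertex that links rows. The `id1_`/`id2_` prefixes, the explicit vertex list, and the fallback to id2 when id1 is missing are my own assumptions and are not part of the original attempt.

```r
library(igraph)

df <- data.frame(
  id1 = c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, NA, NA),
  id2 = c("a", NA, "a", "c", "c", "d", "x", NA, "y", "z", "x", "z", NA, NA)
)

# Assumption: prefix the values so an id1 value can never collide with an id2 value.
v1 <- ifelse(is.na(df$id1), NA, paste0("id1_", df$id1))
v2 <- ifelse(is.na(df$id2), NA, paste0("id2_", df$id2))

# Only rows with both values present contribute an edge.
complete <- !is.na(v1) & !is.na(v2)
edges <- data.frame(from = v1[complete], to = v2[complete])

# Every non-missing value becomes a vertex, even if it only ever appears next to NA.
all_values <- c(v1, v2)
vertices <- data.frame(name = unique(all_values[!is.na(all_values)]))

g <- graph_from_data_frame(edges, directed = FALSE, vertices = vertices)
membership <- components(g)$membership

# Look up each row by id1, falling back to id2; rows with both NA stay NA.
key <- ifelse(is.na(v1), v2, v1)
df$id4 <- unname(membership[key])
df
```

With this construction rows 2 and 8 are still assigned through their id1 values, but the missing id2 no longer creates a connection between them.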