how to take data relationships and assembles the data into groups based on those relationships in R?

Question

I am trying to match all customers in data set and give similar id to those whose zip code match 100%, address and email match 85%. I was able to do this using Record Linkage package in R. Now I have result like this:

x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))

ID1 ID2
1   2
2   5
3   4
5   11
10  11
11  18
12  18

But i want to group together all IDs which match like 1,2,5,11,10,12,18 are all same so i would like to give them same id.

Basically I want output like this:

Group   Key
1        1
1        2
1        5
1       11
1       10
1       12
1       18
3        3
3        4

Welcome to stackoverflow, @Ankita Gupta. I'm not sure from your question what you're looking for. What's the pattern for determining which group each ID should go into? — shirewoman2, Apr 11 '19 at 18:17
see also https://stackoverflow.com/questions/27466321/combine-pairs-of-integers-based-on-common-element — Michael, Apr 11 '19 at 19:12
https://stackoverflow.com/questions/38663170/grouping-linked-unique-id-pairs-using-r?rq=1 — Michael, Apr 11 '19 at 19:18
https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns — Michael, Apr 11 '19 at 19:18

score 1 · Answer 1 · edited Jan 30 '23 at 09:51

1

The code below gives me the output I am looking for:

x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))
library(igraph) 
y = graph.data.frame(x) 
z = stack(clusters(y)$membership)

edited Jan 30 '23 at 09:51

Maël

45,206
3
29
67

answered Apr 11 '19 at 20:09

Ankita

23
6

Michael · Accepted Answer · 2019-04-11T19:08:59.923

I've encountered this issue before and I've wondered whether there's a fast solution in R. If you're trying to find intervals which group rows together it's possible to do this using data.table::foverlaps (see Is it possible to use the R data.table function foverlaps to find the intersection of overlapping ranges in two tables?), but there's no way that I'm familiar for grouping sets (rather than numeric intervals) together. Here's a solution but it's probably not the fastest approach.

    x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))


sets <- list()
for(i in 1:nrow(x)){
  temp <- unique(unlist(x[i,]))
  if(length(sets)==0){
    sets[[1]] <- temp
  }else{
    in_sets <- sapply(sets, function(s)any(temp%in%s))  
    if(sum(in_sets)==0){
      sets[[length(sets)+1]] <- temp
    }
    if(sum(in_sets)==1){
      sets[[which(in_sets)]] <- union(sets[[which(in_sets)]],temp)
    }

    if(sum(in_sets)>1){
      sets[[which.min(in_sets)]] <-  union(unlist(sets[in_sets]),temp)
      sets[which(in_sets)[-1] ] <-  NULL
    }

  }

}

do.call("rbind",mapply(sets,1:length(sets),SIMPLIFY=FALSE,FUN=function(x,n){
  data.frame(Group=n,Key=x)
}))

Thank you. Your code works. But i have large data set so I want to avoid loops. I found a package in R which does this and its takes very less time to run. library(igraph) y = graph.data.frame(x) z = stack(clusters(y)$membership) This code gives me the output I am looking for. — Ankita, Apr 11 '19 at 19:27

how to take data relationships and assembles the data into groups based on those relationships in R?

2 Answers2