1

I am trying to match all customers in data set and give similar id to those whose zip code match 100%, address and email match 85%. I was able to do this using Record Linkage package in R. Now I have result like this:

x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))

ID1 ID2
1   2
2   5
3   4
5   11
10  11
11  18
12  18

But i want to group together all IDs which match like 1,2,5,11,10,12,18 are all same so i would like to give them same id.

Basically I want output like this:

Group   Key
1        1
1        2
1        5
1       11
1       10
1       12
1       18
3        3
3        4
Michael
  • 5,808
  • 4
  • 30
  • 39
Ankita
  • 23
  • 6
  • Welcome to stackoverflow, @Ankita Gupta. I'm not sure from your question what you're looking for. What's the pattern for determining which group each ID should go into? – shirewoman2 Apr 11 '19 at 18:17
  • see also https://stackoverflow.com/questions/27466321/combine-pairs-of-integers-based-on-common-element – Michael Apr 11 '19 at 19:12
  • https://stackoverflow.com/questions/38663170/grouping-linked-unique-id-pairs-using-r?rq=1 – Michael Apr 11 '19 at 19:18
  • https://stackoverflow.com/questions/45079559/make-a-group-indices-based-on-several-columns – Michael Apr 11 '19 at 19:18

2 Answers2

1

The code below gives me the output I am looking for:

x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))
library(igraph) 
y = graph.data.frame(x) 
z = stack(clusters(y)$membership) 
Maël
  • 45,206
  • 3
  • 29
  • 67
Ankita
  • 23
  • 6
0

I've encountered this issue before and I've wondered whether there's a fast solution in R. If you're trying to find intervals which group rows together it's possible to do this using data.table::foverlaps (see Is it possible to use the R data.table function foverlaps to find the intersection of overlapping ranges in two tables?), but there's no way that I'm familiar for grouping sets (rather than numeric intervals) together. Here's a solution but it's probably not the fastest approach.

    x <- data.frame(ID1=c(1,2, 3, 5, 10, 11, 12), ID2=c(2,5,4,11,11,18,18))


sets <- list()
for(i in 1:nrow(x)){
  temp <- unique(unlist(x[i,]))
  if(length(sets)==0){
    sets[[1]] <- temp
  }else{
    in_sets <- sapply(sets, function(s)any(temp%in%s))  
    if(sum(in_sets)==0){
      sets[[length(sets)+1]] <- temp
    }
    if(sum(in_sets)==1){
      sets[[which(in_sets)]] <- union(sets[[which(in_sets)]],temp)
    }

    if(sum(in_sets)>1){
      sets[[which.min(in_sets)]] <-  union(unlist(sets[in_sets]),temp)
      sets[which(in_sets)[-1] ] <-  NULL
    }

  }

}

do.call("rbind",mapply(sets,1:length(sets),SIMPLIFY=FALSE,FUN=function(x,n){
  data.frame(Group=n,Key=x)
}))
Michael
  • 5,808
  • 4
  • 30
  • 39
  • Thank you. Your code works. But i have large data set so I want to avoid loops. I found a package in R which does this and its takes very less time to run. library(igraph) y = graph.data.frame(x) z = stack(clusters(y)$membership) This code gives me the output I am looking for. – Ankita Apr 11 '19 at 19:27
  • you should answer your own question if you found a solution – Michael Apr 11 '19 at 19:28