
I have a relatively large data set (1,750,000 rows, 5 columns) which contains records with unique ID values (first column), described by four criteria (the other four columns). A small example would be:

# example
library(data.table)
dt <- data.table(id=c("a1","b3","c7","d5","e3","f4","g2","h1","i9","j6"), 
                 s1=c("a","b","c","l","l","v","v","v",NA,NA), 
                 s2=c("d","d","e","k","k","o","o","o",NA,NA),
                 s3=c("f","g","f","n","n","s","r","u","w","z"),
                 s4=c("h","i","j","m","m","t","t","t",NA,NA))

which looks like this:

   id   s1   s2 s3   s4
 1: a1    a    d  f    h
 2: b3    b    d  g    i
 3: c7    c    e  f    j
 4: d5    l    k  n    m
 5: e3    l    k  n    m
 6: f4    v    o  s    t
 7: g2    v    o  r    t
 8: h1    v    o  u    t
 9: i9 <NA> <NA>  w <NA>
10: j6 <NA> <NA>  z <NA>

My ultimate goal is to find all records that share a value in any of the description columns (disregarding NAs) and group them under a new ID, so that I can easily identify duplicated records. These new IDs are constructed by concatenating the IDs of all rows in a group.

Things get messier because records with duplicated descriptions can be linked both directly and indirectly. Therefore, I am currently doing this operation in two steps.

STEP 1 - Constructing duplicated IDs based on direct duplicates

# grouping ids that share a value in each of the description columns
# (sorry, I could not find how to search for duplicates across multiple columns simultaneously...)
# note: filtering on s1 excludes rows 9-10, which are NA in s1, s2 and s4
dt[!is.na(s1), ids1 := paste(id, collapse = "|"), by = s1]
dt[!is.na(s1), ids2 := paste(id, collapse = "|"), by = s2]
dt[!is.na(s1), ids3 := paste(id, collapse = "|"), by = s3]
dt[!is.na(s1), ids4 := paste(id, collapse = "|"), by = s4]

# getting a single duplicated ID per row: concatenate the four columns,
# then keep only the unique ids within each concatenation
dt$new.id <- apply(dt[, .(ids1, ids2, ids3, ids4)], 1, paste, collapse = "|")
dt$new.id <- sapply(dt$new.id,
                    function(x) paste(unique(strsplit(x, "\\|")[[1]]), collapse = "|"),
                    USE.NAMES = FALSE)

This operation results in the following table, with the unique duplicated ID defined as "new.id":

   id   s1   s2 s3   s4     ids1     ids2  ids3     ids4   new.id
 1: a1    a    d  f    h       a1    a1|b3 a1|c7       a1 a1|b3|c7
 2: b3    b    d  g    i       b3    a1|b3    b3       b3    b3|a1
 3: c7    c    e  f    j       c7       c7 a1|c7       c7    c7|a1
 4: d5    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 5: e3    l    k  n    m    d5|e3    d5|e3 d5|e3    d5|e3    d5|e3
 6: f4    v    o  s    t f4|g2|h1 f4|g2|h1    f4 f4|g2|h1 f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1 f4|g2|h1    g2 f4|g2|h1 f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1 f4|g2|h1    h1 f4|g2|h1 f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>     <NA>  <NA>     <NA>       NA
10: j6 <NA> <NA>  z <NA>     <NA>     <NA>  <NA>     <NA>       NA

Note that records "b3" and "c7" are duplicated indirectly through "a1" (all other examples are direct duplicates that should remain the same). That is why we need the next step.

STEP 2 - Updating the duplicated IDs based on indirect duplicates

#filtering the relevant columns for the indirect search
dt = dt[,.(id,new.id)]

#creating the regex pattern used by grepl() to look up each id inside 'new.id'
#e.g. for id "a1" the pattern matches "a1" at the start, middle, or end of the string
dt[, patt := paste0("^", id, "\\||\\|", id, "\\||\\|", id, "$")]

#converting the ID vector to factor and setting it as the data.table 'key'
#(an attempt to speed up the processing)
dt$new.id = as.factor(dt$new.id)
setkeyv(dt, "new.id")

#looping over the rows: find every row whose 'new.id' contains this row's id
#and adopt the largest (most complete) group found
library(stringr)
for(i in 1:nrow(dt)) {
  pat = dt$patt[i]            # retrieving the look-up pattern for this row
  tmp = dt[new.id %like% pat] # searching the pattern (%like% uses grepl())
  if(nrow(tmp) > 1) {
    x = which.max(str_count(tmp$new.id, "\\|")) # the entry with the most ids
    dt$new.id[i] = as.character(tmp$new.id[x])
  }
}

#filtering the final columns 
dt = dt[,.(id,new.id)]

The final table looks like:

   id   new.id
 1: a1 a1|b3|c7
 2: b3 a1|b3|c7
 3: c7 a1|b3|c7
 4: d5    d5|e3
 5: e3    d5|e3
 6: f4 f4|g2|h1
 7: g2 f4|g2|h1
 8: h1 f4|g2|h1
 9: i9       NA
10: j6       NA

Note that now the first three records ("a1","b3","c7") are grouped under a broader duplicated ID, which contains both direct and indirect records.

Everything is working fine, but my code is horrendously slow. It took 2 entire days to run half of the data set (~800,000 rows). I could parallelize the loop across different cores, but it would still take hours. And I am almost sure that I could use data.table functionality in a better way, maybe using set() inside the loop. I spent hours today trying to implement the same code in data.table, but I am new to its syntax and I am really having a hard time here. Any suggestions on how I could optimize this code?

Note: The slowest part of the code is the loop, and within the loop the most inefficient step is the grepl() look-up of the patterns inside the data.table. Setting a 'key' on the data.table seems like it should speed things up, but it did not change the time the grepl() took in my case.
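
If I understand data.table correctly, that is expected: a key only speeds up equality subsets via binary search, while a regex filter must still scan every row, key or no key. A minimal sketch of the difference, with an illustrative group value and pattern:

# equality subset: can use the key (binary search)
dt[new.id == "a1|b3|c7"]

# regex subset: always a full vector scan, regardless of the key
dt[new.id %like% "^a1\\||\\|a1\\||\\|a1$"]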

R. Lima
  • Have you considered vectorizing the operations? Since there are only 26 letters, you could create a variable for each letter/criteria column. Then, you could create an id for each observation having an a in s1 for example. It's not clear to me how these new ids will be helpful, however. – spazznolo Jun 24 '19 at 17:16
  • Thanks for your comment @spazznolo. I am using this code to search for duplicated plant specimens across hundreds of herbaria. In the real data, the id and descriptions are actually much larger and variable characters and with lots of missing descriptions (that is why I am looking for indirect duplicates). I have indeed tried to use 'sapply' to vectorize the loop, but because of the size of the data, it still took almost 2 days of computational time. Thanks again! – R. Lima Jun 24 '19 at 17:27
  • Are the descriptions standardized? Do you have a list of the specimens you are evaluating? – spazznolo Jun 24 '19 at 17:31
  • Thanks again @spazznolo! In the real data, my ids look like: "alcb_237" or "p_120356". The description columns are vectors that contain concatenated descriptions of each record and look like: "Melastomataceae_jesus_34_sao sebastiao passe" or "Fabaceae_costa_1022_juazeiro". So, I would not say that they are standardized. In addition, there are NAs in some of the description columns for some records, while for others they are complete. – R. Lima Jun 24 '19 at 17:48
  • Just to understand correctly, does it matter in which column a description appears? E.g., does the letter "k" in column `s2` have the same meaning as if it appeared in column `s4`? For instance, if the "h" in column `s4` in the 1st row (`id == "a1"`) became "k", would the expected result be `c(rep('a1|b3|c7|d5|e3', 5), rep('f4|g2|h1', 3))`? – Uwe Jun 25 '19 at 07:12
  • The letters in the description columns were just an example. Maybe a bad one in the case of 'o', because descriptions will most probably not be the same between columns. And no, changing 'h' to 'k' would not change the results, because the search for duplicates is done column by column (in the real data, repeated descriptions between columns should be rare and random). – R. Lima Jun 25 '19 at 10:51

2 Answers


You may approach this as a network problem. Here I use functions from the igraph package. The basic steps:

  1. melt the data to long format.

  2. Use graph_from_data_frame to create a graph, where 'id' and 'value' columns are treated as an edge list.

  3. Use components to get connected components of the graph, i.e. which 'id' are connected via their criteria, directly or indirectly.

  4. Select the membership element to get "the cluster id to which each vertex belongs".

  5. Join membership to original data.

  6. Concatenate 'id' grouped by cluster membership.


library(igraph)

# melt data to long format, remove NA values
d <- melt(dt, id.vars = "id", na.rm = TRUE)

# convert to graph
g <- graph_from_data_frame(d[ , .(id, value)])

# get components and their named membership id 
mem <- components(g)$membership

# add membership id to original data
dt[.(names(mem)), on = .(id), mem := mem] 

# for groups of length one, set 'mem' to NA
dt[dt[, .I[.N == 1], by = mem]$V1, mem := NA]

If desired, concatenate 'id' by the 'mem' column, for non-NA 'mem' only (IMHO this just makes further data manipulation more difficult ;)). Anyway, here we go:

dt[!is.na(mem), id2 := paste(id, collapse = "|"), by = mem]

#     id   s1   s2 s3   s4  mem      id2
#  1: a1    a    d  f    h    1 a1|b3|c7
#  2: b3    b    d  g    i    1 a1|b3|c7
#  3: c7    c    e  f    j    1 a1|b3|c7
#  4: d5    l    k  n    m    2    d5|e3
#  5: e3    l    k  n    m    2    d5|e3
#  6: f4    v    o  s    t    3 f4|g2|h1
#  7: g2    v    o  r    t    3 f4|g2|h1
#  8: h1    v    o  u    t    3 f4|g2|h1
#  9: i9 <NA> <NA>  w <NA>   NA     <NA>
# 10: j6 <NA> <NA>  z <NA>   NA     <NA>
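
If the end goal is simply to flag or drop duplicates, the integer 'mem' column can also be used directly; a minimal sketch, keeping the first row of each non-NA group:

# one representative row per duplicate group
reps <- dt[!is.na(mem), .SD[1L], by = mem]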

A basic plot of the graph in this small example, just to illustrate the connected components:

plot(g, edge.arrow.size = 0.5, edge.arrow.width = 0.8, vertex.label.cex = 2, edge.curved = FALSE)

[graph plot: the ids and their criteria values, with the three connected components visible]

Henrik
  • Dear @Henrik, thanks for your answer. But similar to the previous answer, I have omitted that the data set has lots of NAs in sp1:4, which should not be taken as equal. I have edited my question to include examples with NAs. I am trying to adapt your idea here. I will let you know if I can make it. Thanks again. – R. Lima Jun 24 '19 at 20:03
  • do you think a simple 'na.omit(d,cols = "value")' will be a fast solution? – R. Lima Jun 24 '19 at 20:27
  • Oh I like this, should be faster than repeated joins. – Alexis Jun 24 '19 at 20:59
  • Cannot thank you enough, @Henrik and Alexis. I just tested both solutions with my entire data set (1,758,253 rows). Both work perfectly well, but Henrik's solution took about 30 sec while Alexis's solution took about 3 min! From 2 days to less than 3 min of computational time is just magic! Henrik, thanks for the additional concatenation code, since it will be used for storage and retrieval purposes among collections. It was ultimately the slowest bit of the code, responsible for about 20 sec of the total time, but it was super fast anyway. Thanks again guys! – R. Lima Jun 24 '19 at 21:59
  • Thank you too @R.Lima for posting a nice first question, with toy data and the code you have tried. Cheers – Henrik Jun 24 '19 at 23:00

I think this recursive approach does what you want. Basically, it performs a self-join on each column, one at a time, and if more than one row is matched (i.e. there are rows other than the one being considered), it saves all unique ids from the match. It avoids the rows with NA by leveraging secondary indices. The trick is that we do the recursion twice: once with the ids, and again with the newly created new_ids.

dt[, new_id := .(list(character()))]

get_ids <- function(matched_ids, new_id) {
  if (length(matched_ids) > 1L) {
    list(unique(
      c(new_id[[1L]], unlist(matched_ids))
    ))
  } else {
    new_id
  }
}

find_recursively <- function(dt, cols, pass) {
  if (length(cols) == 0L) return(invisible())

  current <- cols[1L]
  next_cols <- cols[-1L]

  next_dt <- switch(
    pass,

    first = dt[!list(NA_character_),
               new_id := dt[.SD, .(get_ids(x.id, i.new_id)), on = current, by = .EACHI]$V1,
               on = current],

    second = dt[!list(NA_character_),
                new_id := dt[.SD, .(get_ids(x.new_id, i.new_id)), on = current, by = .EACHI]$V1,
                on = current]
  )

  find_recursively(next_dt, next_cols, pass)
}

find_recursively(dt, paste0("s", 1:4), "first")
find_recursively(dt, paste0("s", 1:4), "second")

dt[, new_id := sapply(new_id, function(nid) {
  ids <- unlist(nid)
  if (length(ids) == 0L) {
    NA_character_
  } else {
    paste(ids, collapse = "|")
  }
})]

print(dt)
    id   s1   s2 s3   s4   new_id
 1: a1    a    d  f    h a1|b3|c7
 2: b3    b    d  g    i a1|b3|c7
 3: c7    c    e  f    j a1|c7|b3
 4: d5    l    k  n    m    d5|e3
 5: e3    l    k  n    m    d5|e3
 6: f4    v    o  s    t f4|g2|h1
 7: g2    v    o  r    t f4|g2|h1
 8: h1    v    o  u    t f4|g2|h1
 9: i9 <NA> <NA>  w <NA>     <NA>
10: j6 <NA> <NA>  z <NA>     <NA>

The join uses data.table's by = .EACHI idiom: during a join, j is evaluated once for each row of the i table, on that row's matching rows in x.
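
For readers unfamiliar with it, a minimal sketch with made-up tables X and Y (names are illustrative only); the x. and i. prefixes pick columns from either side of the join:

library(data.table)
X <- data.table(k = c("a", "a", "b"), v = 1:3)
Y <- data.table(k = c("a", "b"))

# for each row of Y, evaluate j on its matching rows in X
X[Y, .(vals = list(x.v)), on = "k", by = .EACHI]
#    k vals
# 1: a  1,2
# 2: b    3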

Alexis
  • I have a feeling this can be improved, not sure if it's fast enough. I'll think about it. – Alexis Jun 24 '19 at 17:49
  • Thanks a million @Alexis. I ran a test to see how fast it goes on a sample of the original data. – R. Lima Jun 24 '19 at 17:55
  • Dear @Alexis, I was about to test the performance of the code you kindly provided, but I then realized that I have omitted a very important thing: the data set has lots of NAs in sp1:4, which should not be taken as equal. I have edited my question to include examples with NAs, but I could not adapt your code to do the same. Any ideas where I should state something similar to a !is.na(sp1:4) in your code? Thanks again. – R. Lima Jun 24 '19 at 18:59
  • @R.Lima I've adapted the code, I was using the idiom wrong anyway. I'm still not sure if this is the best approach, but definitely give it a try and let me know. – Alexis Jun 24 '19 at 20:56
  • @R.Lima out of curiosity, why couldn't you use Henrik's solution? – Alexis Jun 26 '19 at 19:43
  • Beginner's mistake. I thought I could accept both answers, since both of them nicely solved the problem. I then re-accepted his because it was faster. – R. Lima Jun 27 '19 at 05:45