4

I have two data.tables. Each has a column called 'firstName' and another called 'lastName', which contain some values which will match each other and some that won't. Some values in both data sets might be duplicated.

I want to add a new column to the second data.table, in which I will store the indices of matches from the first data set for each element of 'firstName' within the second data set. I will then repeat the whole matching process with the 'lastName' column and get the intersect of index matches for 'firstName' and 'lastName'. I will then use the intersect of the indices to fetch the case ID (cid) from the first data set and append it to the second data set.

Because there might be more than one match per element, I will store them as lists within my data.table. I cannot use base::match function because it will only return the first match for each element, but I do need the answer to be vectorised in just the same way as the match function.

I've tried different combinations of which(d1$x %in% y) but this does not work either because it matches for all of y at once instead of one element at a time. I am using data.table because for my real-world use case, the data set to match on could be hundreds of thousands of records, so speed is important.

I have found a related question here, but I can't quite figure out how to efficiently convert this to data.table syntax.

Here is some example data:

# Load library
library(data.table)

# First data set (lookup table):
dt1 <- data.table(cid = c("c1", "c2", "c3", "c4", "c5"), 
                  firstName = c("Jim", "Joe", "Anne", "Jim", "Anne"), 
                  lastName = c("Gracea", "Ali", "Mcfee", "Dutto", "Crest"))

# Second data set (data to match with case IDs from lookup table):
dt2 <- data.table(lid = c(1, 2, 3, 4), 
                  firstName = c("Maria", "Jim", "Jack", "Anne"), 
                  lastName = c("Antonis", "Dutto", "Blogs", "Mcfee"),
                  result = c("pos", "neg", "neg", "pos"))

My desired output would look like this:

# Output:
> dt2
   lid firstName lastName result fn_match ln_match casematch caseid
1:   1     Maria  Antonis    pos       NA       NA        NA   <NA>
2:   2       Jim    Dutto    neg      1,4        4         4     c4
3:   3      Jack    Blogs    neg       NA       NA        NA   <NA>
4:   4      Anne    Mcfee    pos      3,5        3         3     c3

Henrik
  • 65,555
  • 14
  • 143
  • 159
Amy M
  • 967
  • 1
  • 9
  • 19
  • I noted that you expanded the question quite a bit in your last edit. I rolled it back :) If you have additional questions, it is much better post a new question (possibly refering to this, and using what you have already learned here). And thanks for creating such nice, small toy examples! Cheers – Henrik Sep 05 '22 at 22:19
  • I had just added some extra data to the example to make it a bit clearer what the end goal was, as some of the other responses would not have worked well when repeating for different columns so I thought it would be helpful to show that that was what I was trying to do. I didn't change the question itself as I don't need help with intersecting the matches from each column, I have already figured that out. – Amy M Sep 05 '22 at 22:34

5 Answers5

4

A possible solution:

dt1[,id:=seq_along(cid)]
dt1[dt2,.(lid,id,firstName = i.firstName),on=.(firstName)][
       ,.(casematch =.( id)),by=.(lid,firstName)]

     lid firstName casematch
   <num>    <char>    <list>
1:     1     Maria        NA
2:     2       Jim       1,4
3:     3      Jack        NA
4:     4      Anne       3,5
Waldi
  • 39,242
  • 6
  • 30
  • 78
4

We could use

library(data.table)
dt1[dt2, .(casematch = toString(cid), lid),on = .(firstName), by = .EACHI]

-output

  firstName casematch   lid
      <char>    <char> <num>
1:     Maria        NA     1
2:       Jim    c1, c4     2
3:      Jack        NA     3
4:      Anne    c3, c5     4

Or with row index

dt1[dt2, .(casematch = na_if(toString(.I), 0), lid),on = .(firstName), by = .EACHI]
   firstName casematch   lid
      <char>    <char> <num>
1:     Maria      <NA>     1
2:       Jim      1, 4     2
3:      Jack      <NA>     3
4:      Anne      3, 5     4
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I like the shorter syntax of this answer, but one thing to note is that the casematch column is better off referring to row indices, as this is not the end of the operation. I am then doing the same thing with other columns (lastName and birthDate) and getting the intersection of casematch for all three columns. At that point I do want to append the cid that matches for all three columns to dt2 (which then gets exported in full with the new columns reporting on the matching added to it). – Amy M Sep 05 '22 at 19:39
  • Thanks, this is almost there but how can I ensure that I am just appending one new column (casematch) to dt2 (as my real dt2 has many other columns and I don't want to have to mention each one explicitly)? – Amy M Sep 05 '22 at 19:57
2

Another data.table option

> dt1[, .(cid = toString(cid)), firstName][dt2, on = .(firstName)]
   firstName    cid lid
1:     Maria   <NA>   1
2:       Jim c1, c4   2
3:      Jack   <NA>   3
4:      Anne c3, c5   4
ThomasIsCoding
  • 96,636
  • 9
  • 24
  • 81
2

Using .EACHI and adding the resulting list column by reference.

dt2[ , res := dt1[ , i := .I][.SD, on = .(firstName), .(.(i)), by = .EACHI]$V1]
#    lid firstName res
# 1:   1     Maria  NA
# 2:   2       Jim 1,4
# 3:   3      Jack  NA
# 4:   4      Anne 3,5
Henrik
  • 65,555
  • 14
  • 143
  • 159
  • I've marked this as the correct answer, since rather than merging data sets it creates a new column with the matches in dt2, which is what I wanted. I can easily replicate this line for other column matches. Could you explain what `$V1` is referencing at the end (so I can understand a bit better how this works)? I think I understand how `i` is getting the indices from dt1. – Amy M Sep 05 '22 at 21:55
  • 1
    Thanks for the feedback @Amy M. About `V1`: if we don't provide column names for expressions in `j`, they will be named automatically as `V1`, `V2`, and so on. Try to run the code step by step: `dt1[ , i := .I]`; `dt1[dt2, on = .(firstName), .(.(i)), by = .EACHI] `; `dt1[dt2, on = .(firstName), .(.(i)), by = .EACHI]$V1`. Hope this helps! Cheers – Henrik Sep 05 '22 at 22:13
0

In my real life scenario, I need to retrieve the indices for matches on more than one column. I found a way to do this in one step by combining some of the other solutions and figured it would be useful to also share this and the explanation of how it works below.

The code below adds a new column caseid to dt2, which gets its values from the column cid in dt1 for the row indices that matched on both firstName and lastName.

Putting dt1 inside the square brackets and specifying on = .(...) is equivalent to merging dt1 with dt2 on firstName and lastName, but instead of merging all columns from both datasets, one new column called caseid is created.

The lower case i. prefix to cid indicates that cid is a column from the second data set (dt1).

The upper case .I inside the square brackets after i.cid will retrieve the row indices of dt1 that match dt2 on firstName and lastName.

# Get case IDs from dt1 for matches of firstName and lastName in one step:
dt2[dt1, caseid := i.cid[.I], on = .(firstName, lastName)]

# Output:
> dt2
   lid firstName lastName result caseid
1:   1     Maria  Antonis    pos   <NA>
2:   2       Jim    Dutto    neg     c4
3:   3      Jack    Blogs    neg   <NA>
4:   4      Anne    Mcfee    pos     c3

Amy M
  • 967
  • 1
  • 9
  • 19