1

this question might have been asked previously but I'm looking for a data.table solution if possible without using other packages. Ive got a data.table DT1 as a reference:

> require(data.table)
> DT1 <- data.table(col1 = c("AA", "BA", "ABC", "ABC BC", "AB")
                  , col2 = c(1,4,5,3,2))
> DT1
     col1 col2
1:     AA    1
2:     BA    4
3:    ABC    5
4: ABC BC    3
5:     AB    2

and I would like to merge a second data.table DT2 with DT1 based on partial matching of col1 in DT1 and col2 in DT2, creating a col3 in DT2.

> DT2 <- data.table(col1 = c(0,5,2,7,1,0)
                  , col2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
> DT2
   col1 col2
1:    0   BA
2:    5  ABC
3:    2   DC
4:    7   AA
5:    1   AB
6:    0  R AB

desired output

 > desired_output <- data.table(col1 = c(0,5,5,2,7,1,1,1,0)
                                 , col2 = c("BA", "ABC", "ABC", "DC", "AA",  "AB", "AB", "AB", "R AB")
                                 , col3 = c(4,5,3,NA,1,5,3,2,2))
> desired_output
   col1 col2 col3
1:    0   BA    4
2:    5  ABC    5
3:    5  ABC    3
4:    2   DC   NA
5:    7   AA    1
6:    1   AB    5
7:    1   AB    3
8:    1   AB    2
9:    0  R AB   2

Is there any graceful way to do this using data.table operations? if not then happy to consider other solutions. this is going to be run on a very large dataset.


Edit: to specify the conditions of partial matching, it's a match if either string of col1 in DT1 is a subset of the string of col2 in DT2 or vice versa (the string of col2 in DT2 is a subset of the string of col1 in DT1). A two way grepl?

col1/DT1    col2/DT2
  "AB"       "There is ABhere"    # it's a match
  "ABC"      "someABC"            # it's a match
  "ABC BC"   "ABC"                # it's a reverse match
  "DR"       "ADD"                # no match
  "BA"       "HABAHA"             # two matches
Ankhnesmerira
  • 1,386
  • 15
  • 29
  • A data.table is a data.frame. So, why dont you use data.frame? https://stackoverflow.com/questions/18001120/what-is-the-practical-difference-between-data-frame-and-data-table-in-r since to make a data.table you need an extra package:https://stackoverflow.com/questions/28513319/data-table-error-could-not-find-function – Sal-laS Aug 21 '18 at 06:30
  • because it's a very large data set and dataframe merge operations are painfully slow – Ankhnesmerira Aug 21 '18 at 06:30
  • Is this relevant: https://stackoverflow.com/questions/21763599/imperfect-string-match-using-data-table-in-r? – chinsoon12 Aug 21 '18 at 08:16
  • hmm. it's relevant but it's not exactly the same issue. – Ankhnesmerira Aug 22 '18 at 02:25
  • would splitting by space in col1 of DT1 before joining with DT2 works with your dataset? – chinsoon12 Aug 23 '18 at 01:42
  • I'm not sure if I follow. I've edited the question to be specific. Basically I would like the merge happen if any of these two cases happens: either string of col2 in DT2 is a subset of the string of col1 in DT1, or vice versa, if the string of col1 in DT1 is a subset of the string of col2 in DT2. – Ankhnesmerira Aug 23 '18 at 01:59

1 Answers1

2

Given the dimension of the problem (DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55)]), it is probably infeasible to do a CJ of the IDs before doing a two-way grepl.

Breaking down the different kind of matches/approx. matches, we can 1) first look for exact matches, 2) then approx. matches where substring in DT1 can be found in DT2 and then, 3) vice versa.

Finally, we row bind all the results and do a left join between original DT2 and the row-binded results to get desired output.

exactMatches <- DT1[DT2, on=c("ID1"="ID2"), nomatch=0L][,
    ID2 := ID1]

substr1in2 <- DT2[, c(.SD, DT1[grepl(ID2, ID1) & ID1 != ID2]), 
    by=1:DT2[,.N]][!is.na(VAL1), -1L]

substr2in1 <- DT1[, c(.SD, DT2[grepl(ID1, ID2) & ID2 != ID1]), 
    by=1:DT1[,.N]][!is.na(VAL2), -1L]

binded <- rbindlist(list(exactMatches, substr1in2, substr2in1), 
    use.names=TRUE, fill=TRUE)

binded[DT2, on=.(ID2, VAL2)]

output:

       ID1 VAL1 VAL2  ID2
 1:     BA    4    0   BA
 2:    ABC    5    5  ABC
 3: ABC BC    3    5  ABC
 4:     AB    2    5  ABC
 5:   <NA>   NA    2   DC
 6:     AA    1    7   AA
 7:     AB    2    1   AB
 8:    ABC    5    1   AB
 9: ABC BC    3    1   AB
10:     AB    2    0 R AB

I changed some of the column names to make the code more readable. Data:

DT1 <- data.table(ID1 = c("AA", "BA", "ABC", "ABC BC", "AB"), 
    VAL1 = c(1,4,5,3,2))

DT2 <- data.table(VAL2 = c(0,5,2,7,1,0),
    ID2 = c("BA", "ABC", "DC", "AA", "AB", "R AB"))
chinsoon12
  • 25,005
  • 4
  • 25
  • 35
  • thanks @chinsoon12, this is already very good. the issue with the current solution though is that the grepl needs to be two ways, or at least the other way around, when grepl(ID, PATTERN) returns VAL1. Unfortunately, my initial example didn't show this case, but i edited the question for clarification (as can be seen in match cases) – Ankhnesmerira Aug 23 '18 at 03:33
  • I added one extra row to DT2 for the case of reverse partial matching – Ankhnesmerira Aug 23 '18 at 03:39
  • what are the dimensions of your DT1 and DT2? – chinsoon12 Aug 23 '18 at 04:36
  • DT1 [(1:50,000), (1:25)] - DT2[(1:50,000,000), (1:55) ] can chunk up DT2 to smaller sets if needs be – Ankhnesmerira Aug 23 '18 at 05:07