1

I have learned to use the data.table R package almost exclusively for data manipulation and stringing together multiple data sets due to the size of the data I am working with (each average 3 million records, but up to 27 million records).

I need to check whether a product code, "op_ID" in the example below, is found in another data set. I have used the following method to add a Boolean value of whether a match is found in another data set: DT1[, found_in_DT2 := FALSE][DT2[,"op_ID"], found_in_DT2 := TRUE]. The issue I have now is that I need to also check if other formats are also found in the other data set. I have accomplished this by searching again and setting the Boolean to TRUE if found. However, it seems computationally wasteful to have the entire data set search the data set again, even if products have already been marked as TRUE.

Is there a way to subset the data set, DT1, conditionally before searching another data set, DT2? For example, I would want to something similar to: DT1[found_in_DT2 == FALSE, SEARCH_DT2_HERE]

Code example below:

library(data.table)

DT1 <- data.table(recordID = seq(from = 1, to = 10000), 
           op_ID = c(paste0("23Jul19AA012400020_", seq(from = 1, to = 5000)), paste0("23Jul19AA012400020 ", seq(from = 5001, to = 10000))),
           op_ID_format2 = c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 5000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 5001, to = 10000)))
           )
DT2 <- data.table(recordID = seq(from = 1, to = 7500), 
           op_ID = sample(c(paste0("23Jul19AA012400020_", seq(from = 1, to = 10000)), paste0("23Jul19AA012400020 ", seq(from = 1, to = 10000))), size = 750, replace = TRUE),
           op_ID_format2 = sample(c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 10000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 1, to = 10000))), size = 750, replace = TRUE)
)

# Add Boolean whether an instance of DT1 is found in DT2
  # Set Keys
    setkey(DT1, op_ID)
    setkey(DT2, op_ID)
  # Search
    DT1[, found_in_DT2 := FALSE][DT2[,"op_ID"], found_in_DT2 := TRUE]

Here is what the data looks like at this point:

head(DT1)

   recordID                   op_ID            op_ID_format2 found_in_DT2
1:     1000 23Jul19AA012400020 1000 23Jul19AA01ZZ01AV61 1000         TRUE
2:      501  23Jul19AA012400020 501  23Jul19AA01ZZ01AV61 501        FALSE
3:      502  23Jul19AA012400020 502  23Jul19AA01ZZ01AV61 502         TRUE
4:      503  23Jul19AA012400020 503  23Jul19AA01ZZ01AV61 503         TRUE
5:      504  23Jul19AA012400020 504  23Jul19AA01ZZ01AV61 504        FALSE
6:      505  23Jul19AA012400020 505  23Jul19AA01ZZ01AV61 505        FALSE

head(DT2)

   recordID                   op_ID           op_ID_format2
1:      622 23Jul19AA012400020 1000 23Jul19AA01ZZ01AV61 477
2:      174  23Jul19AA012400020 101 23Jul19AA01ZZ01AV61 481
3:       60  23Jul19AA012400020 102 23Jul19AA01ZZ01AV61 395
4:      484  23Jul19AA012400020 108 23Jul19AA01ZZ01AV61 356
5:      380  23Jul19AA012400020 109 23Jul19AA01ZZ01AV61 105
6:      747  23Jul19AA012400020 113 23Jul19AA01ZZ01AV61 845
# Account for either a '_' or ' ' as the separator
  # I understand that I can do this in place of the first search, but I would rather...
  # ... have a 2nd step if I can search a subset to avoid...
  # ... trimming every "op_ID" in the dataset.
  DT1[, op_ID_trimmed := gsub(pattern = " ", replacement = "", gsub(pattern = "_", replacement = "", op_ID))]
  DT2[, op_ID_trimmed := gsub(pattern = " ", replacement = "", gsub(pattern = "_", replacement = "", op_ID))]
  # Set Keys
    setkey(DT1, op_ID_trimmed)
    setkey(DT2, op_ID_trimmed)
  # Search
    DT1[DT2[,"op_ID_trimmed"], found_in_DT2 := TRUE]
# Try the second format:
    # Set Keys
      setkey(DT1, op_ID_format2)
      setkey(DT2, op_ID_format2)
    # Search
      DT1[DT2[,"op_ID_format2"], found_in_DT2 := TRUE]
      
# Here is what I wish I could do:
# Something like:
  # DT1[found_in_DT2 == FALSE, DT2[,"op_ID_format2"], found_in_DT2 := TRUE]
      # Did not work: - note I have tried many other examples in my code, but this sums it up
      #   DT1[found_in_DT2 == FALSE, found_in_DT2 := DT1[DT2[,"op_ID_format2"], on = .(op_ID_format2 = op_ID_format2)]]

I have searched stackoverflow, but could not find a post which has a similar scenario. The solution in the post linked below is the closest I found, but I could not get their solution to work with my scenario. If this method is suggested, I will attempt this method again. conditional data.table match for subset of data.table

[EDIT] Using 'Vasily's suggestion, I tried the following, but actually found using a subset to search with this syntax is slower. I also upped the size of the datasets by 10x so performance differences will be more evident.

 benchmark(
          "full_search" = {
            full <- DT1[DT2[,"op_ID_format2"], found_in_DT2 := TRUE]
          },"sub_search" = {
            subset <- DT1[found_in_DT2 == FALSE, found_in_DT2:=(op_ID_format2 %in% DT2$op_ID_format2)]
          },"TEST_search" = {
            subset <- DT1[,found_in_DT2:=(op_ID_format2 %in% DT2$op_ID_format2)]
          }, replications = 10000
        )

Results:

         test replications elapsed relative user.self sys.self
1 full_search        10000   10.55    1.667      9.97     0.83
2  sub_search        10000   30.93    4.886     29.03     4.67
3 TEST_search        10000    6.33    1.000      6.31     0.00

Why is it slower to search using a conditional subset of the data set? Is it because data.table must find the conditional subset first?

I expanded the sample data set by 100x which is more align with production data sets (see below)

DT1 <- data.table(recordID = seq(from = 1, to = 1000000), 
           op_ID = c(paste0("23Jul19AA012400020_", seq(from = 1, to = 5000)), paste0("23Jul19AA012400020 ", seq(from = 500001, to = 1000000))),
           op_ID_format2 = c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 500000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 500001, to = 1000000)))
           )
DT2 <- data.table(recordID = seq(from = 1, to = 7500000), 
           op_ID = sample(c(paste0("23Jul19AA012400020_", seq(from = 1, to = 1000000)), paste0("23Jul19AA012400020 ", seq(from = 1, to = 100000))), size = 7500000, replace = TRUE),
           op_ID_format2 = sample(c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 1000000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 1, to = 100000))), size = 7500000, replace = TRUE)
)

And had the following results using Rbenchmark:

         test replications elapsed relative user.self sys.self
1 full_search        10000 2627.45    1.000   2430.45   324.76
2  sub_search        10000 3312.79    1.261   3032.13   378.68
3 TEST_search        10000 3000.94    1.142   2715.04   284.45

Searching using the subset is still much slower.

JGrant06
  • 53
  • 4
  • can you explain a bit more what exactly you want to achieve in your last piece? Also, why can't you just use something like `DT1[, found_in_DT2:=(op_ID_format2 %in% DT2$op_ID_format2)]`? – Vasily A Nov 25 '20 at 01:31
  • @VasilyA Thank you for responding. I want to use a subset of DT1 (where found_in_DT2 == FALSE) to search DT2, because I believe it will be a lot faster than using the entire DT1 since I am using 4 million record data sets in production. Does that clarify? I was actually able to search with a subset using your syntax `DT1[found_in_DT2 == FALSE, found_in_DT2:=(op_ID_format2 %in% DT2$op_ID_format2)]`, but it seems to be a lot slower with rbenchmark (10000) iterations - I will add the benchmark parameters to the end of the original post. – JGrant06 Nov 25 '20 at 02:49
  • ah so you want that piece to _replace_ the `DT1[DT2[,"op_ID_format2"], found_in_DT2 := TRUE]` expression? is that what you meant? – Vasily A Nov 25 '20 at 05:53
  • @VasilyA Yep! I want to replace that statement with one which uses a subset of DT1 to search DT2. Although, it seems that this method is slower using the syntax you provided in your first comment - see the benchmarking I ran in the edit above. – JGrant06 Nov 25 '20 at 12:34
  • unfortunately I don't have other suggestions, we need some of `data.table` gurus here. – Vasily A Nov 25 '20 at 18:31
  • P.S. for your `op_ID_trimmed` - maybe using one `gsub()` could be faster? try this: `DT1[, op_ID_trimmed1 := gsub("[ _]", "", op_ID)]`. Alternatively, `fixed` option is sometimes faster: `DT1[, op_ID_trimmed2 := gsub(" ", "", gsub("_", "", op_ID, fixed=T), fixed=T)]` – Vasily A Nov 25 '20 at 18:36

0 Answers0