I have learned to use the data.table R package almost exclusively for data manipulation and for joining multiple data sets together, because of the size of the data I work with (data sets average about 3 million records each, but range up to 27 million).
I need to check whether a product code ("op_ID" in the example below) is found in another data set. I have used the following method to add a Boolean column indicating whether a match is found in the other data set:
DT1[, found_in_DT2 := FALSE][DT2[,"op_ID"], found_in_DT2 := TRUE]
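(As I understand it, the same update-join can also be written without setting keys, by passing on= explicitly; a minimal sketch with the column names from my example below:)
DT1[, found_in_DT2 := FALSE]
DT1[DT2, on = "op_ID", found_in_DT2 := TRUE]  # rows of DT1 whose op_ID appears in DT2 are flagged TRUE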
The issue I have now is that I also need to check whether other formats of the code are found in the other data set. I have accomplished this by searching again and setting the Boolean to TRUE where a match is found. However, it seems computationally wasteful to search the entire data set against the other data set a second time, even for products that have already been marked TRUE.
Is there a way to subset the data set, DT1, conditionally before searching the other data set, DT2? For example, I would want to do something similar to: DT1[found_in_DT2 == FALSE, SEARCH_DT2_HERE]
Code example below:
library(data.table)
DT1 <- data.table(recordID = seq(from = 1, to = 10000),
op_ID = c(paste0("23Jul19AA012400020_", seq(from = 1, to = 5000)), paste0("23Jul19AA012400020 ", seq(from = 5001, to = 10000))),
op_ID_format2 = c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 5000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 5001, to = 10000)))
)
DT2 <- data.table(recordID = seq(from = 1, to = 7500),
op_ID = sample(c(paste0("23Jul19AA012400020_", seq(from = 1, to = 10000)), paste0("23Jul19AA012400020 ", seq(from = 1, to = 10000))), size = 750, replace = TRUE),
op_ID_format2 = sample(c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 10000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 1, to = 10000))), size = 750, replace = TRUE)
)
# Add Boolean whether an instance of DT1 is found in DT2
# Set Keys
setkey(DT1, op_ID)
setkey(DT2, op_ID)
# Search
DT1[, found_in_DT2 := FALSE][DT2[,"op_ID"], found_in_DT2 := TRUE]
Here is what the data looks like at this point:
head(DT1)
recordID op_ID op_ID_format2 found_in_DT2
1: 1000 23Jul19AA012400020 1000 23Jul19AA01ZZ01AV61 1000 TRUE
2: 501 23Jul19AA012400020 501 23Jul19AA01ZZ01AV61 501 FALSE
3: 502 23Jul19AA012400020 502 23Jul19AA01ZZ01AV61 502 TRUE
4: 503 23Jul19AA012400020 503 23Jul19AA01ZZ01AV61 503 TRUE
5: 504 23Jul19AA012400020 504 23Jul19AA01ZZ01AV61 504 FALSE
6: 505 23Jul19AA012400020 505 23Jul19AA01ZZ01AV61 505 FALSE
head(DT2)
recordID op_ID op_ID_format2
1: 622 23Jul19AA012400020 1000 23Jul19AA01ZZ01AV61 477
2: 174 23Jul19AA012400020 101 23Jul19AA01ZZ01AV61 481
3: 60 23Jul19AA012400020 102 23Jul19AA01ZZ01AV61 395
4: 484 23Jul19AA012400020 108 23Jul19AA01ZZ01AV61 356
5: 380 23Jul19AA012400020 109 23Jul19AA01ZZ01AV61 105
6: 747 23Jul19AA012400020 113 23Jul19AA01ZZ01AV61 845
# Account for either a '_' or ' ' as the separator
# I understand that I can do this in place of the first search, but I would rather...
# ... have a 2nd step if I can search a subset to avoid...
# ... trimming every "op_ID" in the dataset.
DT1[, op_ID_trimmed := gsub(pattern = " ", replacement = "", gsub(pattern = "_", replacement = "", op_ID))]
DT2[, op_ID_trimmed := gsub(pattern = " ", replacement = "", gsub(pattern = "_", replacement = "", op_ID))]
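# (Side note: I think the two nested gsub() calls could be collapsed into one
# call with a character class; a small sketch of what I mean:)
# DT1[, op_ID_trimmed := gsub("[_ ]", "", op_ID)]  # removes both '_' and ' ' in one pass
# DT2[, op_ID_trimmed := gsub("[_ ]", "", op_ID)]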
# Set Keys
setkey(DT1, op_ID_trimmed)
setkey(DT2, op_ID_trimmed)
# Search
DT1[DT2[,"op_ID_trimmed"], found_in_DT2 := TRUE]
# Try the second format:
# Set Keys
setkey(DT1, op_ID_format2)
setkey(DT2, op_ID_format2)
# Search
DT1[DT2[,"op_ID_format2"], found_in_DT2 := TRUE]
# Here is what I wish I could do:
# Something like:
# DT1[found_in_DT2 == FALSE, DT2[,"op_ID_format2"], found_in_DT2 := TRUE]
# Did not work (note: I have tried many other variants in my code, but this one sums it up):
# DT1[found_in_DT2 == FALSE, found_in_DT2 := DT1[DT2[,"op_ID_format2"], on = .(op_ID_format2 = op_ID_format2)]]
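One pattern I believe should at least run (just a sketch, assuming op_ID_format2 is unique within DT1; I have not verified that it is any faster) is to take the row numbers of the not-yet-matched rows first, join only those rows to DT2, and then flag the ones that matched:
# row numbers of DT1 rows not yet matched
idx <- DT1[found_in_DT2 == FALSE, which = TRUE]
# of those rows, which ones have an op_ID_format2 present in DT2?
hits <- DT1[idx][DT2[, "op_ID_format2"], on = "op_ID_format2", nomatch = 0L, which = TRUE]
# flag the matching original rows by reference
DT1[idx[hits], found_in_DT2 := TRUE]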
I have searched Stack Overflow but could not find a post with a similar scenario. The closest I found is the post linked below, but I could not get its solution to work for my scenario. If this method is suggested, I will attempt it again: conditional data.table match for subset of data.table
[EDIT] Using Vasily's suggestion, I tried the following, but found that searching with a subset using this syntax is actually slower. I also increased the size of the data sets by 10x so that performance differences would be more evident.
library(rbenchmark)

benchmark(
  "full_search" = {
    full <- DT1[DT2[, "op_ID_format2"], found_in_DT2 := TRUE]
  },
  "sub_search" = {
    subset <- DT1[found_in_DT2 == FALSE, found_in_DT2 := (op_ID_format2 %in% DT2$op_ID_format2)]
  },
  "TEST_search" = {
    subset <- DT1[, found_in_DT2 := (op_ID_format2 %in% DT2$op_ID_format2)]
  },
  replications = 10000
)
Results:
test replications elapsed relative user.self sys.self
1 full_search 10000 10.55 1.667 9.97 0.83
2 sub_search 10000 30.93 4.886 29.03 4.67
3 TEST_search 10000 6.33 1.000 6.31 0.00
Why is it slower to search using a conditional subset of the data set? Is it because data.table must find the conditional subset first?
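To check whether the subset step itself is the overhead, I suppose the cost of evaluating found_in_DT2 == FALSE could be benchmarked separately from the actual lookup, something like the sketch below (same rbenchmark setup; which = TRUE just returns the matching row numbers, and the "subset_only"/"lookup_only" labels are only illustrative):
benchmark(
  "subset_only" = {
    idx <- DT1[found_in_DT2 == FALSE, which = TRUE]   # cost of scanning the logical condition alone
  },
  "lookup_only" = {
    m <- DT1$op_ID_format2 %in% DT2$op_ID_format2     # cost of the vector-scan match alone
  },
  replications = 10000
)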
I then expanded the sample data set by 100x, which is more in line with my production data sets (see below):
DT1 <- data.table(recordID = seq(from = 1, to = 1000000),
op_ID = c(paste0("23Jul19AA012400020_", seq(from = 1, to = 5000)), paste0("23Jul19AA012400020 ", seq(from = 500001, to = 1000000))),
op_ID_format2 = c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 500000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 500001, to = 1000000)))
)
DT2 <- data.table(recordID = seq(from = 1, to = 7500000),
op_ID = sample(c(paste0("23Jul19AA012400020_", seq(from = 1, to = 1000000)), paste0("23Jul19AA012400020 ", seq(from = 1, to = 100000))), size = 7500000, replace = TRUE),
op_ID_format2 = sample(c(paste0("23Jul19AA01ZZ01AV61_", seq(from = 1, to = 1000000)), paste0("23Jul19AA01ZZ01AV61 ", seq(from = 1, to = 100000))), size = 7500000, replace = TRUE)
)
And got the following results using rbenchmark:
test replications elapsed relative user.self sys.self
1 full_search 10000 2627.45 1.000 2430.45 324.76
2 sub_search 10000 3312.79 1.261 3032.13 378.68
3 TEST_search 10000 3000.94 1.142 2715.04 284.45
Searching using the subset is still slower (about 26% slower than the full search).