Expanding on the comments.
From Arun's answer to What is the purpose of setting a key in data.table?, which Frank linked in the comments:

> Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between a keyed and ad-hoc joins.

and

> It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.
Hence, whether keying gives the quick and efficient data.table-style solution the OP asks for really depends on the dimensions of the problem, i.e. the size of the dataset and the number of searches that will be performed.
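As a rough way to think about the trade-off: the one-off sort pays for itself once the accumulated per-lookup savings exceed its cost. A minimal sketch (the costs below are hypothetical placeholders, not measurements):

```r
# hypothetical break-even: setkey() pays off once the number of lookups n
# exceeds t_setkey / (t_adhoc - t_keyed); all times in the same unit
breakeven_n <- function(t_setkey, t_adhoc, t_keyed) {
  t_setkey / (t_adhoc - t_keyed)
}
breakeven_n(t_setkey = 2000, t_adhoc = 150, t_keyed = 1)  # ~13.4 lookups
```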
Here are some timings for the case where both are large:
data:
library(data.table)
set.seed(0L)
M <- 1e7
dtKeyed <- data.table(x=1:M, y=2:(M+1)) #R-3.4.4 data.table_1.10.4-3 win-x64
dtNoKey <- copy(dtKeyed)
system.time(setkey(dtKeyed, x, y)) # not free: sorts the whole table in place
dtKeyed
nsearches <- 1e3
points <- apply(matrix(sample(M, nsearches*2, replace=TRUE), ncol=2), 1, as.list)
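For clarity, each element of points is an unnamed two-element list (one row of the sampled matrix), which data.table matches positionally against the on= columns. A quick check, reusing the setup above:

```r
# each element of 'points' is one row of the sampled matrix coerced to an
# unnamed list; with on=c("x","y") its two values match the columns by position
set.seed(0L)
M <- 1e7
nsearches <- 1e3
points <- apply(matrix(sample(M, nsearches*2, replace=TRUE), ncol=2), 1, as.list)
length(points)     # 1000 search points
str(points[[1]])   # List of 2 integers: the (x, y) pair to look up
```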
variations:
# ad hoc join on the unkeyed table (no key, no index)
findPtNoKey <- function() {
  lapply(points, function(p) dtNoKey[p, on=names(dtNoKey), .N > 0, nomatch=0])
}
# ad hoc on= join against the keyed table
findPtOnKey <- function() {
  lapply(points, function(p) dtKeyed[p, on=names(dtKeyed), .N > 0, nomatch=0])
}
# join using the existing key (no on= needed)
findPtKeyed <- function() {
  lapply(points, function(p) dtKeyed[p, .N > 0, nomatch=0])
}
library(microbenchmark)
microbenchmark(findPtKeyed(), findPtOnKey(), findPtNoKey(), times=3L)
timings:
# remember to add the one-off setkey() timing back into findPtKeyed()'s cost
Unit: milliseconds
          expr         min          lq        mean      median          uq         max neval
 findPtKeyed()    924.6846    928.3025    946.0892    931.9205    956.7914    981.6624     3
 findPtOnKey()   1119.9686   1129.5641   1143.4505   1139.1597   1155.1915   1171.2233     3
 findPtNoKey() 146186.2216 154934.5463 161016.1277 163682.8709 168431.0807 173179.2905     3
That is roughly 1 ms per lookup with a key (or an on= join on the keyed table) versus roughly 160 ms per lookup without one, i.e. over a hundred times slower.
accuracy checks:
ref <- findPtNoKey()
identical(findPtKeyed(), ref)
#[1] TRUE
identical(findPtOnKey(), ref)
#[1] TRUE