I'm new to data.table
and seem to be missing something obvious. I have a table:
DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
setkey(DT, A, B)
Now I want to find all rows where either A
or B
is "y"
(using binary search, my actual tables are larger and the operation has to be performed millions of times). I couldn't figure out how to do this in one statement, since,
DT[.("y", "y"), nomatch=0]
Gives me only rows where (A & B) == "y"
(but I want (A | B) == "y"
). What I'm doing now is:
uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
But I feel like there must be a more intuitive way.
Thanks for you help!
Benchmark
Including code adapted from @Frank's comment on Binary search DT with key on two columns using alternative (OR) instead of a conjunction
n = 1e6
DT = data.table(A = sample(letters, n, replace = TRUE),
B = sample(letters, n, replace = TRUE), value = 1:n)
setkey(DT, A, B)
uA <- unique(DT[, A])
library(microbenchmark)
Union = function(){
mya = DT[A=="y", which=TRUE]
myb = DT[B=="y", which=TRUE]
DT[union(mya,myb)]
}
microbenchmark(
"reduce" = DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]],
"rbind" = rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0]),
"union" = Union()
)
Unit: milliseconds
expr min lq mean median uq max neval
reduce 9.922728 10.116613 11.422823 10.226871 11.803204 25.453557 100
rbind 2.596139 2.734751 2.916620 2.850199 3.113995 3.453326 100
union 5.393815 5.725917 6.221544 5.906222 6.758622 14.019206 100