
I'm new to data.table and seem to be missing something obvious. I have a table:

DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
setkey(DT, A, B)

Now I want to find all rows where either A or B is "y" (using binary search; my actual tables are larger and the operation has to be performed millions of times). I couldn't figure out how to do this in one statement, since

DT[.("y", "y"), nomatch=0]

gives me only rows where A == "y" & B == "y" (but I want A == "y" | B == "y"). What I'm doing now is:

uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])

But I feel like there must be a more intuitive way.

Thanks for your help!


Benchmark

Including code adapted from @Frank's comment on this question (the Union() function below), which subsets by the union of row indices instead of a conjunction

n = 1e6
DT = data.table(A = sample(letters, n, replace = TRUE), 
                B = sample(letters, n, replace = TRUE), value = 1:n)
setkey(DT, A, B)
uA <- unique(DT[, A])

library(microbenchmark)
Union = function() {
    mya = DT[A == "y", which = TRUE]  # row numbers where A == "y"
    myb = DT[B == "y", which = TRUE]  # row numbers where B == "y"
    DT[union(mya, myb)]               # subset once by the union of both index sets
}
microbenchmark(
    "reduce" = DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]],
    "rbind" = rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0]),
    "union" = Union()
)

Unit: milliseconds
   expr      min        lq      mean    median        uq       max neval
 reduce 9.922728 10.116613 11.422823 10.226871 11.803204 25.453557   100
  rbind 2.596139  2.734751  2.916620  2.850199  3.113995  3.453326   100
  union 5.393815  5.725917  6.221544  5.906222  6.758622 14.019206   100

1 Answer


We could use Reduce with | to get a logical vector that checks whether either of the columns specified in .SDcols has the value 'y', and use it for subsetting the rows

DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
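To see what the inner expression builds, here it is run on the toy table from the question (a small illustration, not part of the original answer): lapply(.SD, '==', 'y') returns one logical vector per column, and Reduce('|', ...) ORs them element-wise into a single row mask.

DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]
# [1]  TRUE  TRUE  TRUE FALSE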

Benchmarks

set.seed(24)
DT = data.table(A = sample(letters, 1e7, replace = TRUE), 
                B = sample(letters, 1e7, replace = TRUE), value = 1:1e7)

DT1 <- copy(DT)
system.time({
    setkey(DT, A, B)
    uA <- unique(DT[, A])
    rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
})
# user  system elapsed 
#  1.14    0.19    0.87 

system.time({
    DT1[DT1[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
})
#  user  system elapsed 
#  0.17    0.02    0.19 
  • I just checked it again and this solution is actually only faster for small tables. For larger tables it scales poorly (does it maybe use a vector scan?). See the edit to my question – Fridolin Linder Dec 10 '17 at 19:29
  • @FridolinLinder You are not including the `setkey` step in the benchmark. Also, try checking with more columns, i.e. 15 or so – akrun Dec 10 '17 at 19:30
  • @FridolinLinder I added the system.time, which also takes the `setkey` into account. I am not using microbenchmark because once the setkey is done, the repeated actions will be faster. Also, consider that if there are more columns, the method with `rbind` could become tedious – akrun Dec 10 '17 at 19:37
  • Does that mean that your solution doesn't rely on `setkey()`? (Sorry, I don't completely understand the details.) In that case, I would lose the benefit of ordering the data once and then getting the speedup from the binary search. That's specific to my application, but on the other hand, data.table's binary search never gives you a speedup if the ordering step has to be repeated on each subsetting operation. – Fridolin Linder Dec 10 '17 at 19:39
  • @FridolinLinder No, if you check my solution, I am using a copy of 'DT' and it doesn't have any `key`, i.e. `key(DT) # [1] "A" "B"; key(DT1) # NULL`. You definitely get a speed-up if the setkey timings are not considered. – akrun Dec 10 '17 at 19:41
  • Yeah, that's what I meant. If it doesn't have a key, it does a (linear) vector scan every time I repeat the subsetting operation (which is O(n)), whereas if a key is set once (time consuming), every following operation is O(log n). I should probably have emphasized more in my question that the subsetting operation is repeated often (millions of times), so it's worth spending some time once to order the data (`setkey()`). I appreciate your answer though! – Fridolin Linder Dec 10 '17 at 19:44
  • @FridolinLinder Okay, I got it. But suppose you have 5 or 6 columns to do this with; then I am not sure how your `rbind` task would be done, and it too could involve multiple steps based on your code – akrun Dec 10 '17 at 19:46
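Following up on the thread above: if the subset really is repeated millions of times, one option (a minimal sketch, not from the original thread) is to keep the key on A and add a secondary index on B with data.table's setindex(), so both lookups use binary search and only the union of the matching row numbers is materialized. The helper name or_subset is hypothetical:

library(data.table)
setkey(DT, A, B)   # sort once; keyed binary search on A (and on A,B)
setindex(DT, B)    # secondary index on B; no re-sorting of the table

# Hypothetical helper: all rows where A == val OR B == val
or_subset <- function(DT, val) {
    ia <- DT[.(val), which = TRUE, nomatch = 0]            # lookup via key on A
    ib <- DT[.(val), on = "B", which = TRUE, nomatch = 0]  # lookup via index on B
    DT[union(ia, ib)]
}

or_subset(DT, "y")

Both the key and the index are built once; every subsequent or_subset() call only does two binary searches plus one union, which also generalizes to more columns by adding one setindex() and one lookup per extra column.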