
I'm new to data.table and seem to be missing something obvious. I have a table:

DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
setkey(DT, A, B)

Now I want to find all rows where either A or B is "y" (using binary search; my actual tables are larger and the operation has to be performed millions of times). I couldn't figure out how to do this in one statement, since

DT[.("y", "y"), nomatch=0]

gives me only rows where A == "y" & B == "y" (but I want A == "y" | B == "y"). What I'm doing now is:

uA <- unique(DT[, A])
rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])

But I feel like there must be a more intuitive way.

Thanks for your help!


Benchmark

Including code adapted from @Frank's comment on this question (the Union() function below), which subsets by the union of row indices instead of a conjunction

n = 1e6
DT = data.table(A = sample(letters, n, replace = TRUE), 
                B = sample(letters, n, replace = TRUE), value = 1:n)
setkey(DT, A, B)
uA <- unique(DT[, A])

library(microbenchmark)
Union = function() {
    mya = DT[A == "y", which = TRUE]  # row numbers where A == "y"
    myb = DT[B == "y", which = TRUE]  # row numbers where B == "y"
    DT[union(mya, myb)]               # subset once by the union of both index sets
}
microbenchmark(
    "reduce" = DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]],
    "rbind" = rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0]),
    "union" = Union()
)

Unit: milliseconds
   expr      min        lq      mean    median        uq       max neval
 reduce 9.922728 10.116613 11.422823 10.226871 11.803204 25.453557   100
  rbind 2.596139  2.734751  2.916620  2.850199  3.113995  3.453326   100
  union 5.393815  5.725917  6.221544  5.906222  6.758622 14.019206   100

1 Answer


We could use Reduce with | to get a logical vector that checks whether either of the columns specified in .SDcols has the value 'y', and use it for subsetting the rows

DT[DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
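To see what the inner expression builds, here it is run on the toy table from the question (a small illustration, not part of the original answer): lapply(.SD, '==', 'y') returns one logical vector per column, and Reduce('|', ...) ORs them element-wise into a single row mask.

DT = data.table(A = c("x","y","y","z"), B = c("y","x","x","z"), value = 1:4)
DT[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]
# [1]  TRUE  TRUE  TRUE FALSE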

Benchmarks

set.seed(24)
DT = data.table(A = sample(letters, 1e7, replace = TRUE), 
                B = sample(letters, 1e7, replace = TRUE), value = 1:1e7)

DT1 <- copy(DT)
system.time({
    setkey(DT, A, B)
    uA <- unique(DT[, A])
    rbind(DT[.(uA, "y"), nomatch=0], DT[.("y"), nomatch=0])
})
# user  system elapsed 
#  1.14    0.19    0.87 

system.time({
    DT1[DT1[, Reduce('|', lapply(.SD, '==', 'y')), .SDcols = A:B]]
})
#  user  system elapsed 
#  0.17    0.02    0.19 
  • I just checked it again and this solution is actually only faster for small tables. For larger tables it scales poorly (does it maybe use a vector scan?). See the edit to my question – Fridolin Linder Dec 10 '17 at 19:29
  • @FridolinLinder You are not including the `setkey` step in the benchmark. Also, try checking with more columns, i.e. 15 or so – akrun Dec 10 '17 at 19:30
  • @FridolinLinder I added the system.time, which also takes the `setkey` into account. I am not using microbenchmark because once the setkey is done, the repeated actions will be faster. Also, consider that if there are more columns, the method with `rbind` could become tedious – akrun Dec 10 '17 at 19:37
  • Does that mean that your solution doesn't rely on `setkey()`? (Sorry, I don't completely understand the details.) In that case, I would lose the benefit of ordering the data once and then getting the speedup from the binary search. That's specific to my application, but on the other hand, data.table's binary search never gives you a speedup if the ordering step has to be repeated on each subsetting operation. – Fridolin Linder Dec 10 '17 at 19:39
  • @FridolinLinder No, if you check my solution, I am using a copy of 'DT' and it doesn't have any `key`, i.e. `key(DT) # [1] "A" "B"; key(DT1) # NULL`. You definitely get a speed-up if the setkey timings are not considered. – akrun Dec 10 '17 at 19:41
  • Yeah, that's what I meant. If it doesn't have a key, it does a (linear) vector scan every time I repeat the subsetting operation (which is O(n)), whereas if a key is set once (time consuming), every following operation is O(log n). I should probably have emphasized more in my question that the subsetting operation is repeated often (millions of times), so it's worth spending some time once to order the data (`setkey()`). I appreciate your answer though! – Fridolin Linder Dec 10 '17 at 19:44
  • @FridolinLinder Okay, I got it. But suppose you have 5 or 6 columns to do this with; then I am not sure how your `rbind` task would be done, and it too could involve multiple steps based on your code – akrun Dec 10 '17 at 19:46
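Following up on the thread above: if the subset really is repeated millions of times, one option (a minimal sketch, not from the original thread) is to keep the key on A and add a secondary index on B with data.table's setindex(), so both lookups use binary search and only the union of the matching row numbers is materialized. The helper name or_subset is hypothetical:

library(data.table)
setkey(DT, A, B)   # sort once; keyed binary search on A (and on A,B)
setindex(DT, B)    # secondary index on B; no re-sorting of the table

# Hypothetical helper: all rows where A == val OR B == val
or_subset <- function(DT, val) {
    ia <- DT[.(val), which = TRUE, nomatch = 0]            # lookup via key on A
    ib <- DT[.(val), on = "B", which = TRUE, nomatch = 0]  # lookup via index on B
    DT[union(ia, ib)]
}

or_subset(DT, "y")

Both the key and the index are built once; every subsequent or_subset() call only does two binary searches plus one union, which also generalizes to more columns by adding one setindex() and one lookup per extra column.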