
Does anyone know how I can quickly and efficiently check whether a point P(x,y) exists in a two-column data.table? Example code:

dt <- data.table(x=c(1,2,3,4,5), y = c(2,3,4,5,6))
P <- c(2,3)

My desired output is TRUE (because the second row of dt contains my point P). I tried

 P %in% dt

but it worked only with the first row. I also tried loops, without much hope. I am looking for an efficient, data.table-style solution.

Adamek

1 Answer


Expanding on the comments.

From Arun's post in "What is the purpose of setting a key in data.table?", which Frank linked in a comment:

  1. Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between a keyed and ad-hoc joins.

and

It is therefore essential to figure out if the time spent on reordering the entire data.table is worth the time to do a cache-efficient join/aggregation. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

Hence, what counts as a quick and efficient data.table-style solution for the OP really depends on the scale of the problem, i.e. the size of the dataset and the number of searches that will be performed.
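For the small table in the question, either kind of join answers the lookup directly. The following is a minimal sketch, not a benchmark: it uses the OP's `dt` and `P`, and the names `adhoc`, `dtk`, `keyed` and `absent` are introduced here purely for illustration.

```r
library(data.table)

# Data and point from the question
dt <- data.table(x = c(1, 2, 3, 4, 5), y = c(2, 3, 4, 5, 6))
P <- c(2, 3)

# Ad-hoc join: no key needed, the join columns are named via `on`.
# nomatch = 0 drops non-matching rows, so .N > 0 is TRUE only on a hit.
adhoc <- dt[.(x = P[1], y = P[2]), on = c("x", "y"), .N > 0, nomatch = 0]

# Keyed join: sort a copy once with setkey, then join against the key.
dtk <- copy(dt)
setkey(dtk, x, y)
keyed <- dtk[.(P[1], P[2]), .N > 0, nomatch = 0]

# A point that is not in the table (x = 2 exists, but not paired with y = 5)
absent <- dt[.(x = 2, y = 5), on = c("x", "y"), .N > 0, nomatch = 0]

print(c(adhoc = adhoc, keyed = keyed, absent = absent))
```

On five rows the two routes are indistinguishable in speed; the difference only matters at the scale timed in the rest of this answer.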

Here are some timings for the case where both the dataset and the number of searches are large:

data:

library(data.table)
set.seed(0L)
M <- 1e7
dtKeyed <- data.table(x=1:M, y=2:(M+1)) #R-3.4.4 data.table_1.10.4-3 win-x64
dtNoKey <- copy(dtKeyed)
system.time(setkey(dtKeyed, x, y)) #not free
dtKeyed

nsearches <- 1e3
points <- apply(matrix(sample(M, nsearches*2, replace=TRUE), ncol=2), 1, as.list)

variations:

findPtNoKey <- function() {
    lapply(points, function(p) dtNoKey[p, on=names(dtNoKey), .N > 0, nomatch=0])
}

findPtOnKey <- function() {
    lapply(points, function(p) dtKeyed[p, on=names(dtKeyed), .N > 0, nomatch=0])
}

findPtKeyed <- function() {
    lapply(points, function(p) dtKeyed[p, .N > 0, nomatch=0])
}

library(microbenchmark)
microbenchmark(findPtKeyed(), findPtOnKey(), findPtNoKey(), times=3L)

timings:

# Remember to add the time taken by setkey back to the timing for findPtKeyed

Unit: milliseconds
          expr         min          lq        mean      median          uq         max neval
 findPtKeyed()    924.6846    928.3025    946.0892    931.9205    956.7914    981.6624     3
 findPtOnKey()   1119.9686   1129.5641   1143.4505   1139.1597   1155.1915   1171.2233     3
 findPtNoKey() 146186.2216 154934.5463 161016.1277 163682.8709 168431.0807 173179.2905     3
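
One way to account for the setkey cost mentioned in the note above the table is to time setkey and the lookups together. A rough sketch, assuming a much smaller table (1e5 rows) and 50 made-up query points so that it runs quickly; `pts`, `tKeyed`, `tNoKey`, `resKeyed` and `resNoKey` are names introduced here, and the timings it prints will not match the table above.

```r
library(data.table)

set.seed(0L)
M <- 1e5L   # assumption: far smaller than the 1e7 rows used above
dtNoKey <- data.table(x = 1:M, y = 2:(M + 1L))
dtKeyed <- copy(dtNoKey)

# 50 hypothetical query points, one per row
pts <- data.table(x = sample(M, 50L, replace = TRUE),
                  y = sample(M, 50L, replace = TRUE))

# Keyed route: time setkey together with the sequential lookups
tKeyed <- system.time({
  setkey(dtKeyed, x, y)
  resKeyed <- vapply(seq_len(nrow(pts)),
                     function(k) dtKeyed[pts[k], .N > 0, nomatch = 0],
                     logical(1))
})

# Ad-hoc route: the same lookups via `on`, with no key set
tNoKey <- system.time({
  resNoKey <- vapply(seq_len(nrow(pts)),
                     function(k) dtNoKey[pts[k], on = c("x", "y"), .N > 0, nomatch = 0],
                     logical(1))
})

print(rbind(keyed_incl_setkey = tKeyed[["elapsed"]],
            adhoc             = tNoKey[["elapsed"]]))
```

Note that the table here is already sorted by x and y, so setkey has little reordering to do; on unsorted data it would cost more.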

accuracy checks:

ref <- findPtNoKey()

identical(findPtKeyed(), ref)
#[1] TRUE

identical(findPtOnKey(), ref)
#[1] TRUE
chinsoon12
  • Maybe worth mentioning: this timing is if you have to do the searches sequentially (eg, you don't know the full set of searches to be made at the start). If you can make them all at once, there's `PDT = setnames(rbindlist(points), names(dtNoKey)); dtNoKey[PDT, on=names(dtNoKey), .N > 0, nomatch=0, by=.EACHI]` which is pretty fast. – Frank May 22 '18 at 08:52
  • Agreed. It all depends on the dimension of OP's problem – chinsoon12 May 22 '18 at 09:15
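
The batched lookup Frank describes can be tried on the question's small table. This is a sketch under stated assumptions: `pts`, `res` and the column name `found` are made up for illustration, and `nomatch = 0` is deliberately left out (unlike the one-liner in the comment) so that points with no match stay in the result and are reported as FALSE.

```r
library(data.table)

# Table from the question
dt <- data.table(x = c(1, 2, 3, 4, 5), y = c(2, 3, 4, 5, 6))

# All query points known up front, one per row (hypothetical values)
pts <- data.table(x = c(2, 2, 5), y = c(3, 5, 6))

# One join for all points; by = .EACHI evaluates j once per row of pts.
# With the default nomatch, an unmatched point gives .N == 0.
res <- dt[pts, on = c("x", "y"), .(found = .N > 0), by = .EACHI]
print(res)
```

Each row of `res` pairs a query point with its `found` flag, in the same order as `pts`.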