Fast EXISTS in data.table

Question

What is the fastest way to check if a value exists in a data.table?. Suppose that

dt is a data.table of n columns with k columns being the key
keys is a list, or a value, or a data.table, or anything that can be used in the i argument of [.data.table

I'm currently doing

NROW(dt[keys,nomatch=0])!=0

Is there anything faster?

Example

require(data.table)
iniDate = as.Date("2000-01-01")
theData = data.table(a = LETTERS, d = seq(from=iniDate ,to= iniDate + length(LETTERS)*3000-1,by="day"))
theKeys = data.table(a = c("J","M","T"), d = as.Date(c("2005-01-20","2005-05-20","2013-01-12")))
setkey(theData,a,d)
NROW(theData[theKeys],nomatch=0)!=0

it's always nicer to give an example to show what you're trying to do. You say "a value". And are you checking for only one value? Also are you checking in only one column of the data.table `dt`? Is "keys" a 1-column data.table? As such I find this question vague. — Arun, Jun 27 '13 at 06:36
@Arun Sorry for the vagueness, hope the recent edit helps. I'm looking for a general solution, not just for the case where the i argument is a value or the key is just one column — Juancentro, Jul 01 '13 at 21:22
you'll have to explain by editing your question a bit more detailed than that. — Arun, Jul 01 '13 at 21:28
@Arun added an example, hope that helps. Notwithstanding that, your answer is accepted — Juancentro, Jul 04 '13 at 19:47

score 7 · Accepted Answer · answered Jun 27 '13 at 06:59

Short answer: In addition to nomatch=0, I think mult="first" would help speed it even more.

Long answer: Assuming that you want to check if a value (or more than 1 value) is present in the key column of a data.table or not, this seems to be much faster. The only assumption here is that the data.table has only 1 key column (as this is quite ambiguous to me).

my.values = c(1:100, 1000)
require(data.table)
set.seed(45)
DT <- as.data.table(matrix(sample(2e4, 1e6*100, replace=TRUE), ncol=100))
setkey(DT, "V1")
# the data.table way
system.time(all(my.values %in% .subset2(DT[J(my.values), mult="first", nomatch=0], "V1")))
   user  system elapsed 
  0.006   0.000   0.006 

# vector (scan) approach
system.time(all(my.values %in% .subset2(DT, "V1")))
   user  system elapsed 
  0.037   0.000   0.038

You can change all to any if you want to check if at least 1 value is present in the subset or not. The only difference between the two is that you first subset using data.table's approach (taking advantage of key and mult argument). As you can see the it's extremely faster (and also scales well). And then to retrieve the key columns from the subset (call it the_subset),

.subset2(the_subset, "V1") (or) the_subset$V1 (or) the_subset[["V1"]]

But, the_subset[, V1] will be slower.

Of course the same idea could be extended to many columns as well, but I'll have to know exactly what you want to do after.

Thanks Arun, just retrieving the first result will surely speed up the computations — Juancentro, Jul 01 '13 at 22:36
I'm struggling to try to extend this answer to multiple key columns right now. — mako, Feb 24 '15 at 17:00

score 3 · Answer 2 · answered Jun 26 '13 at 23:08

3

How about the base R idiom:

any(my.value %in% my.vector)

This is not a data.table specific idiom but is quite efficient I believe.

answered Jun 26 '13 at 23:08

asb

4,392
1
20
30

1

`any(my.value == my.vector)` is faster in my testing, albeit only 0.3s faster when `my.vector` is 10M values long. – thelatemail Jun 26 '13 at 23:14
Bummer! Sorry to be of only marginal help. :D – asb Jun 26 '13 at 23:16
@thelatemail presumably `my.value` is of length greater than 1 since it's called `keys` in the OP. – GSee Jun 26 '13 at 23:42
@thelatemail: Eh! I read that you meant `==` is faster than `%in%` only now! Plus, I thought you are the person asking the question. Apologies, it is too late in my part of the world. – asb Jun 26 '13 at 23:53

Fast EXISTS in data.table

2 Answers2

Linked