I am trying to find the quickest way to subset a large dataset on several numeric columns. As promised by data.table, a binary-search subset is much quicker than a vector scan. Binary search, however, requires setkey to be run beforehand, and as you can see in the code below, that step takes an exceptionally long time. Once you take that time into account, vector scanning is much, much faster:
library(data.table)

set.seed(1)
n <- 10^7
nums <- round(runif(n, 0, 10000))   # 10 million values in [0, 10000], stored as doubles
DT <- data.table(s = sample(nums, n), exp = sample(nums, n),
                 init = sample(nums, n), contval = sample(nums, n))
# take the values from the middle row to use as the query
this_s <- DT[0.5 * n, s]
this_exp <- DT[0.5 * n, exp]
this_init <- DT[0.5 * n, init]
# vector scan (no key)
system.time(ans1 <- DT[s == this_s & exp == this_exp & init == this_init, 4, with = FALSE])
#    user  system elapsed
#    0.65    0.01    0.67
# set the key on the three numeric columns
system.time(setkey(DT, s, exp, init))
#    user  system elapsed
#   41.56    0.03   41.59
# binary search on the keyed table
system.time(ans2 <- DT[J(this_s, this_exp, this_init), 4, with = FALSE])
#    user  system elapsed
#       0       0       0
identical(ans1, ans2)
# [1] TRUE
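To put numbers on "once you take that time into account", here is the rough break-even arithmetic I have in mind, using the timings above (the 100-query figure is purely illustrative, not something my real workload requires):

# Back-of-the-envelope comparison: the one-off setkey cost (~41.6s) only pays for
# itself after roughly 41.6 / 0.67 ~ 62 lookups, since each keyed lookup above is
# effectively instantaneous.
n_lookups <- 100                        # illustrative number of repeated queries
time_vector_scan <- 0.67 * n_lookups    # repeated vector scans
time_keyed <- 41.59 + 0 * n_lookups     # setkey once, then near-zero binary searches
c(vector_scan = time_vector_scan, keyed = time_keyed)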
Am I doing something wrong? I've read through the data.table FAQ and related documentation. Any help would be greatly appreciated.
Many thanks.