5

I am trying to find the quickest way to subset a large dataset by several numeric columns. As promised by data.table, the time taken to do binary search is much quicker than for vector scanning. Binary search, however, requires setkey to be performed beforehand. As you see in this code, it takes an exceptionally long time! Once you take that time into account, vector scanning is much much faster:

set.seed(1)
n=10^7
nums <- round(runif(n,0,10000))
DT = data.table(s=sample(nums,n), exp=sample(nums,n), 
         init=sample(nums,n), contval=sample(nums,n))
this_s = DT[0.5*n,s] 
this_exp = DT[0.5*n,exp]
this_init = DT[0.5*n,init]
system.time(ans1<-DT[s==this_s&exp==this_exp&init==this_init,4,with=FALSE])
#   user  system elapsed 
#   0.65    0.01    0.67 
system.time(setkey(DT,s,exp,init))
#   user  system elapsed 
#  41.56    0.03   41.59 
system.time(ans2<-DT[J(this_s,this_exp,this_init),4,with=FALSE])
#   user  system elapsed 
#    0       0       0 
identical(ans1,ans2)
# [1] TRUE

Am I doing something wrong? I've read through the data.table FAQs etc. Any help would be greatly appreciated.

Many thanks.

Matt Dowle
  • 58,872
  • 22
  • 166
  • 224
mww
  • 101
  • 5
  • Setting a key sorts your data.table and with that many unique key values that is an expensive operation. It's not worth doing that, if you do only one search afterwards. However, the idea is that you set a key once and then enjoy the observed speed-up repeatedly. – Roland Dec 21 '13 at 15:17
  • 2
    Part of the problem is that, the way you have it set up, your key fields are numeric (double precision float), rather than integer. If you set `num <- 1:10000` instead of `num <- round(runif(n,0,1000))` then indexing is about 8 times faster. Still a bit slower than vector scanning without keys in this case, though. – jlhoward Dec 21 '13 at 15:33

1 Answers1

5

The line :

nums <- round(runif(n,0,10000))

leaves nums as type numeric not integer. That makes a big difference. The data.table FAQs and introduction are geared towards integer and character columns; you won't see setkey as slow on those types. For example :

nums <- as.integer(round(runif(n,0,10000)))
...
setkey(DT,s,exp,init)  # much faster now

Two further points though ...

First, the ordering/sorting operations are much faster in the current development version of data.table v1.8.11. @jihoward is right on about sorting on numeric columns being much more time-consuming operation. But, still it's about 5-8x faster in 1.8.11 (because of a 6-pass radix order implementation, check this post). Comparing the time taken for the setkey operation between 1.8.10 and 1.8.11:

# v 1.8.11
system.time(setkey(DT,s,exp,init))
#    user  system elapsed 
#   8.358   0.375   8.844 

# v 1.8.10
system.time(setkey(DT,s,exp,init))
#   user  system elapsed 
# 66.609   0.489  75.216 

It's a 8.5x improvement on my system. So, my guess is this'd take about 4.9 seconds on yours.

Second, as @Roland mentions, if your objective is to perform a couple of subsetting and that is ALL you're going to do, then of course it doesn't make sense to do a setkey as, it has to find the order of columns and then reorder the entire data.table (by reference so that the memory footprint is very minimal, check this post for more on setkey).

Community
  • 1
  • 1
Arun
  • 116,683
  • 26
  • 284
  • 387