Does data.table implement fast range subsetting based on binary search? What is that syntax?

Question

I have a vector of floats. I would like to repeatedly find subsets of that vector within various ranges. My current syntax (DT[x > 1.8 & x < 2.9]) appears to vector scan (it is relatively slow).

Is there a faster syntax utilizing binary search for range/interval based sub-setting of data.tables?

Example:

set.seed(123L)
x = runif(1E6)
DT = data.table(x, key = "x")

# For foverlaps()
DT[,xtemp:=x]
range = data.table(start = 0.04, end = 0.5, key=c("start", "end"))


microbenchmark::microbenchmark(
    DT[x < 0.5 & x > 0.04], 
    x[x < 0.5 & x > 0.04],
    foverlaps(DT, range, by.x = c("x", "xtemp"))
    )

Unit: milliseconds
                                         expr       min        lq      mean    median        uq      max neval
                       DT[x < 0.5 & x > 0.04]  12.65391  16.10852  18.43412  17.23268  17.76868 104.1112   100
                        x[x < 0.5 & x > 0.04]  16.48126  19.63882  21.65813  20.31534  20.95264 113.7965   100
 foverlaps(DT, range, by.x = c("x", "xtemp")) 116.72732 131.93093 145.56821 140.09218 146.33287 226.6069   100

Done any searching? http://stackoverflow.com/search?q=data.table+ranges I do know that Arunkumar has done extensive work building range-oriented functions in data.table and he is a regular contributor to SO. — IRTFM, Nov 17 '16 at 22:05
Yes, the most relevant post was from 2 years ago: http://stackoverflow.com/questions/22320284/subsetting-a-data-table-by-range-making-use-of-binary-search and offered yet to be implemented solutions. I was hoping one had been implemented and I was just unable to find it. — nate, Nov 17 '16 at 22:07
I'll keep looking, though. The Aurnkumar tip was key - I think I found something. — nate, Nov 17 '16 at 22:09
I thought `foverlaps()` might be a solution but it solves a more complex problem and is actually slower. See here: http://stackoverflow.com/questions/24480031/roll-join-with-start-end-window/25655497#25655497 — nate, Nov 17 '16 at 22:23

score 7 · Answer 1 · edited May 23 '17 at 11:45

7

Based on the answer here, this seems to be some sort of improvement. values equal to 0.5 will be included in this scenario though:

bs <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE]; (ind[1]+1):ind[2]}]
vs <- function() x[x < 0.5 & x > 0.04]

x = runif(1E6)
DT = data.table(x, key = "x")

microbenchmark::microbenchmark(
    bs(), 
    vs()
)

#Unit: milliseconds
# expr       min        lq      mean   median        uq        max neval
# bs()  3.594993  4.150932  5.002947  4.44695  4.952283   9.750284   100
# vs() 15.054460 16.217198 18.999877 17.45298 19.554958 113.623699   100

If we modify vs() to be:

vs <- function() x[x <= 0.5 & x > 0.04]

The results from two methods are the same:

identical(bs()$x, sort(vs()))
# [1] TRUE

edited May 23 '17 at 11:45

Community

1
1

answered Nov 17 '16 at 22:22

Psidom

209,562
33
339
356

Still figuring out how this works. (And hoping for a native solution to appear.) Thanks! I'll accept your answer if nothing faster turns up. – nate Nov 17 '16 at 22:40
4

Notably, this solution on the toy data this solution was a 4x speedup. On my actual data (4E6 rows) it was a 50x speedup. – nate Nov 17 '16 at 22:49

nate · Accepted Answer · 2017-11-14T20:02:36.207

A recent version of data.table added the %between% and %inrange% operators which encapsulate this behavior. It seems to be marginally slower that Psidom's roll-based solution but handles all types (numeric/integer) as expected and is much more succinct. See below.

# data.table version 1.10.4
# R version 3.3.1 (2016-06-21)

set.seed(123L)
library(data.table)
x = runif(1E6)
DT = data.table(x)

#Psidom Answer
psidom <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE, on=.(x)]; (ind[1]+1):ind[2]}]

# Unkeyed
microbenchmark::microbenchmark(
  DT[x <= 0.5 & x >= 0.04], 
  x[x <= 0.5 & x >= 0.04],
  DT[x %between% c(0.04, 0.5)],
  DT[x %inrange% c(0.04, 0.5)],
  DT[.(0.04, 0.5), on = .(x >= V1, x <= V2), .(x.x)]
)

# Unit: milliseconds
#                                                expr        min         lq      mean    median        uq      max neval  cld
#                            DT[x <= 0.5 & x >= 0.04]  20.346712  23.983928  34.69493  25.21085  26.73657 281.4747   100  b  
#                             x[x <= 0.5 & x >= 0.04]  19.581049  22.935144  31.61551  23.83557  25.99587 145.3632   100  b  
#                        DT[x %between% c(0.04, 0.5)]   8.024091   9.293261  12.19035  11.38171  12.75843 116.5132   100 a   
#                        DT[x %inrange% c(0.04, 0.5)]  77.108485  79.871207  91.05544  81.83722  84.66684 188.8674   100   c 
#  DT[.(0.04, 0.5), on = .(x >= V1, x <= V2), .(x.x)] 189.488658 195.487681 217.55708 198.52696 205.80428 318.1696   100    d

# Keyed
setkey(DT,x)

#Psidom Answer
psidom <- function() DT[{ind <- DT[.(c(0.04, 0.5)), which=TRUE, roll=TRUE, on=.(x)]; (ind[1]+1):ind[2]}]

microbenchmark::microbenchmark(
  DT[x <= 0.5 & x >= 0.04], 
  x[x <= 0.5 & x >= 0.04],
  DT[x %between% c(0.04, 0.5)],
  DT[x %inrange% c(0.04, 0.5)],
  DT[.(0.04, 0.5), on = .(x >= V1, x <= V2), .(x.x)],
  psidom()
)

# Unit: milliseconds
#                                                expr       min        lq      mean    median        uq      max neval cld
#                            DT[x <= 0.5 & x >= 0.04] 14.550788 18.092458 21.012992 18.934781 20.055428 123.1174   100  b 
#                             x[x <= 0.5 & x >= 0.04] 19.403718 22.401709 27.296872 23.707688 24.608270 128.9123   100  b 
#                        DT[x %between% c(0.04, 0.5)]  5.439340  6.819262 10.789330  9.490118 10.561789 111.6523   100 a  
#                        DT[x %inrange% c(0.04, 0.5)] 12.871260 13.894918 21.434823 16.888748 18.128147 123.4275   100  b 
#  DT[.(0.04, 0.5), on = .(x >= V1, x <= V2), .(x.x)] 49.277678 53.516350 61.422212 54.499675 55.869354 158.1861   100   c
#                                            psidom()  4.615421  5.095880  9.482131  5.325707  8.316817 109.9318   100 a

Please, double check the limits. In some cases it's `x < 0.5 & x > 0.04` but also `x %between% c(0.05, 0.5)` (Note 0.04 vs 0.05). In addition, `%between%` and `%inrange%` use closed intervals by default. For identical results, please, use `between(x, 0.04, 0.5, incbounds = FALSE)`. — Uwe, Nov 10 '17 at 06:53
Anyway, nice benchmark. Perhaps, you can add the *non-equi join* version `DT[.(0.04, 0.5), on = .(x > V1, x < V2), .(x.x)]` for comparison? — Uwe, Nov 10 '17 at 06:54
Watch out. Psidom's solution *requires* to work on a sorted data.table. Otherwise, we get wrong results. So, the BM results for psidom on the non-keyed data.table should be removed. — Uwe, Nov 10 '17 at 07:47

Does data.table implement fast range subsetting based on binary search? What is that syntax?

2 Answers2