
I would like to subset my data based on multiple inequality conditions using the data.table package. The examples in the data.table manual show how to do this with character variables, but not with numeric inequalities. I also see how to do this using the subset function, but I would really like to take advantage of data.table's binary search speed. Below is an example of what I am trying to do.

library(data.table)

data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
data

setkey(data, X, Y, Z)

#the data.frame way
data[X > 0 & Y > 0 & Z > 0]

#the data.table way (does not work as I expected)
data[J(>0, >0, >0)]
user1491868
  • Wait, but..."the data.frame way" works here, right? And in fact, it would fail for a data.frame. `J` is for "joining", which subsets the object as a first step; but the "data.frame way" is the normal way of subsetting by inequalities. – Frank Sep 27 '13 at 14:28
  • Oh also, to clarify, `J` would take values of X, Y and Z, not conditions on them; and then merge/join on those values. And "data" needs to be keyed for this to work. – Frank Sep 27 '13 at 14:32
  • @Frank +1 We should probably do some optimization of `i` so that inequalities used the key under the hood. I imagine people get the point in the introduction vignette not to use `==` and then expect a similar concept to apply to vector scanning `>`. Maybe even `==` should use the key under the hood, since it's more natural. – Matt Dowle Sep 27 '13 at 14:55
  • @MatthewDowle Yeah, that would be cool. Something like two dots or `.&(condition for keyvar1, cond_keyvar2,...)`? I guess currently you need to do `tmp <- CJ(keyvars)[conds]; DT[tmp]` – Frank Sep 27 '13 at 15:19
  • @Frank Even simpler: `DT[X > 0 & Y > 0 & Z > 0]` would just use the key automatically. Since this is R we can take the `i` expression and optimize it before evaluation. – Matt Dowle Sep 27 '13 at 15:31
  • @MatthewDowle Oh right. I don't know the technical side at all, but if you allowed messier conditions (besides equality and inequality), like `(!(x %in% 2:3)) & y %% 2` (the latter for odd numbers), that could be really powerful. And an expression like that would (maybe?) be hard to parse... that's why I was thinking of constraining it with a `..()` or something. I'm guessing equality and inequality make sense/are good here because there's already sorting on the keys? – Frank Sep 27 '13 at 15:46
  • @Frank Exactly. If we can do `==`, `>` and `<` combined with any combination of `|` or `&` then that'll be a start. If it isn't an optimized pattern then it'll just fall back to regular vector scan. Secondary keys could be automatically built and cached. – Matt Dowle Sep 27 '13 at 16:27
  • @MattDowle this would be really useful, specially for dates. Probably leverage on `between`? – Juancentro Apr 21 '14 at 19:59
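
To illustrate what Frank describes above: `J()` expects key *values*, not conditions on them, so an inequality has to be expanded into an explicit set of key values first (e.g. with `CJ()`) and then joined on. A minimal sketch of that workaround, using the question's example data, might look like this (whether it actually beats the plain vector scan depends on the data):

library(data.table)

data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
setkey(data, X, Y, Z)

# build the cross product of key values that satisfy the conditions ...
keys <- CJ(X = unique(data$X), Y = unique(data$Y), Z = unique(data$Z))[X > 0 & Y > 0 & Z > 0]

# ... then join on those values, which uses binary search on the key;
# nomatch = 0L drops key combinations that are not present in the data
data[keys, nomatch = 0L]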

2 Answers


A fast and straightforward solution is to use the dplyr package.

install.packages("dplyr")
library(dplyr)

newdata <- filter(data, X > 0, Y > 0, Z > 0)

dplyr is proving to be one of the easiest and fastest packages for manipulating data frames. Check out this great tutorial here: http://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html

The RStudio team has also produced a nice cheat sheet, here: http://www.rstudio.com/resources/cheatsheets/
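
If you prefer dplyr's pipe syntax, the same subset can also be written as a chain; this is just a stylistic variation on the `filter()` call above:

library(dplyr)

newdata <- data %>% filter(X > 0, Y > 0, Z > 0)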

rafa.pereira
  • How is it faster or simpler than just `data[X > 0 & Y > 0 & Z > 0]`??? Did you benchmark? Or do you find writing `filter` is simpler than writing `[`? – David Arenburg Feb 12 '15 at 13:59
  • I didn't mean fastEr or simplEr , but only that dplyr solves this problem in a fast and simple way. Thanks for the heads up. – rafa.pereira Feb 12 '15 at 16:17

I ran some benchmarks:

library(dplyr)
library(data.table)
library(microbenchmark)

dt.data.frame.way <- function(data) data[X > 0 & Y > 0 & Z > 0]
dplyr.way <- function(df) filter(df, X > 0, Y > 0, Z > 0)
real.data.frame.way <- function(df) df[df$X > 0 & df$Y > 0 & df$Z > 0,]

data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
setkey(data, X, Y, Z)
df <- as.data.frame(data)

microbenchmark(times = 10,
               dt.data.frame.way(data),
               dplyr.way(df),
               real.data.frame.way(df))
# Unit: microseconds
#                     expr     min       lq       mean    median       uq        max neval
#  dt.data.frame.way(data) 710.426  754.287   871.8784  824.7565  942.998   1180.458    10
#            dplyr.way(df) 951.309 1045.246 12303.3462 1142.7440 1246.668 112775.934    10
#  real.data.frame.way(df) 137.239  162.591   181.5254  187.9785  197.373    231.594    10

Now simply clone the example data up to 5.5 million rows:

data <- data.table(X=seq(-5,5,1), Y=seq(-5,5,1), Z=seq(-5,5,1))
data <- rbindlist(lapply(1:5e5, function(i) data)) # 5500000 rows
setkey(data, X, Y, Z)
df <- as.data.frame(data)

microbenchmark(times = 10,
               dt.data.frame.way(data),
               dplyr.way(df),
               real.data.frame.way(df))
# Unit: milliseconds
#                     expr      min        lq      mean    median        uq       max neval
#  dt.data.frame.way(data) 656.2978  668.0560  730.9246  696.6560  831.0877  846.0517    10
#            dplyr.way(df) 632.4096  639.1141  709.4308  678.9436  717.3018 1015.7663    10
#  real.data.frame.way(df) 964.4298 1022.1772 1075.8448 1077.4437 1125.0037 1192.7410    10

The performance of this task seems hard to improve; it often depends on the data.
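
For reference, the keyed-join route discussed in the question's comments could be added to the same benchmark along these lines (the `dt.join.way` helper name is just for illustration); on data like this it is unlikely to beat the plain vector scan, but it shows the binary-search path the question was after:

# hypothetical helper: subset via a keyed join instead of a vector scan
dt.join.way <- function(data) {
  keys <- CJ(X = unique(data$X), Y = unique(data$Y), Z = unique(data$Z))[X > 0 & Y > 0 & Z > 0]
  data[keys, nomatch = 0L]
}

microbenchmark(times = 10,
               dt.data.frame.way(data),
               dt.join.way(data))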

jangorecki