20

What happens if I want to select all the rows in a data.table that do not contain a particular value in the key variable using binary search? By the way, what is the correct jargon for what I want to do? Is it "nojoin"? Is it "negative selection"?

DT = data.table(x=rep(c("a","b","c"),each=3), y=c(1,3,6), v=1:9)
setkey(DT,x)

Lets do a positive selection for all rows where x=="a" but using binary search

DT["a"]

That's beautiful but I want the opposite of that. I want all the rows that are not "a" in other words where x!="a"

DT[x!="a"]

That is a vector scanning. The above line works but is uses vector scanning. I want to use binary. I was expecting the following to work, but alas...

DT[!"a"]
DT[-"a"]

The above two do not work and trying to play with nomatch got me nowhere.

Farrel
  • 10,244
  • 19
  • 61
  • 99
  • Did you try `DT[x!="a"]`? Or are you asking if there is a simpler way without specifying the key? – Gavin Simpson Sep 07 '12 at 13:39
  • 4
    @GavinSimpson I did try `DT[x!="a"]` and it works. However, that method uses vector scanning which will scale slowly for big data sets. I believe that vector scanning is what we try to get away from in data.table syntax. – Farrel Sep 07 '12 at 13:45
  • OK, that wasn't 100% clear to me from the Q. Can't help further as not a big data.table user. There are a few hereabouts though. – Gavin Simpson Sep 07 '12 at 13:47
  • 4
    FYI the common name for this operation is an anti-join – hadley Sep 25 '13 at 14:49

2 Answers2

20

The idiom is this:

DT[-DT["a", which=TRUE]]

   x y v
1: b 1 4
2: b 3 5
3: b 6 6
4: c 1 7
5: c 3 8
6: c 6 9

Inspiration from:


Update. New in v1.8.3 is not-join syntax. Farrel's first expectation (! rather than -) has been implemented :

DT[-DT["a",which=TRUE,nomatch=0],...]   # old idiom
DT[!"a",...]                            # same result, now preferred.

See the NEWS item for more detailed info and example.

Community
  • 1
  • 1
Andrie
  • 176,377
  • 47
  • 447
  • 496
  • I notice that the resulting data.table loses its `key=` info. If that's not desired, try this slight modification: `res <- setkeyv(DT[-DT["a", which=TRUE]], key(DT))`. – Josh O'Brien Sep 07 '12 at 19:23
  • 2
    @JoshO'Brien Oops, well spotted. Have just filed [FR#2266 - All negative numeric i should retain key](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2266&group_id=240&atid=978). – Matt Dowle Sep 09 '12 at 22:30
  • 1
    @JoshO'Brien, Andrie `DT[!"a"]` now in v1.8.3. Very grateful if you have a moment to check it and shout if you foresee any issues? Thanks. Just added update to Andrie's answer. – Matt Dowle Oct 25 '12 at 00:28
  • @MatthewDowle -- That's terrific. I'll take a look as soon as I can. Cheers. – Josh O'Brien Oct 25 '12 at 00:31
3

Andrie's answer is great, and is what I'd probably use. Interestingly, though, the following construct seems to be (just a bit) faster, especially as the size of the data.tables increase.

DT[J(x = unique(DT)[x!="a"][,x])]

##-------------------------------- Timings -----------------------------------##

library(data.table)
library(rbenchmark)

DT = data.table(x=rep(c("a","b","c"),each=45e5), y=c(1,3,6), v=1:9, key="x")
Josh <- function() DT[J(x = unique(DT)[x!="a"][,x])]
Andrie <- function() DT[-DT["a", which=TRUE]]

## Compare results
identical(Josh(), setkey(Andrie(), "x"))  
# [1] TRUE

## Compare timings
benchmark(replications = 10, order="relative", Josh=Josh(), Andrie=Andrie())
    test replications elapsed relative user.self sys.self user.child sys.child
1   Josh           10   17.50    1.000     14.78      3.6         NA        NA
2 Andrie           10   18.75    1.071     16.52      3.2         NA        NA

I'd be especially tempted to use this if DT[,x] could be made to return a data.table rather than a vector. Then, the construct could be simplified a bit to DT[unique(DT[,x])[x!="a"]]. Also, it would then work even when there are mulitiple columns in the key, which it currently does not.

Josh O'Brien
  • 159,210
  • 26
  • 366
  • 455
  • Wow! Fantastic to see friendly competition. Benchmarks using the fastest of 3 runs (such as `microbenchark`, despite its name suggesting it's only for small timings, it reports min and max of by default) is a bit more industry standard: cache effects can be significant in benchmarks like this. – Matt Dowle Sep 07 '12 at 19:04
  • Might also be better with a range of inputs before concluding; e.g., where the not-join selects a large subset vs a small, and where the number of groups, and size of each group, are varied too. Known serious slow down [Bug#2216](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2216&group_id=240&atid=975) must bite with some cases, too. – Matt Dowle Sep 07 '12 at 19:06
  • Further, `unique(DT)` will slow down as the number of columns increases, depending on how much uniqueness there is. And factor columns might be useful here, to quickly do a `setdiff` on the levels only. In summary, the relative merits of each approach could stretch to several pages of analysis! Once you've worked it out (and I don't know myself) `DT[-"a"]` could switch to the best method automatically when implemented [FR#1384](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=1384&group_id=240&atid=978)! :) – Matt Dowle Sep 07 '12 at 19:12
  • Darn -- I was hoping that just throwing this out into the ether (for you to catch) would get me a definitive answer ;) Based on the points you make, I'd be even more inclined to stick with Andrie's approach (taking care to reset the key info that it drops). – Josh O'Brien Sep 07 '12 at 19:22