
I'd like to subset a keyed data.table by matching a value of the first key column while excluding a value of the second key column.

library(data.table)

set.seed(18032)
DT <- data.table(grp1 = sample(10, 1000, TRUE),
                 grp2 = sample(10, 1000, TRUE),
                 v = rnorm(1000), key = "grp1,grp2")

My first instinct didn't work (the ! is applied too early):

DT[.(10, !10)] # !10 evaluates to FALSE (i.e. 0), so this looks up the (10, 0) subset
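To spell out why, a quick check in plain R (the negation is resolved before the join is ever built):

!10               # FALSE -- ! is applied to the scalar, not to the join
list(10, !10)     # list(10, FALSE), so the lookup is effectively for grp2 == 0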

This works, but seems inelegant:

DT[.(10, setdiff(unique(grp2), 10))] #unique(grp2) %\% 10 for the bold ;-)

And this also works, but the approach sacrifices some functionality (e.g., access to := on DT; see the illustration after the code):

setkey(DT, grp2, grp1)
DT[!.(10)][CJ(grp2, 10, unique = TRUE)]
#equivalently
DT[!.(10)][.(unique(grp2), 10)]
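To make the lost functionality concrete, a small illustration (copy_chain is just a throwaway name here): the chained result is a new table, so := on it never reaches DT.

copy_chain <- DT[!.(10)][.(unique(grp2), 10)]  # a new data.table, not a view of DT
copy_chain[, v := 0]                           # updates copy_chain by reference; DT itself is unchanged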

Have I exhausted my options, or am I missing something?

MichaelChirico
  • What is the "(10,0) subset"? There are no 0s. – IRTFM Feb 04 '16 at 02:40
  • @42- indeed, it returns something because `nomatch = NA`. Would have returned a 0-row table if we set `nomatch = 0`. – MichaelChirico Feb 04 '16 at 02:52
  • It should be faster to subset on the `==` filter first and then apply `!=`, since that reduces (potentially heavily) the number of rows the second filter has to scan. – jangorecki Feb 04 '16 at 09:52
  • I guess [this is](http://stackoverflow.com/questions/15597685/subsetting-data-table-by-2nd-column-only-of-a-2-column-key-using-binary-search) somewhat related. – David Arenburg Feb 04 '16 at 10:20
  • The `.()` syntax has to conform to `list()`, since it's converted to a data.table. I don't have many thoughts on how to go about this. Perhaps an auto-indexing extension could help these cases. – Arun Feb 04 '16 at 10:43

1 Answer


This seems to do what I expected:

DT[ grp1==10 & grp2 != 10, ]

It also seems to allow targeted assignment if you use := in the j-position.

As an example, this should succeed (with no loss of efficiency):

 DT[ grp1==10 & grp2 != 10, v := 0 ]
IRTFM
  • Is there any downside to this not directly accessing the key values? I've no idea, but thought it might be worth noting. – thelatemail Feb 04 '16 at 02:51
  • This is another option, but I believe (?) it fails to take advantage of the key's binary search – MichaelChirico Feb 04 '16 at 02:53
  • Actually some benchmarking suggests they're comparable in speed... I guess this is auto-indexing at work (?) – MichaelChirico Feb 04 '16 at 02:57
  • I'm not a master of data.table methods, but I _thought_ that this was the original method for proper selection and assignment. – IRTFM Feb 04 '16 at 07:08
  • Splitting into two `[` calls can give a speed-up, since you can use the key for `grp1==10` and then perform the second subset on the much smaller result (see the sketch after these comments) – jangorecki Feb 04 '16 at 09:53
  • My point was that you could do an assignment in the j-argument while doing a selection in the i-argument – IRTFM Feb 04 '16 at 10:04
  • @jangorecki it depends on the data size I guess, as `[.data.table` creates an additional (unnecessary) copy. Btw, when working with an integer column it is better to compare explicitly against integers in order to avoid the coercion afterwards, so this should be `DT[ grp1==10L & grp2 != 10L]` – David Arenburg Feb 04 '16 at 10:18
  • And no, this solution doesn't use `bmerge`. This can easily be validated by running `options(datatable.verbose = TRUE)` on the line before. – David Arenburg Feb 04 '16 at 10:29
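Putting the last two comments together, a rough sketch (assuming DT is still keyed on grp1, grp2 as in the original setup; the question later rekeys to grp2, grp1, so restore the key first if needed; not benchmarked here):

DT[.(10L)][grp2 != 10L]            # keyed lookup on grp1 first, then scan the small subset
DT[grp1 == 10L & grp2 != 10L]      # single vectorised scan; integer literals avoid coercion
# run options(datatable.verbose = TRUE) beforehand to see whether bmerge or an index is used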