R - Data.table fast binary search based subset with multiple values in second key

Question

I have come across this vignette at https://cran.r-project.org/web/packages/data.table/vignettes/datatable-keys-fast-subset.html#multiple-key-point.

My data looks like this:

ID    TYPE     MEASURE_1    MEASURE_2
1     A        3            3
1     B        4            4
1     C        5            5
1     Mean     4            4
2     A        10           1
2     B        20           2
2     C        30           3
2     Mean     20           2

When I do this ... all works as expected.

setkey(dt, ID, TYPE)
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A
dt[.(unique(ID), "B")] # extract SD of all IDs with Type B
dt[.(unique(ID), "C")] # extract SD of all IDs with Type C

Whenever I try sth like this, where I want to base the keyed subset on multiple values for the second key, I only get the result of the all combinations of unique values in key 1 with only the first value defined in the vector c() for the second key. So, it only takes the first value defined in the vector and ignores all following values.

# extract SD of all IDs with one of the 3 types A/B/C    
dt[.(unique(ID), c("A", "B", "C")] 

# previous output is equivalent to 
dt[.(unique(ID), "A")] # extract SD of all IDs with Type A

# I want/expect
dt[TYPE %in% c("A", "B", "C")]

What am I missing here or is this sth I cannot do with keyed subsets?

To clarify: As I cannot leave out the key 1 in keyed subsets, the vignette calls for inclusion of the first key with unique(key1)

And defining multiple keys in key 1 works also as expected.

dt[.(c(1, 2), "A")] == dt[ID %in% c(1,2) & TYPE == "A"] # TRUE

You can get what you want with `CJ` instead of `.` to create all combinations of your two vectors. I'm not sure how `data.table` is supposed to behave when you try to join on a ragged array, but I guess it just ignores all the vectors of different lengths to the first one. EDIT: Just tested it, and it recycles to the longest vector as is typical in R. — pseudospin, Feb 27 '21 at 13:04
Didn't think of this solution before as I have mainly used this for other use cases than subsetting data. But I guess it's a viable solution. Do you happen to know though if what I want to do is actually possible with keyed subsets? I happen to like the simplicity of the syntax very much as it is way more readable — aimbotter21, Feb 27 '21 at 13:10
Shouldnt then be ```dt[CJ(ID, c("A", "B", "C")]``` be equal to ```dt[TYPE %in% c("A", "B", "C"]``` for example. With my toy data, I get a ```FALSE``` when comparing both with ```identical()``` — aimbotter21, Feb 27 '21 at 13:16
You dropped the `unique`? You'll get multiple copies of each ID - just look at the tables that are produced. — pseudospin, Feb 27 '21 at 13:49
You are looking for a semi-join. Without the unique, it would be more like a right join which is why you would see duplicates. To get expected results you could do as `dt[unique(dt[, .(ID, TYPE)], .SD]` or as @pseudospin notes, your example would be `dt[CJ(unique(ID), c("A","B"))`. To do a true semi-join, see this https://stackoverflow.com/questions/18969420/perform-a-semi-join-with-data-table — Cole, Feb 27 '21 at 13:55

score 2 · Accepted Answer · answered Feb 28 '21 at 11:31

In the data.table documention (see help("data.table") or https://rdatatable.gitlab.io/data.table/reference/data.table.html#arguments), it is mentioned :

character, list and data.frame input to i is converted into a data.table internally using as.data.table.

So, the classical recycling rule used in R (or in data.frame) applies. That is, .(unique(ID), c("A", "B", "C")), which is equivalent to list(unique(ID), c("A", "B", "C")), becomes:

as.data.table(list(unique(ID), c("A", "B", "C")))

and since the length of the longest list element (length of c("A", "B", "C")) is not a multiple of the shorter one (length of unique(ID)), you will get an error. If you want each value in unique(ID) combined with each element in c("A", "B", "C"), you should use CJ(unique(ID), c("A", "B", "C")) instead.

So what you should do is dt[CJ(unique(ID), c("A", "B", "C"))].

Note that dt[.(unique(ID), "A")] works correctly because you passed only one element for the second key and this gets recycled to match the length of unique(ID).

Thanks man! That's what I have come to understand after various testing as well. It is great to have some guidance with clear examples, as answers in the comments always are hard to read. But, your proposed answer was very close to the other ones proposed in the comments below my question. Thanks for pointing me to the ```CJ()```! Much appreciated! — aimbotter21, Feb 28 '21 at 18:29

R - Data.table fast binary search based subset with multiple values in second key

1 Answers1