Strange behaviour subsetting by multiple keys using data.table

Question

I've set multiple keys in a data.table, but when I try and select rows by multiple values of the keys, it seems to return a row for each potential combination, but filled with NA for rows that do not exist.

I can get the sample code in 1c of this document, so it must be something that I'm just not seeing. Any help would be much appreciated.

library(data.table)

dt = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

setkey(dt,colA,colB)

print(dt)
# colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    B   13
# 4:    4    B   14

print(
  dt[.(2,"A")]
)
# As expected
# colA colB colC
# 1:    2    A   12

print(
  dt[.(c(2,3),"A")]
)
# colA colB colC
# 1:    2    A   12
# 2:    3    A   NA #Unexpected

print(
  dt[.(unique(colA),"A")]
)
# colA colB colC
# 1:    1    A   11
# 2:    2    A   12
# 3:    3    A   NA #Unexpected
# 4:    4    A   NA #Unexpected

Good question. Add `nomatch=0`, like `dt[.(unique(colA),"A"), nomatch=0]`. This is covered in the keys vignette: `browseVignettes("data.table")` or see the website: https://github.com/Rdatatable/data.table/wiki/Getting-started Ah, I see you already linked there. — Frank, Jun 15 '17 at 23:46
Thanks so much Frank. That worked, and I'll accept the answer if you submit it. I would be interested in knowing why that is not required for the tutorial's flights data though. — Matt, Jun 15 '17 at 23:48
Sure, could you be more specific about what you mean by "is not required"? — Frank, Jun 15 '17 at 23:49
By not required I meant that there are no NA columns returned by `flights[.(unique(origin), "MIA")]`, but I realise that it must be simply because there are no combinations without a match. Thanks again. — Matt, Jun 15 '17 at 23:51

score 4 · Accepted Answer · answered Jun 15 '17 at 23:57

DT[i] will look up each row of i in rows of DT. By default, unmatched rows of i are shown with NA. To instead drop unmatched rows, use nomatch = 0:

dt[.(unique(colA),"A"), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12

The nomatch argument is covered in the vignette the OP linked. To find the latest version of the vignette, use browseVignettes("data.table").

As a side note, it is not necessary to set keys before joining any more. Instead, on= can be used:

library(data.table)
dt2 = data.table(colA = 1:4,
                colB = c("A","A","B","B"),
                colC = 11:14)

dt2[.(unique(colA),"A"), on=.(colA, colB), nomatch=0]

#    colA colB colC
# 1:    1    A   11
# 2:    2    A   12

See Arun's answer for details on why keying is not usually necessary to improve performance in joins. It says:

Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

I typically only set keys when I am doing joins interactively, so I can skip typing out the on=.

Strange behaviour subsetting by multiple keys using data.table

1 Answers1