I've discovered some interesting behavior in data.table, and I'm curious if someone can explain to me why this is happening. I'm merging two data.tables (in this MWE, one has 1 row and the other 2 rows). The merged data.table has two unique rows, but when I call unique() on the merged data.table, I get a data.table with one row. Am I doing something wrong? Or is this a bug?

Here's an MWE:

library(data.table)
X = data.table(keyCol = 1)
setkey(X, keyCol)
Y = data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))
X[Y, ] # 2 unique rows
unique(X[Y, ]) # Only 1 row???

I'd expect unique(X[Y, ]) to be the same as X[Y, ] since all rows are unique, but this doesn't seem to be the case.

random_forest_fanatic

1 Answer

The default value of the by argument of unique.data.table is key(x). So calling unique(x) on a keyed data.table compares only the key columns. To override this, use:

unique(x, by = NULL)

Setting by = NULL makes unique() consider all the columns. Alternatively, you can pass by = names(x).
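
As a rough sketch of the difference, here is the MWE from the question with by spelled out explicitly in both directions. The names merged, all_cols and key_only are just illustrative and not part of data.table. Passing by explicitly means the result should not depend on what the default happens to be in your installed version:

```r
library(data.table)

# Same setup as in the question
X <- data.table(keyCol = 1)
setkey(X, keyCol)
Y <- data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))

merged <- X[Y]  # join result with two rows

# Compare rows on every column
all_cols <- unique(merged, by = names(merged))

# Compare rows on keyCol only, which is what the question ran into
key_only <- unique(merged, by = "keyCol")

nrow(all_cols)  # 2
nrow(key_only)  # 1
```

Both rows share keyCol = 1, so restricting the comparison to that column collapses them into one, while comparing on all columns keeps both because otherKey differs.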

Arun
    At least since version 1.10, unique() on a data.table seems to default to uniqueness by all columns: https://github.com/Rdatatable/data.table/commit/11e6497b6ebf209c2376a20c0891e52e191bd96f – Valentas Apr 11 '17 at 11:29