unique working incorrectly with data.table

Question

I've discovered some interesting behavior in data.table, and I'm curious if someone can explain to me why this is happening. I'm merging two data.tables (in this MWE, one has 1 row and the other 2 rows). The merged data.table has two unique rows, but when I call unique() on the merged data.table, I get a data.table with one row. Am I doing something wrong? Or is this a bug?

Here's an MWE:

library(data.table)
X = data.table(keyCol = 1)
setkey(X, keyCol)
Y = data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))
X[Y, ] # 2 unique rows
unique(X[Y, ]) # Only 1 row???

I'd expect unique(X[Y, ]) to be the same as X[Y, ] since all rows are unique, but this doesn't seem to be the case.

Check `attr(X[Y], "sorted")`. It is keyed by `keyCol` so this is what `unique` uses. Try then `Y[X]` and then check `attr(Y[X], "sorted")` and you'll figure it out. — David Arenburg, Feb 25 '15 at 13:09
So if two rows have different values but the same key, one will be removed by data.table:::unique.data.table? — random_forest_fanatic, Feb 25 '15 at 13:12
I'm not sure what you mean by that, but actually you should check keys just with `key(X[Y])` and `key(Y[X])` — David Arenburg, Feb 25 '15 at 13:14
@DavidArenburg I'm still a bit confused... What does Y[X, ] have to do with this problem? I know that Y has more keys than X, but I don't understand why unique(X[Y, ]) will "de-duplicate" two rows with the same key but with different values. — random_forest_fanatic, Feb 25 '15 at 13:17
@Arun: thanks, that makes sense. If you want to write that as an answer I'll accept it! — random_forest_fanatic, Feb 25 '15 at 13:23

score 5 · Accepted Answer · answered Feb 25 '15 at 18:09

The default value of the by argument in unique.data.table is key(x). Therefore, calling unique(x) on a keyed data.table compares only the key columns. To override this, do:

unique(x, by = NULL)

by = NULL considers all the columns. Alternatively, you can provide by = names(x).
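As a minimal sketch of that override applied to the MWE from the question: the names joined and all_cols are only illustrative, and by = names(joined) is used here because it spells out the all-columns comparison explicitly.

```r
library(data.table)

# Rebuild the two keyed tables from the question.
X <- data.table(keyCol = 1)
setkey(X, keyCol)
Y <- data.table(keyCol = 1, otherKey = 1:2)
setkeyv(Y, c("keyCol", "otherKey"))

# Join result: two rows that share keyCol but differ in otherKey.
joined <- X[Y]

# Compare on every column, so rows that differ only in otherKey are both kept.
all_cols <- unique(joined, by = names(joined))
nrow(all_cols)  # 2
```

Passing by explicitly means the result no longer depends on which columns happen to be set as the key of the joined table.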

Arun


At least since version 1.10, unique seems to default to uniqueness by all columns for data.table https://github.com/Rdatatable/data.table/commit/11e6497b6ebf209c2376a20c0891e52e191bd96f – Valentas Apr 11 '17 at 11:29
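Since the default apparently differs between versions, one way to get the same result everywhere is to always pass by yourself. The sketch below uses a hand-built table (joined is a hypothetical stand-in for X[Y]) and prints the installed version for reference:

```r
library(data.table)

# Which data.table version is installed (the default for `by` depends on it).
print(packageVersion("data.table"))

# Stand-in for the join result: same keyCol, different otherKey.
joined <- data.table(keyCol = c(1, 1), otherKey = 1:2)

# Key-only comparison, stated explicitly: rows sharing keyCol collapse to the first.
by_key <- unique(joined, by = "keyCol")

# All-column comparison, stated explicitly: both rows survive.
by_all <- unique(joined, by = names(joined))

nrow(by_key)  # 1
nrow(by_all)  # 2
```

With by given in both calls, neither result relies on the version-specific default.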
