Does unique.data.table with by behave like dplyr::distinct with .keep_all = TRUE?

Question

I want to get unique rows in a data frame based on one variable, while still choosing which rows (based on other variables) are included.

Example:

dt <- as.data.table(list(group = c("A", "A", "B", "B", "C", "C"), number = c(1, 2, 1, 2, 2, 1)))

I would normally do this, as it allows me to always keep the row where number == 1.

dt %>% 
arrange(group, number) %>% 
distinct(group, .keep_all = TRUE)

This is now too slow, and I'm hoping the data.table equivalent will be faster.

This seems to work:

dt <- dt[order(group, number)]
unique(dt, by = c("group"))

But I couldn't find anything in the unique.data.table documentation which says that the first row per group is the one which is kept. Is it safe to assume it is?

score 3 · Answer 1 · answered Jun 02 '20 at 21:02

According to documentation

unique returns a data.table with duplicated rows removed, by columns specified in by argument. When no by then duplicated rows by all columns are removed.

We can reason from that it does return the first row by each unique group.

To complement options provided by @Ian here is another one which will probably be the fastest one.

setkeyv(dt, c("group","number"))
unique(dt, by="group")

At least as of now, because there are possible improvements coming. An example of reducing time from 3.544s to 0.075s, it needs an index rather than key, can be found in unique can be optimized on keyed data.tables #2947.

score 2 · Accepted Answer · edited Jun 02 '20 at 22:39

2

How about subsetting .SD in j?

library(data.table)
dt[order(group,number),.SD[1],by=group]
#   group number
#1:     A      1
#2:     B      1
#3:     C      1

You might also find using .I faster because it avoids assembling .SD:

In this version, we first assemble a list of row indices using the .I special symbol and subset those indices by ones that equal 1 and then take the first ([1]) by group. We then access just the indices with $V1 and subset the original dt by that.

dt[,.I[number == 1][1], by=group]
   group V1
1:     A  1
2:     B  3
3:     C  6

dt[dt[,.I[number == 1][1], by=group]$V1]
   group number
1:     A      1
2:     B      1
3:     C      1

Edit:

As @IceCreamToucan points out in the comments, another, easier to read option is with head.data.table:

dt[order(group,number), head(.SD, 1), by=group]
   group number
1:     A      1
2:     B      1
3:     C      1

edited Jun 02 '20 at 22:39

chinsoon12

25,005
4
25
35

answered Jun 02 '20 at 13:18

Ian Campbell

23,484
14
36
57

I don't understand the second version. Can you break it down a bit? – 9314197 Jun 02 '20 at 13:24
If your data is large, running `dt[order(group,number), head(.SD, 1), by=group]` is a quicker option that doesn't require breaking out `.I` and multiple `dt[`. According to [this answer](https://stackoverflow.com/questions/34753050/data-table-select-first-n-rows-within-group) `head` is on par with `.I` – IceCreamToucan Jun 02 '20 at 13:29
9314197 See my edit. Also, as @IceCreamToucan points out, `head.data.table` is very highly optimize and a surprisingly good option. – Ian Campbell Jun 02 '20 at 13:30
Ah, thanks both. The indexing option is clever, but I'll stick with head since it's easier to read. – 9314197 Jun 02 '20 at 13:34

Does unique.data.table with by behave like dplyr::distinct with .keep_all = TRUE?

2 Answers2