I want to keep one row per value of one variable in a data frame, while still controlling (based on other variables) which row is kept.
Example:
library(data.table)
dt <- as.data.table(list(group = c("A", "A", "B", "B", "C", "C"), number = c(1, 2, 1, 2, 2, 1)))
I would normally do this, as it allows me to always keep the row where number == 1:
library(dplyr)
dt %>%
  arrange(group, number) %>%
  distinct(group, .keep_all = TRUE)
This is now too slow, and I'm hoping the data.table equivalent will be faster.
This seems to work:
dt <- dt[order(group, number)]
unique(dt, by = c("group"))
But I couldn't find anything in the unique.data.table documentation which says that the first row per group is the one which is kept. Is it safe to assume it is?
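For what it's worth, here is a small sanity check I could run on the example. It compares the result of unique() with an explicit "first row per group" selection using .SD[1L]. The names via_unique and via_first are just mine for illustration. Agreement on this toy data is only an observation, not the documented guarantee I'm asking about.

```r
library(data.table)

# Same example data as above
dt <- as.data.table(list(group  = c("A", "A", "B", "B", "C", "C"),
                         number = c(1, 2, 1, 2, 2, 1)))

# Sort so the preferred row (lowest number) comes first within each group
dt <- dt[order(group, number)]

# Approach in question: rely on unique() keeping the first row per group
via_unique <- unique(dt, by = "group")

# Explicit alternative: take the first row of each group's subset
via_first <- dt[, .SD[1L], by = group]

# Should be TRUE if unique() kept the first row of each group here
all.equal(via_unique, via_first)
```

If the two ever disagreed, the explicit .SD[1L] version would at least state the intent directly, though I'd expect it to be slower than unique() on large data.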