
It is straightforward to filter a data.table for unique or duplicated rows; an example is provided in [Filtering out duplicated/non-unique rows in data.table](https://stackoverflow.com/questions/15776064/filtering-out-duplicated-non-unique-rows-in-data-table). Is there a more efficient way than reassigning the data.table to a new object with the duplicated entries removed?

library(data.table)
dt <- data.table(
  V1=LETTERS[c(1,1,1,1,2,3,3,5,7,1)],
  V2=LETTERS[c(2,3,4,2,1,4,4,6,7,2)]
)
filtered.dt <- unique(dt, by = "V2")

Is there perhaps a more efficient way than this reassignment?
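As a point of comparison, the `unique()` call above and a `duplicated()`-based subset should select the same rows; a minimal sketch (reusing the sample data from the question):

```r
library(data.table)

dt <- data.table(
  V1 = LETTERS[c(1, 1, 1, 1, 2, 3, 3, 5, 7, 1)],
  V2 = LETTERS[c(2, 3, 4, 2, 1, 4, 4, 6, 7, 2)]
)

# Both keep the first row for each value of V2;
# neither modifies dt in place -- each allocates a new table.
a <- unique(dt, by = "V2")
b <- dt[!duplicated(V2)]

identical(a, b)  # TRUE
```

Either way, the result is a fresh object that has to be assigned back, which is exactly the reassignment the question asks about.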

hannes101
    So far you can't delete rows by reference in `data.table` (if that was the correct interpretation of your question). See [How to delete a row by reference in data.table?](https://stackoverflow.com/questions/10790204/how-to-delete-a-row-by-reference-in-data-table), and corresponding open issue: [Delete rows by reference](https://github.com/Rdatatable/data.table/issues/635). – Henrik Nov 23 '18 at 12:38
  • Yes, that would be the correct interpretation, thanks for the link and the information. Marking it as a duplicate, but I think it's probably good to keep the question with the different heading. – hannes101 Nov 23 '18 at 12:40
  • @hannes101 is ``dt <- dt[!duplicated(dt),]`` also inefficient? – runr Nov 23 '18 at 14:29
  • What I mean by inefficient is the reassignment of the data.table, which always happens when using `<-`. Additionally, I think this is just another way to select the unique values, which should be equivalent. – hannes101 Nov 23 '18 at 14:35
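The reassignment the comments discuss can be made visible with `data.table::address()`: assigning the result of `unique()` binds the name to a newly allocated table, whereas `:=` updates columns in place. A small sketch (the example tables here are illustrative, not from the question):

```r
library(data.table)

dt <- data.table(V2 = LETTERS[c(2, 3, 4, 2, 1)])
old_addr <- address(dt)

# unique() builds a new table, so after reassignment dt points
# to a different object in memory.
dt <- unique(dt, by = "V2")
address(dt) == old_addr  # FALSE

# By contrast, := modifies the existing table by reference,
# so the address is unchanged.
dt2 <- data.table(x = 1:3)
addr2 <- address(dt2)
dt2[, x := x * 2L]
address(dt2) == addr2  # TRUE
```

This is why row deletion by reference (the linked open issue) would be needed to avoid the copy entirely: `:=` can only add, remove, or modify columns in place, not drop rows.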

0 Answers