Find indices of duplicated rows

Question

The duplicated function in R finds duplicate rows. To remove the duplicates, we just write df[!duplicated(df),] and they are dropped from the data frame.

But how do I find the indices of the duplicated data? If duplicated returns TRUE for some row, that row is a second (or later) occurrence of a row in the data frame, and its index is easy to obtain. How do I obtain the index of the first occurrence of that row? In other words, the index of the row to which the duplicated row is identical?

I could loop over the data.frame, but I think there is a more elegant answer to this question.

A nice method using dplyr: https://stackoverflow.com/a/28244567/ – stevec, Feb 08 '21 at 22:22
annndrey, why did you accept Sven's answer? It answers a completely different question. – Tomas, Nov 21 '21 at 02:31
I can't post an answer to the question, but since the accepted answer doesn't answer it (it returns a logical TRUE/FALSE vector that can be used to subset the data frame), one solution to the original question is: `which(duplicated(df) | duplicated(df, fromLast = TRUE))`. That gives you the indices of the duplicated rows. – OLGJ, Mar 01 '23 at 15:06

score 120 · Accepted Answer · edited Sep 14 '20 at 13:04


Here's an example:

df <- data.frame(a = c(1,2,3,4,1,5,6,4,2,1))

duplicated(df) | duplicated(df, fromLast = TRUE)
#[1]  TRUE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE

How does it work?

duplicated(df) flags every row that repeats an earlier row. With fromLast = TRUE, the search runs from the last row backwards, so it flags every row that repeats a later row. The two resulting logical vectors are combined using |, since a TRUE in at least one of them indicates a duplicated value.
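The logical vector can be turned into row indices with which(). The question also asks which earlier row each duplicate is identical to; one possible base R sketch for that is below. The key-pasting step and the variable names (dup_idx, key, first_idx, pairs) are illustrative assumptions, not part of the original answer.

```r
df <- data.frame(a = c(1, 2, 3, 4, 1, 5, 6, 4, 2, 1))

# Row indices of every row that has at least one identical copy
dup_idx <- which(duplicated(df) | duplicated(df, fromLast = TRUE))

# Collapse each row into one string so rows can be compared with match().
# Assumption: the separator "\r" never occurs in the data itself.
key <- do.call(paste, c(df, sep = "\r"))

# match(key, key) gives, for every row, the index of its first occurrence
first_idx <- match(key, key)

# For each later copy, report the row it duplicates
later <- which(duplicated(df))
pairs <- data.frame(row = later, first_occurrence = first_idx[later])
print(dup_idx)
print(pairs)
```

With this data, rows 5 and 10 point back to row 1, row 8 to row 4, and row 9 to row 2.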

edited Sep 14 '20 at 13:04 by Henrik

answered Sep 19 '12 at 13:13 by Sven Hohenstein

This doesn't answer the question at all! – Tomas Nov 21 '21 at 02:30

score 20 · Answer 2 · answered Sep 24 '12 at 00:20


If you are using a keyed data.table, you can use the following elegant syntax:

library(data.table)
DT <- data.table(A = rep(1:3, each=4), 
                 B = rep(1:4, each=3), 
                 C = rep(1:2, 6), key = "A,B,C")

DT[unique(DT[duplicated(DT)]), which = TRUE]

To unpack:

DT[duplicated(DT)] subsets the rows that are duplicates.
unique(...) returns only the unique combinations of the duplicated rows. This handles cases with more than one duplicate (triplicates and so on).
DT[..., which = TRUE] joins the duplicate rows back to the original table, with which = TRUE returning the row numbers (without which = TRUE it would return the data itself).
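The three steps above can be sketched one at a time. This assumes the data.table package is installed; the intermediate names dups, combos and idx are illustrative only. Note that setting a key sorts the table, so the row numbers returned refer to the keyed (sorted) DT, not to the original input order.

```r
library(data.table)

DT <- data.table(A = rep(1:3, each = 4),
                 B = rep(1:4, each = 3),
                 C = rep(1:2, 6),
                 key = c("A", "B", "C"))  # keying also sorts DT by A, B, C

# Step 1: the second and later copies of each repeated row
dups <- DT[duplicated(DT)]

# Step 2: one row per duplicated combination
combos <- unique(dups)

# Step 3: join back on the key; which = TRUE returns row numbers, not data
idx <- DT[combos, which = TRUE]

print(dups)
print(idx)
```

Because the join matches every row of DT that equals a duplicated combination, idx includes the first occurrences as well as the later copies.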

You could also use:

DT[, count := .N, by = list(A, B, C)][count > 1, which = TRUE]

answered Sep 24 '12 at 00:20 by mnel

In the second case, there is no need to set a key (and by is no less efficient without a key). – pommedeterresautee Oct 12 '14 at 13:00

I really like this approach, but it seems that the result of DT[duplicated(DT)] does not include the first row of each set of duplicates. For example, if I have three copies of one row, it only shows me two of them. How can I see them all? – Herman Toothrot Mar 15 '16 at 16:05
You can use a similar approach with `fromLast = TRUE`. Something like `DT[unique(DT[duplicated(DT) | duplicated(DT, fromLast = TRUE)]), which = TRUE]` – yuskam Jun 25 '20 at 08:16
