109

I have a table with 21638 unique* rows:

vocabulary <- read.table("http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Vocabulary.txt", header=T)

This table has five columns, the first of which holds the respondent ID numbers. I want to check if any respondents appear twice, or if all respondents are unique.

To count unique IDs I can use

length(unique(vocabulary$id))

and to check if there are any duplicates I might do

length(unique(vocabulary$id)) == nrow(vocabulary)

which returns TRUE if there are no duplicates (and there aren't any here).

My question:

Is there a direct way to return the values or line numbers of duplicates?

Some further explanation:

There is an interpretation problem with using the function duplicated(), because it only returns the duplicates in the strict sense, excluding the "originals". For example, sum(duplicated(vocabulary$id)) or dim(vocabulary[duplicated(vocabulary$id),])[1] might return "5" as the number of duplicate rows. The problem is that if you only know the number of duplicates, you won't know how many rows they duplicate. Does "5" mean that there are five rows with one duplicate each, or that there is one row with five duplicates? And since you won't have the IDs or line numbers of the duplicates, you have no way of finding the "originals".
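To make the ambiguity concrete, here is a small made-up example (these vectors are hypothetical, not the survey data): two very different id vectors produce the same count.

ids_a <- c(1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6)  # five ids, each duplicated once
ids_b <- c(1, 1, 1, 1, 1, 1, 2, 3, 4, 5, 6)  # one id duplicated five times
sum(duplicated(ids_a))
# [1] 5
sum(duplicated(ids_b))
# [1] 5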


*I know there are no duplicate IDs in this survey, but it is a good example, because any of the answers given elsewhere to this question, like duplicated(vocabulary$id) or table(vocabulary$id), will output a haystack to your screen in which you'll be quite unable to find any possible rare duplicate needles.

  • 2
    This question and its answers are unlike the question listed as a duplicate. This question asks to return the values that are duplicates. The "duplicate" question posted seems to just remove duplicates, so you don't know which values/rows they are. – Rick Henderson Feb 09 '18 at 21:08
  • I agree - the current question is not a duplicate of the marked question. The current question in fact only asks to return duplicate values of a vector (only `vocabulary$id` is there to check as we see in the chosen answer). The chosen answer won't help if we want to check whole dataframe rows for duplicates and the `$id` column is not there. The marked question on the other hand asks how to return duplicate rows of a dataframe. – hanna Oct 22 '19 at 15:21

5 Answers

185

You could use table, i.e.

n_occur <- data.frame(table(vocabulary$id))

gives you a data frame with a list of ids and the number of times they occurred.

n_occur[n_occur$Freq > 1,]

tells you which ids occurred more than once.

vocabulary[vocabulary$id %in% n_occur$Var1[n_occur$Freq > 1],]

returns the records with more than one occurrence.
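For illustration only, here is how those three steps behave on a small hypothetical data frame (toy is made up and is not the vocabulary data):

toy <- data.frame(id = c(1, 2, 2, 3, 3, 3), score = c(10, 20, 21, 30, 31, 32))

n_occur <- data.frame(table(toy$id))
n_occur
#   Var1 Freq
# 1    1    1
# 2    2    2
# 3    3    3

n_occur[n_occur$Freq > 1,]
#   Var1 Freq
# 2    2    2
# 3    3    3

toy[toy$id %in% n_occur$Var1[n_occur$Freq > 1],]
#   id score
# 2  2    20
# 3  2    21
# 4  3    30
# 5  3    31
# 6  3    32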

Kikapp
  • 1
    Thank you all for your answers. I like them all (so +1 for each), and I learned a lot from each of them. I'm choosing this answer because (1) it works without installing a new library and (2) I find the logic behind it simple and elegant. –  Jun 05 '13 at 12:48
  • 1
    now that you have found the duplicated rows by id. can you now add a suffix like 234.A, 234.B etc. which will make them unique – MySchizoBuddy Oct 08 '15 at 19:33
  • 1
    To make values unique you can use the base R function make.unique. – Vasile Dec 09 '16 at 09:45
  • 3
    an alternative using dplyr to identify the duplicated `var`s. `df %>% group_by(var) %>% summarise(n=sum(n())) %>% filter(n>1) ` – Jessica Burnett Dec 30 '21 at 15:08
  • @JessicaBurnett More simply : `df %>% group_by(var) %>% summarise(n=n()) %>% filter(n>1)` – Julien May 31 '23 at 12:05
76

This will give you duplicate rows:

vocabulary[duplicated(vocabulary$id),]

This will give you the number of duplicates:

dim(vocabulary[duplicated(vocabulary$id),])[1]

Example:

vocabulary2 <- rbind(vocabulary, vocabulary[1,]) # creates a duplicate at the end
vocabulary2[duplicated(vocabulary2$id),]
#            id year    sex education vocabulary
#21639 20040001 2004 Female         9          3
dim(vocabulary2[duplicated(vocabulary2$id),])[1]
#[1] 1 #=1 duplicate

EDIT

OK, with the additional information, here's what you should do: duplicated has a fromLast option which lets you flag duplicates starting from the end. If you combine this with the normal duplicated, you get every duplicated row, "originals" included. The following example adds duplicates to the original vocabulary object (line 1 is duplicated twice and line 5 is duplicated once). I then use table to get the total number of occurrences of each duplicated ID.

#Create vocabulary object with duplicates
voc.dups <- rbind(vocabulary, vocabulary[1,], vocabulary[1,], vocabulary[5,])

#List duplicates
dups <- voc.dups[duplicated(voc.dups$id) | duplicated(voc.dups$id, fromLast=TRUE),]
dups
#            id year    sex education vocabulary
#1     20040001 2004 Female         9          3
#5     20040008 2004   Male        14          1
#21639 20040001 2004 Female         9          3
#21640 20040001 2004 Female         9          3
#51000 20040008 2004   Male        14          1

#Count duplicates by id
table(dups$id)
#20040001 20040008 
#       3        2 
Pierre Lapointe
32

Here I summarize a few ways that may return different results, so be careful:

# First assign your "id"s to an R object.
# Here's a hypothetical example:
id <- c("a","b","b","c","c","c","d","d","d","d")

#To return ALL MINUS ONE duplicated values:
id[duplicated(id)]
## [1] "b" "c" "c" "d" "d" "d"

#To return ALL duplicated values by specifying fromLast argument:
id[duplicated(id) | duplicated(id, fromLast=TRUE)]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

#Yet another way to return ALL duplicated values, using %in% operator:
id[ id %in% id[duplicated(id)] ]
## [1] "b" "b" "c" "c" "c" "d" "d" "d" "d"

Hope these help. Good luck.

krads
David C.
  • Nice. For the last approach, the call to unique() is not needed. – cumin Sep 07 '17 at 14:06
  • 1
    @David C. could you explain the third approach `duplicated(id) | duplicated(id, fromLast=TRUE)` ? – moth Jan 18 '19 at 21:33
  • 1
    @AlexandreMondaini `duplicated` assigns the second, third, fourth... occurrences as TRUE. To mark all duplicated elements as TRUE, you need to count backwards as well. The *logical OR* operator inclusively allows that. Try a simple example: `x = c(1,1); duplicated(x); duplicated(x, fromLast=TRUE)` – David C. Jan 23 '19 at 02:44
  • I tested the different options from David C.: `id[ id %in% id[duplicated(id)] ]` is about 15% faster than the 'or' version on my dataset. It would be nice if package dplyr added a function like: `repeated <- function(.x){ .x %in% .x[duplicated(.x)] }` since you have to compare all duplicates so often to de-duplicate a dataset. – Hayward Oblad Apr 27 '23 at 13:35
11

Here's a data.table solution that lists the duplicates along with the number of duplications (the count will be 1 if there are 2 copies, and so on; you can adjust that to suit your needs):

library(data.table)
dt = data.table(vocabulary)

dt[duplicated(id), cbind(.SD[1], number = .N), by = id]
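If you would rather see the total number of copies per duplicated id than the number of extra copies, one possible adjustment (a sketch reusing the dt object created above) is:

dt[, .N, by = id][N > 1]  # one row per id that occurs more than once, with its total count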
eddi
  • 1
    I like this answer... but as a novice R user I'm trying to figure out what the .SD[1] and .N means? – Vance L Albaugh Jan 28 '17 at 14:59
  • 2
    @VanceLAlbaugh I'd start here - https://rawgit.com/wiki/Rdatatable/data.table/vignettes/datatable-intro.html – eddi Jan 30 '17 at 17:40
6

A terser way, either with rev:

x[!(!duplicated(x) & rev(!duplicated(rev(x))))]

... rather than fromLast:

x[!(!duplicated(x) & !duplicated(x, fromLast = TRUE))]

... and as a helper function that returns either a logical vector or the matching elements of the original vector:

duplicates <- function(x, as.bool = FALSE) {
    is.dup <- !(!duplicated(x) & rev(!duplicated(rev(x))))
    if (as.bool) { is.dup } else { x[is.dup] }
}
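For example, on a small made-up vector:

x <- c("a", "b", "b", "c", "c", "c")
duplicates(x)                  # elements that occur more than once
# [1] "b" "b" "c" "c" "c"
duplicates(x, as.bool = TRUE)  # logical vector marking those positions
# [1] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE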

Treating vectors as data frames to pass to table is handy but can get difficult to read, and the data.table solution is fine but I'd prefer base R solutions for dealing with simple vectors like IDs.

Louis Maddox