Find out if column in R table includes duplicate values?

Question

I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure is all unique values, but I don't know that.

Is there a way I can ask? For simplicity, let's pretend this is my data:

  var1 var2 var3
1    1    A    1
2    2    B    3
3    3    C   NA
4    4    D   NA
5    5    E    4

and I want to know whether var1 ever repeats.

score 23 · Accepted Answer · answered Nov 27 '12 at 22:07

Check out the duplicated function:

duplicated(dat$var1)  # logical vector: TRUE for each element of var1 that repeats an earlier value

See `?duplicated` for the documentation.

You should also look at the unique function.
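
As a minimal sketch, here is how both functions behave on the sample data from the question. The data frame name `dat` follows the snippet above; the variable names `dup_flags`, `has_dups`, and `all_unique` are just illustrative.

```r
# Rebuild the example data frame from the question
dat <- data.frame(
  var1 = 1:5,
  var2 = c("A", "B", "C", "D", "E"),
  var3 = c(1, 3, NA, NA, 4)
)

# Logical vector: TRUE marks a value that already appeared earlier in the column
dup_flags <- duplicated(dat$var1)

# Collapse to a single TRUE/FALSE answer to "does var1 ever repeat?"
has_dups <- any(dup_flags)

# Alternative check with unique(): nothing is dropped when all values are distinct
all_unique <- length(unique(dat$var1)) == nrow(dat)

print(has_dups)    # FALSE
print(all_unique)  # TRUE
```

Note that `duplicated` treats repeated `NA` values as duplicates of each other by default, so `var3` in this example would be reported as containing a duplicate.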

answered Nov 27 '12 at 22:07

Erik Shilts

The documentation also mentions `anyDuplicated` which might be more directly relevant. – joran Nov 27 '12 at 22:08
@Joran it should be pointed out that `any(duplicated(dat$var1))` will give a TRUE/FALSE value, whereas `anyDuplicated(dat$var1)` will give an index (or 0 if there are no duplicates). – Ricardo Saporta Nov 27 '12 at 22:45
anyDuplicated it is. So now it turns out that `anyDuplicated(j)` returns 2039, which is exactly what `anyDuplicated(j$should_be_unique)` returns. This is out of 81,000 records. I can produce a matrix object of TRUE/FALSE but can't examine that to see what some of those 2039 are. New question? – Amanda Nov 27 '12 at 22:57
@RicardoSaporta I think I found the answer to that one: http://stackoverflow.com/questions/6986657/find-duplicated-column-pairs-in-data-frame-in-r?rq=1 – Amanda Nov 27 '12 at 23:00
incidentally, based on your other questions, this reference might be helpful: http://cran.r-project.org/doc/contrib/Short-refcard.pdf – Ricardo Saporta Nov 27 '12 at 23:11
Hah. that's on my desk right now but I buried it after it couldn't tell me about colClasses. – Amanda Nov 27 '12 at 23:17
Note that `anyDuplicated` returns the index of the first duplicate, not a count of duplicates. The whole point of using it instead of `any(duplicated(...))` is that it can return a positive result faster, because it stops at the first duplicate. – Charles Nov 28 '12 at 14:08
For future searchers - I find that `table(dat$var1)` gives me the summary I intuitively want: how many duplicate values are there in this column. In other words, is this a problem, and if so, how big is it? – Dave Jul 15 '21 at 19:24
You can use `sum(duplicated(dat$var1))` to get a count of duplicates. – d84_n1nj4 Nov 12 '21 at 14:33
If the column holds dates (for instance, created with `as.Date`) instead of plain values, what changes to the code are needed to find duplicate dates? – Cláudio Siva May 20 '22 at 13:33
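
To tie the comments above together, here is a rough sketch on made-up data. The data frame `j` borrows its name from the comment above, but its columns `id` and `day` and all of the values are hypothetical, chosen only for illustration. It shows the index returned by `anyDuplicated`, the count from `sum(duplicated(...))`, the `table` summary, one way to look at the rows involved, and that a `Date` column appears to need no changes.

```r
# Hypothetical data with some repeated ids and dates (not from the question)
j <- data.frame(
  id  = c(101, 102, 103, 102, 104, 101, 102),
  day = as.Date(c("2012-11-27", "2012-11-28", "2012-11-27", "2012-11-29",
                  "2012-11-28", "2012-11-27", "2012-11-30"))
)

# Index of the first duplicated element (0 if there are none), not a count
first_dup <- anyDuplicated(j$id)

# Number of elements that repeat an earlier value
n_dups <- sum(duplicated(j$id))

# Frequency of each value; anything above 1 is repeated
counts <- table(j$id)
repeated_values <- names(counts[counts > 1])

# Every row involved in a duplication: first occurrences as well as repeats
dup_rows <- j[duplicated(j$id) | duplicated(j$id, fromLast = TRUE), ]

# The same calls work unchanged on a Date column
date_dups <- duplicated(j$day)

print(first_dup)
print(n_dups)
print(dup_rows)
```

Scanning from both ends with `fromLast = TRUE` is one way to see the original rows alongside their repeats, since `duplicated` on its own flags only the later occurrences.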

score 3 · Answer 2 · edited Oct 10 '19 at 17:40

Remove duplicate rows based on a column:

my_data[!duplicated(my_data$Col_id), ]  # ! is logical negation, so only the first row for each Col_id is kept
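
A small sketch of what that line does, assuming a made-up `my_data` with a `Col_id` column (the `value` column, the contents, and the name `deduped` are hypothetical):

```r
# Hypothetical data frame in which Col_id 1 and 2 each appear twice
my_data <- data.frame(
  Col_id = c(1, 2, 2, 3, 1),
  value  = c("a", "b", "c", "d", "e")
)

# Keep only the rows whose Col_id has not been seen earlier
deduped <- my_data[!duplicated(my_data$Col_id), ]

print(deduped)
```

Only the first row for each `Col_id` survives; the trailing comma inside the brackets matters, because it selects rows while keeping all columns.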

edited Oct 10 '19 at 17:40

Yaakov Bressler

answered Oct 10 '19 at 12:56

Sami Navesi
