I've got a lovely dataframe, my very first, and I'm starting to get the hang of R. One thing I haven't been able to find is a test for duplicate values. I have one column that I'm pretty sure is all unique values, but I don't know that.

Is there a way I can ask? For simplicity, let's pretend this is my data:

  var1 var2 var3
1    1    A    1
2    2    B    3
3    3    C   NA
4    4    D   NA
5    5    E    4

and I want to know whether var1 ever repeats.

Amanda

2 Answers

Check out the duplicated function:

duplicated(dat$var1)  # logical vector: TRUE wherever a value of var1 has already appeared earlier

Documentation is here.

You should also look at the unique function.
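
As a minimal sketch of how these two functions answer the question, the sample data can be rebuilt by hand and checked. The variable names `dup_var1`, `all_unique`, and `dup_var3` are chosen here for illustration and are not part of the original answer.

```r
# Sample data from the question, rebuilt by hand
dat <- data.frame(
  var1 = 1:5,
  var2 = c("A", "B", "C", "D", "E"),
  var3 = c(1, 3, NA, NA, 4)
)

# duplicated() flags each element that repeats an earlier one
dup_var1 <- duplicated(dat$var1)
any(dup_var1)  # FALSE here: var1 never repeats

# The same check with unique(): nothing is dropped when all values are distinct
all_unique <- length(unique(dat$var1)) == length(dat$var1)

# var3 holds NA twice; by default duplicated() counts the second NA as a repeat
dup_var3 <- duplicated(dat$var3)
```

Note that missing values count as duplicates of each other by default, which may or may not be what you want for a column like var3.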

Erik Shilts
  • The documentation also mentions `anyDuplicated`, which might be more directly relevant. – joran Nov 27 '12 at 22:08
  • @Joran it should be pointed out that `any(duplicated(dat$var1))` gives a TRUE/FALSE value, whereas `anyDuplicated(dat$var1)` gives an index (or 0 if there are no duplicates). – Ricardo Saporta Nov 27 '12 at 22:45
  • anyDuplicated it is. So now it turns out that `anyDuplicated(j)` returns 2039, which is exactly what `anyDuplicated(j$should_be_unique)` returns. This is out of 81,000 records. I can produce a matrix object of TRUE/FALSE but can't examine that to see what some of those 2039 are. New question? – Amanda Nov 27 '12 at 22:57
  • @RicardoSaporta I think I found the answer to that one: http://stackoverflow.com/questions/6986657/find-duplicated-column-pairs-in-data-frame-in-r?rq=1 – Amanda Nov 27 '12 at 23:00
  • Incidentally, based on your other questions, this reference might be helpful: http://cran.r-project.org/doc/contrib/Short-refcard.pdf – Ricardo Saporta Nov 27 '12 at 23:11
  • Hah. That's on my desk right now, but I buried it after it couldn't tell me about colClasses. – Amanda Nov 27 '12 at 23:17
  • Note that `anyDuplicated` returns the index of the first duplicate, not a count of duplicates. The whole point of using it instead of `any(duplicated(...))` is that it can return a positive result faster, because it stops at the first duplicate. – Charles Nov 28 '12 at 14:08
  • For future searchers - I find that `table(dat$var1)` gives me the summary I intuitively want: how many duplicate values are there in this column. In other words, is this a problem, and if so, how big is it? – Dave Jul 15 '21 at 19:24
  • You can use `sum(duplicated(dat$var1))` to get a count of duplicates. – d84_n1nj4 Nov 12 '21 at 14:33
  • If you have dates instead of plain values, for instance created with `as.Date`, what changes to the code are needed to find duplicate dates? – Cláudio Siva May 20 '22 at 13:33
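
The variants discussed in the comments above can be compared side by side in a short sketch. The vector `x`, the data frame `df`, and the dates in `d` are made-up illustration data, not taken from the thread.

```r
# Hypothetical vector with repeats (illustration only)
x <- c("a", "b", "a", "c", "b", "a")

any(duplicated(x))   # TRUE/FALSE: is there any repeat at all?
anyDuplicated(x)     # index of the first duplicate, or 0 if there is none
sum(duplicated(x))   # how many elements repeat an earlier one

# Frequency table, filtered to values that occur more than once
counts <- table(x)
counts[counts > 1]

# To inspect the offending rows, including each first occurrence,
# combine a forward and a backward pass
df <- data.frame(id = x, n = seq_along(x))
dup_rows <- df[duplicated(df$id) | duplicated(df$id, fromLast = TRUE), ]

# Date vectors need no special handling
d <- as.Date(c("2022-05-01", "2022-05-02", "2022-05-01"))
duplicated(d)
```

The `fromLast = TRUE` pass is what pulls in the first occurrence of each repeated value, so `dup_rows` shows every row that shares its id with another row.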

Remove duplicate rows based on a column:

my_data[!duplicated(my_data$Col_id), ]  # `!` is logical negation, so only the first row for each Col_id is kept
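
A small sketch of what this does in practice, assuming a made-up data frame: only the column name `Col_id` comes from the answer, while the `value` column and the names `deduped` and `deduped_last` are hypothetical.

```r
# Hypothetical data; only the column name Col_id comes from the answer
my_data <- data.frame(
  Col_id = c(1, 2, 2, 3, 1),
  value  = c("a", "b", "c", "d", "e")
)

# Keep the first row seen for each Col_id
deduped <- my_data[!duplicated(my_data$Col_id), ]

# Keep the last row for each Col_id instead
deduped_last <- my_data[!duplicated(my_data$Col_id, fromLast = TRUE), ]
```

Which row survives matters when the other columns differ between duplicates, so it is worth deciding between the two variants deliberately.
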
Yaakov Bressler
Sami Navesi