
I've got 81,000 records in my test frame, and duplicated() reports that 2,039 of them are identical matches. One answer to "Find duplicated rows (based on 2 columns) in Data Frame in R" suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:

dup <- data.frame(as.numeric(duplicated(df$var))) # creates df with a binary var flagging duplicated rows
colnames(dup) <- c("dup")                         # renames column for simplicity
df2 <- cbind(df, dup)                             # bind to original df
df3 <- subset(df2, dup == 1)                      # subset df using the binary var for duplicates

But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?

In my case I'm working with scraped data, and I need to figure out whether the duplicates exist in the original source or were introduced by my scraping.
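To get a sense of how widespread the duplication is, I've been tabulating the key column (a rough sketch; df$var stands in for my actual key):

counts <- table(df$var)   # occurrences of each value
counts[counts > 1]        # the duplicated values, with how often each appears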

– Amanda

2 Answers


duplicated(df) will give you a logical vector (TRUE/FALSE values), which you can then use as an index into your data frame's rows.

# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ]  #note the comma 

You can put it all together in one line:

df[duplicated(df$var), ]  # again, the comma, to indicate we are selecting rows
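Note that duplicated() on its own marks only the second and later occurrences of each value. If you want a view that includes the first copy of every duplicated record as well, a common sketch is to OR it with a fromLast pass:

# marks every copy of a duplicated value, first occurrences included
all_dups <- duplicated(df$var) | duplicated(df$var, fromLast = TRUE)
df[all_dups, ]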
– Ricardo Saporta
  • Thanks. Will test later tonight. Time for a break from my headfirst floundering dive into R. – Amanda Nov 27 '12 at 23:20
  • I wound up with this: `dupes <- df[duplicated(df),c('last','first','external_id')]` and then `dupes.unique <- unique(dupes)` -- as it turns out some of the duplicates appear 10 or 12 times. With that I can go back to my source and confirm that I didn't introduce the duplication. – Amanda Nov 29 '12 at 03:41
doops <- which(duplicated(df$var))  # row numbers of the duplicate entries
uniques <- df[-doops, ]             # everything except the duplicates
duplicates <- df[doops, ]           # just the duplicates

This is the logic I generally use when I am trying to remove the duplicate entries from a data frame.
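One caveat: if df$var happens to contain no duplicates, which() returns integer(0), and df[-doops, ] then drops every row instead of none. A sketch using logical indexing sidesteps that edge case:

dups <- duplicated(df$var)   # logical vector, safe even when there are zero duplicates
uniques <- df[!dups, ]
duplicates <- df[dups, ]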

– Roland