
I've got 81,000 records in my test frame, and duplicated() reports that 2,039 of them are identical matches. One answer to "Find duplicated rows (based on 2 columns) in Data Frame in R" suggests a method for creating a smaller frame of just the duplicate records. This works for me, too:

dup <- data.frame(as.numeric(duplicated(df$var))) # creates df with a binary var flagging duplicated rows
colnames(dup) <- c("dup")                         # renames column for simplicity
df2 <- cbind(df, dup)                             # bind to original df
df3 <- subset(df2, dup == 1)                      # subset df using the binary var for duplicates

But it seems, as the poster noted, inelegant. Is there a cleaner way to get the same result: a view of just those records that are duplicates?

In my case I'm working with scraped data, and I need to figure out whether the duplicates exist in the original source or were introduced by my scraping.
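To get a sense of how widespread the duplication is, I've been tabulating the key column (a rough sketch; df$var stands in for my actual key):

counts <- table(df$var)   # occurrences of each value
counts[counts > 1]        # the duplicated values, with how often each appears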

– Amanda

2 Answers


duplicated(df) will give you a logical vector (TRUE/FALSE values), which you can then use as an index into your data frame's rows.

# indx will contain TRUE values wherever in df$var there is a duplicate
indx <- duplicated(df$var)
df[indx, ]  #note the comma 

You can put it all together in one line:

df[duplicated(df$var), ]  # again, the comma, to indicate we are selecting rows
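Note that duplicated() on its own marks only the second and later occurrences of each value. If you want a view that includes the first copy of every duplicated record as well, a common sketch is to OR it with a fromLast pass:

# marks every copy of a duplicated value, first occurrences included
all_dups <- duplicated(df$var) | duplicated(df$var, fromLast = TRUE)
df[all_dups, ]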
– Ricardo Saporta
  • Thanks. Will test later tonight. Time for a break from my headfirst floundering dive into R. – Amanda Nov 27 '12 at 23:20
  • I wound up with this: `dupes <- df[duplicated(df),c('last','first','external_id')]` and then `dupes.unique <- unique(dupes)` -- as it turns out some of the duplicates appear 10 or 12 times. With that I can go back to my source and confirm that I didn't introduce the duplication. – Amanda Nov 29 '12 at 03:41
doops <- which(duplicated(df$var))  # row numbers of the duplicate entries
uniques <- df[-doops, ]             # everything except the duplicates
duplicates <- df[doops, ]           # just the duplicates

This is the logic I generally use when I am trying to remove the duplicate entries from a data frame.
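One caveat: if df$var happens to contain no duplicates, which() returns integer(0), and df[-doops, ] then drops every row instead of none. A sketch using logical indexing sidesteps that edge case:

dups <- duplicated(df$var)   # logical vector, safe even when there are zero duplicates
uniques <- df[!dups, ]
duplicates <- df[dups, ]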

– Roland