Identify duplicate together with original observation in R (maybe by clustering)

Question

I suspect that respondents are cheating. I have found duplicate answers, but duplicated() only gives me the duplicates (without the rows they were copied from) and unique() only gives me the unique values (without the duplicates). I want to know which rows are duplicates of which observations. Is there a function in R that makes it easy to find which observations share the same answer pattern?

#df
id <- c("l","l","l","p","p","a","a","a")
show <- c("broadway","cats","alladin","broadway","cats","broadway","cats","alladin")
v1 <- c(1,2,2,1,3,1,2,1)
v2 <- c(1,2,2,2,4,1,2,3)
v3 <- c(1,2,2,5,1,1,2,4)
df <- data.frame(id,show,v1,v2,v3); df

Here's the script I've used to identify the duplicates. I am only interested in duplicates occurring in the numerical part of the dataframe, hence I only select columns 3 to 5.

#script I'm using to find duplicates
duplicates <- data.frame(which(duplicated(df[,3:5])))

This question is not a duplicate of "Identify duplicates and mark first occurrence and all others", because I am not interested in a binary output. What would really help me is a way to identify which clusters of duplicates exist. In this case, df[6,] is a duplicate of df[1,] (cluster 1), and df[3,] and df[7,] are duplicates of df[2,] (cluster 2).

Wietze's answer using the dplyr package worked well:

 library(dplyr) 
 df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

Since I am not really familiar with the grammar used in dplyr, I have one more question. It looks like a column (n) is added at the end of the dataframe, but if I save the result as an object and ask for the final column, I don't get n back. How can I, using this solution, get back to my original dataframe with the n column added? This is what my desired output would look like using the dplyr package:

id_e <- c("l","l","l","a","a")
show_e <- c("broadway","cats","alladin","broadway","cats")
v1_e <- c(1,2,2,1,2)
v2_e <- c(1,2,2,1,2)
v3_e <- c(1,2,2,1,2)
dup_cluster <- c(2,3,3,2,3)
df_expected <- data.frame(id_e,show_e,v1_e,v2_e,v3_e,dup_cluster); df_expected

Here are the two solutions that worked for me. Solution 1:

 df <- df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1) # add a column n holding the size of each duplicate group
 df <- as.data.frame(df) # convert the grouped tibble back into a plain data frame

Solution 2:

library(data.table)
setDT(df)[, .(.N, id, show), by = .(v1, v2, v3)][N > 1]

@nrussell, I don't think this is a duplicate question, because there are multiple clusters of duplicates here. df[6,] is a duplicate of df[1,] (cluster 1) and df[3,] and df[7,] are duplicates of df[2,] (cluster 2) — SHW, Oct 10 '16 at 14:43
Sorry about that, still learning Stack Overflow :). I just added an expected output. — SHW, Oct 10 '16 at 15:34
In base R you could simply combine `duplicated` with `duplicated` using `fromLast = TRUE`: `df[duplicated(df[,3:5]) | duplicated(df[,3:5], fromLast = TRUE),]` — Jaap, Oct 10 '16 at 15:42
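
Building on the base R idea in the comment above, here is a minimal sketch that also records which row each duplicate comes from. The names `key`, `first_row` and `cluster` are made up for this illustration, and it assumes the separator "|" never occurs inside the answer values.

```r
# Example data from the question
df <- data.frame(
  id   = c("l", "l", "l", "p", "p", "a", "a", "a"),
  show = c("broadway", "cats", "alladin", "broadway", "cats",
           "broadway", "cats", "alladin"),
  v1   = c(1, 2, 2, 1, 3, 1, 2, 1),
  v2   = c(1, 2, 2, 2, 4, 1, 2, 3),
  v3   = c(1, 2, 2, 5, 1, 1, 2, 4)
)

answers <- df[, 3:5]

# One string per row describing its answer pattern (assumed separator: "|")
key <- do.call(paste, c(answers, sep = "|"))

# TRUE for every row whose pattern occurs more than once, originals included
in_dup_group <- duplicated(answers) | duplicated(answers, fromLast = TRUE)

# Row number of the first occurrence of each pattern (the "original")
first_row <- match(key, key)

# Cluster id in order of first appearance; NA for rows with a unique pattern
cluster <- match(key, unique(key[in_dup_group]))

result <- cbind(df, first_row, cluster)[in_dup_group, ]
result
```

With the example data this should keep rows 1, 2, 3, 6 and 7, with rows 1 and 6 in cluster 1 and rows 2, 3 and 7 in cluster 2.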

score 3 · Answer 1 · answered Oct 10 '16 at 15:09

With the data.table package we can do:

library(data.table)
setDT(df)[, .(.N, id, show), by = .(v1, v2, v3)][N > 1]
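
If the original column order and row order should be preserved, one possible variation is to add the counts by reference instead of building a new table. This is only a sketch: the column names `n` and `dup_cluster` and the object `dups` are chosen for illustration, and `.GRP` numbers the groups in order of first appearance.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  id   = c("l", "l", "l", "p", "p", "a", "a", "a"),
  show = c("broadway", "cats", "alladin", "broadway", "cats",
           "broadway", "cats", "alladin"),
  v1   = c(1, 2, 2, 1, 3, 1, 2, 1),
  v2   = c(1, 2, 2, 2, 4, 1, 2, 3),
  v3   = c(1, 2, 2, 5, 1, 1, 2, 4)
)

# Add the group size and a group id by reference; rows keep their order
dt[, `:=`(n = .N, dup_cluster = .GRP), by = .(v1, v2, v3)]

# Keep only the rows whose answer pattern occurs more than once
dups <- dt[n > 1]
dups
```

Here `n` holds the size of each duplicate group, while `dup_cluster` labels the group itself, so rows that share an answer pattern share an id.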

answered by dww

score 2 · Accepted Answer · edited Oct 10 '16 at 15:26

Using dplyr package:

library(dplyr) 


#filter on n, do not create new column
df %>% group_by(v1, v2, v3) %>% filter(n() > 1)

#filter on n, create new column
df %>% group_by(v1, v2, v3) %>% mutate(n = n()) %>% filter(n > 1)
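
To keep the result as an ordinary data frame with the count column attached, the pipeline can be assigned to an object, ungrouped and converted. The following is a sketch under the assumption that the count column should be called `dup_cluster`, as in the expected output of the question; the object name `dups` is made up.

```r
library(dplyr)

# Example data from the question
df <- data.frame(
  id   = c("l", "l", "l", "p", "p", "a", "a", "a"),
  show = c("broadway", "cats", "alladin", "broadway", "cats",
           "broadway", "cats", "alladin"),
  v1   = c(1, 2, 2, 1, 3, 1, 2, 1),
  v2   = c(1, 2, 2, 2, 4, 1, 2, 3),
  v3   = c(1, 2, 2, 5, 1, 1, 2, 4)
)

dups <- df %>%
  group_by(v1, v2, v3) %>%
  mutate(dup_cluster = n()) %>%   # size of each answer-pattern group
  filter(dup_cluster > 1) %>%     # keep only patterns that occur more than once
  ungroup() %>%                   # drop the grouping
  as.data.frame()                 # plain data frame instead of a tibble

dups$dup_cluster
```

Note that `filter(n() > 1)` on its own only selects rows; the extra column exists only when `mutate()` creates it and the result is stored in an object.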

edited by zx8754
answered Oct 10 '16 at 15:03 by Wietze314

Yes!!!! This is totally what I need. I have a bigger dataset and am now going to match this function to that one. Might have some questions later, but for now: Thank you so much! :) – SHW Oct 10 '16 at 15:14
Ok, so here it comes. I am not really familiar with the grammar used in dplyr. It looks like a column (n) is added at the end of the dataframe, but if I save the result as an object and ask for the final column, I don't get n back. How can I, using this solution, get back to my original dataframe with the n column added? I have added a specification to my original question. – SHW Oct 10 '16 at 15:24
Ok, thanks guys. I have updated my original question so that it also includes the final solution. – SHW Oct 10 '16 at 15:42
