
I have a data set (data.frame) composed mainly of duplicated rows. I'd like to subset it so that I keep only the rows that had no duplicates in the original data. Example of the original df:

df1 <- data.frame( ID = c("1", "1","2","2","3","3","4","4"), V_1 = c(2011,2011,NA,NA,1,10,7,7), V_2= c(5,5,6,6,15,15,8,2))    

The answer I'd like to get:

df2 <- data.frame( ID = c("3","3","4","4"), V_1 = c(1,10,7,7), V_2= c(15,15,8,2))    

Using unique(df1) gives me the complete set of distinct rows (6 in this example), because it keeps one copy of each duplicated row instead of dropping those rows entirely. Since my df has more than 200k rows, that is not adequate for me.

I have searched similar questions (Subset with unique cases, based on multiple columns; Filter rows based on multiple column conditions R), but none of them helped me.

Gil33

2 Answers


Here you go

# Mark a row as duplicated if it matches an earlier row (plain duplicated())
# or a later row (duplicated() on the row-reversed data frame, reversed back)
duplicated.i <- duplicated(df1) | duplicated(df1[nrow(df1):1, ])[nrow(df1):1]
df2 <- df1[!duplicated.i, ]

#   ID V_1 V_2
# 5  3   1  15
# 6  3  10  15
# 7  4   7   8
# 8  4   7   2
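Equivalently, `duplicated()` takes a `fromLast` argument that scans from the other end, which avoids the manual row reversal (a base-R sketch; `dup` is just an illustrative name):

```r
df1 <- data.frame(ID  = c("1", "1", "2", "2", "3", "3", "4", "4"),
                  V_1 = c(2011, 2011, NA, NA, 1, 10, 7, 7),
                  V_2 = c(5, 5, 6, 6, 15, 15, 8, 2))

# TRUE for every row that appears more than once, whether the
# matching copy comes before or after it
dup <- duplicated(df1) | duplicated(df1, fromLast = TRUE)
df2 <- df1[!dup, ]
df2
#   ID V_1 V_2
# 5  3   1  15
# 6  3  10  15
# 7  4   7   8
# 8  4   7   2
```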
1

We can find the IDs of the duplicated rows and exclude every row with those IDs from the data frame

df1[!df1$ID %in% df1[duplicated(df1), 1], ]

#  ID V_1 V_2
#5  3   1  15
#6  3  10  15
#7  4   7   8
#8  4   7   2
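The same per-ID logic can also be written with `ave()` and `anyDuplicated()`, dropping every ID group that contains at least one duplicated row (a base-R sketch; `keep` is just an illustrative name):

```r
df1 <- data.frame(ID  = c("1", "1", "2", "2", "3", "3", "4", "4"),
                  V_1 = c(2011, 2011, NA, NA, 1, 10, 7, 7),
                  V_2 = c(5, 5, 6, 6, 15, 15, 8, 2))

# For each ID group, anyDuplicated() returns 0 when all of that group's
# rows are distinct; ave() spreads the group result back over its rows
keep <- ave(seq_len(nrow(df1)), df1$ID,
            FUN = function(i) anyDuplicated(df1[i, ])) == 0
df1[keep, ]
#   ID V_1 V_2
# 5  3   1  15
# 6  3  10  15
# 7  4   7   8
# 8  4   7   2
```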
Ronak Shah