Removing duplicate rows on the basis of specific columns

Question

How can I remove duplicate rows on the basis of specific columns while keeping the rest of the dataset intact? I tried the approaches in these links: link1, link2.

I want to identify duplicates on the basis of columns 3 to 6. If the values in those columns are the same, the processed dataset should drop the duplicate rows, as shown in the example below.

I used this code, but it gave me only half of the result:

Data <- unique(Data[, 3:6])

Let's suppose my dataset looks like this:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1  2  2  1  5  4 12  A  2 35 36 22 21
 1 22 32 31  5 34 12  A  3  5  6  2  1

What I want in my output is:

 A  B  C  D  E  F  G  H  I  J  K  L  M
 1  2  2  1  5  4 12  A  3  5  6  2  1
 1 22 32 31  5 34 12  A  3  5  6  2  1

RHertel · Answer 1 · 2015-08-07T06:47:28.193

Assuming that your data is stored as a dataframe, you could try:

Data <- Data[!duplicated(Data[,3:6]),]
#> Data
#  A  B  C  D E  F  G H I J K L M
#1 1  2  2  1 5  4 12 A 3 5 6 2 1
#3 1 22 32 31 5 34 12 A 3 5 6 2 1

The function duplicated() returns a logical vector that, in this case, indicates for each row whether the combination of the entries in columns 3 to 6 has already appeared in an earlier row. Negating this vector with ! selects the rows to keep, so the result contains only the first row for each unique combination of the entries in columns 3 to 6.
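To make the behaviour of duplicated() concrete, here is a minimal sketch that rebuilds the example data from the question and inspects the logical vector directly. The names dup, dup_last, kept_first and kept_last are only illustrative, and the fromLast = TRUE variant is an addition that was not part of the original answer.

```r
# Rebuild the example data from the question
Data <- data.frame(
  A = c(1, 1, 1),
  B = c(2, 2, 22),
  C = c(2, 2, 32),
  D = c(1, 1, 31),
  E = c(5, 5, 5),
  F = c(4, 4, 34),
  G = c(12, 12, 12),
  H = c("A", "A", "A"),
  I = c(3, 2, 3),
  J = c(5, 35, 5),
  K = c(6, 36, 6),
  L = c(2, 22, 2),
  M = c(1, 21, 1)
)

# TRUE marks a row whose columns 3 to 6 repeat an earlier row
dup <- duplicated(Data[, 3:6])
print(dup)

# Keep the first row of each combination (the approach in this answer)
kept_first <- Data[!dup, ]

# Variation: scan from the bottom so the last row of each combination is kept
dup_last <- duplicated(Data[, 3:6], fromLast = TRUE)
kept_last <- Data[!dup_last, ]

print(kept_first)
print(kept_last)
```

With this data, rows 1 and 3 survive in kept_first, while kept_last keeps rows 2 and 3, so the choice matters when the other columns differ between duplicates.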

Thanks to @thelatemail for pointing out a mistake in my previous post.

akrun · Accepted Answer · 2015-08-07T06:33:46.650

Another option is unique from data.table, which has a by argument. We convert the 'data.frame' to a 'data.table' with setDT(df1), then call unique and specify the columns in by (here df1 holds the data from the question).

 library(data.table)
 unique(setDT(df1), by= names(df1)[3:6])
 #   A  B  C  D E  F  G H I J K L M
 #1: 1  2  2  1 5  4 12 A 3 5 6 2 1
 #2: 1 22 32 31 5 34 12 A 3 5 6 2 1

unique returns a data.table with duplicated rows removed.
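Note that setDT() converts df1 by reference, so df1 itself becomes a data.table. As a hedged variation that is not from the original answer, the sketch below uses as.data.table() to work on a copy and names the columns explicitly in by. It assumes the data.table package is installed; dt and res are illustrative names.

```r
library(data.table)

# Example data from the question, read from inline text
df1 <- read.table(header = TRUE, text = "
A  B  C  D  E  F  G  H  I  J  K  L  M
1  2  2  1  5  4 12  A  3  5  6  2  1
1  2  2  1  5  4 12  A  2 35 36 22 21
1 22 32 31  5 34 12  A  3  5  6  2  1
")

# as.data.table() returns a copy, so df1 stays a plain data.frame
dt <- as.data.table(df1)

# Keep the first row for each combination of columns C, D, E and F
res <- unique(dt, by = c("C", "D", "E", "F"))
print(res)
```

Naming the columns can be easier to read than names(df1)[3:6], and it keeps working if the column order changes.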

edited Aug 07 '15 at 06:33

answered Aug 07 '15 at 06:18

@ayush What is the other question? – akrun Aug 09 '15 at 14:56
I already have the solution for the dummy data, but it doesn't work on my original dataset. I tried every possible permutation in my code, but it won't work. Can I mail you the question? Or you could ping me by mail so that I can send it. – ayush Aug 09 '15 at 15:01
@ayush I am using a SIM card to connect to the net, so downloading big datasets is costly for me. Can't you provide a dummy example that mimics your original dataset as a new post? – akrun Aug 09 '15 at 15:02
