pair-wise duplicate removal from dataframe

Question

This seems like a simple problem but I can't seem to figure it out. I'd like to remove duplicates from a dataframe (df) if two columns have the same values, even if those values are in the reverse order. What I mean is, say you have the following data frame:

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)

  a b
1 A A
2 A B
3 A B
4 B C
5 B A
6 B A
7 C B
8 C B

If I now remove duplicates, I get the following data frame:

df[duplicated(df),]

  a b
3 A B
6 B A
8 C B

However, I would also like to remove the row 6 in this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?

Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large.

Thanks!

an almost duplicate of http://stackoverflow.com/q/25145982/817778 - only difference being if you for some reason want to stay within `data.frame` (unlikely, especially given you say your data is large, but who knows) — eddi, Aug 13 '14 at 23:36
You're not removing duplicates with your current code, you're keeping duplicates. — thelatemail, Aug 13 '14 at 23:37
What I mean by removing duplicate is simply keeping a single occurrence of a repeated pattern across two columns and that is what I in fact am doing. — user3141121, Aug 13 '14 at 23:54

rnso · Answer 1 · 2018-02-20T17:19:23.437

8

Extending Ari's answer, to specify columns to check if other columns are also there:

a <- c(rep("A", 3), rep("B", 3), rep("C",2))
b <- c('A','B','B','C','A','A','B','B')
df <-data.frame(a,b)

df$c = sample(1:10,8)
df$d = sample(LETTERS,8)
df
  a b  c d
1 A A 10 B
2 A B  8 S
3 A B  7 J
4 B C  3 Q
5 B A  2 I
6 B A  6 U
7 C B  4 L
8 C B  5 V

cols = c(1,2)
newdf = df[,cols]
for (i in 1:nrow(df)){
    newdf[i, ] = sort(df[i,cols])
}

df[!duplicated(newdf),]
  a b c d
1 A A 8 X
2 A B 7 L
4 B C 2 P

edited Feb 20 '18 at 17:19

answered Aug 14 '14 at 01:10

rnso

23,686
25
112
234

This is the only answer that addresses the full request of the OP – Monica Heddneck Jul 29 '16 at 04:13

score 4 · Accepted Answer · answered Aug 14 '14 at 00:06

4

One solution is to first sort each row of df:

for (i in 1:nrow(df))
{
    df[i, ] = sort(df[i, ])
}
df

a b
1 A A
2 A B
3 A B
4 B C
5 A B
6 A B
7 B C
8 B C

At that point it's just a matter of removing the duplicated elements:

df = df[!duplicated(df),]
df
  a b 
1 A A
2 A B
4 B C

As thelatemail mentioned in the comments, your code actualy keeps the duplicates. You need to use !duplicated to remove them.

answered Aug 14 '14 at 00:06

Ari

1,819
14
22

1

Ari, you've been so helpful with R-related questions. Thanks for your help! – Shane H Aug 14 '14 at 00:23
1

The OP says, "Ideally I could specify which two columns to compare since the data frames could have varying columns and can be quite large." I don't see that being addressed in this solution. – Monica Heddneck Jul 29 '16 at 04:11

score 4 · Answer 3 · edited May 23 '17 at 10:30

The other answers use a for loop to assign a value for each and every row. While this is not an issue if you have 100 rows, or even a thousand, you're going to be waiting a while if you have large data of the order of 1M rows.

Stealing from the other linked answer using data.table, you could try something like:

df[!duplicated(data.frame(list(do.call(pmin,df),do.call(pmax,df)))),]

A comparison benchmark with a larger dataset (df2):

df2 <- df[sample(1:nrow(df),50000,replace=TRUE),]

system.time(
  df2[!duplicated(data.frame(list(do.call(pmin,df2),do.call(pmax,df2)))),]
)
# user  system elapsed 
# 0.07    0.00    0.06 

system.time({
  for (i in 1:nrow(df2))
  {
      df2[i, ] = sort(df2[i, ])
  }
  df2[!duplicated(df2),]
}
)
#   user  system elapsed 
#  42.07    0.02   42.09

score 3 · Answer 4 · edited Dec 23 '16 at 17:40

3

Using apply will be a better option than loops.

newDf <- data.frame(t(apply(df,1,sort)))

All you need to do now is remove duplicates.

newDf <- newDf[!duplicated(newDf),]

edited Dec 23 '16 at 17:40

approxiblue

6,982
16
51
59

answered Feb 07 '16 at 05:55

Awanindra Singh

31
2

Apply is a great utility. But I think this code will need to be modified if more than 2 concerned columns are there which also need to be maintained. – rnso Jul 29 '16 at 06:44

pair-wise duplicate removal from dataframe

4 Answers4

Linked

Related