
I have two columns and would like to retain only one row from each commutative pair. For the data below, my output should contain only one combination of (1 2); i.e., for my query (1 2) is the same as (2 1). Is there a simple way to do this in R? I already tried transposing and retaining the upper triangular matrix, but re-transposing the data back becomes a pain.

A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 1 0.3
2 3 0.1
2 4 0.4

My final output should be:

A B prob
1 2 0.1
1 3 0.2
1 4 0.3
2 3 0.1
2 4 0.4
Mukul

4 Answers


We can independently sort() each row and then use !duplicated() to find which rows to preserve:

df[!duplicated(t(apply(df[1:2],1L,sort))),];
##   A B prob
## 1 1 2  0.1
## 2 1 3  0.2
## 3 1 4  0.3
## 5 2 3  0.1
## 6 2 4  0.4

Data

df <- data.frame(A=c(1L,1L,1L,2L,2L,2L),B=c(2L,3L,4L,1L,3L,4L),prob=c(0.1,0.2,0.3,0.3,0.1,0.4));

Explanation

The first step is to extract just the two columns of interest:

df[1:2];
##   A B
## 1 1 2
## 2 1 3
## 3 1 4
## 4 2 1
## 5 2 3
## 6 2 4

Then we independently sort each row with apply() and sort():

apply(df[1:2],1L,sort);
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    1    2    2
## [2,]    2    3    4    2    3    4

As you can see, apply() returns its results in an unexpected transposition, so we have to fix it with t() to prepare for the upcoming duplicated() call:

t(apply(df[1:2],1L,sort));
##      [,1] [,2]
## [1,]    1    2
## [2,]    1    3
## [3,]    1    4
## [4,]    1    2
## [5,]    2    3
## [6,]    2    4

Now we can use duplicated() to get a logical vector indicating which rows are duplicates of previous rows:

duplicated(t(apply(df[1:2],1L,sort)));
## [1] FALSE FALSE FALSE  TRUE FALSE FALSE

We then invert the logical vector with a negation, to get just those rows that are not duplicates of any previous rows:

!duplicated(t(apply(df[1:2],1L,sort)));
## [1]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE

Finally we can use the resulting logical vector to index out just those rows of df that are not duplicates of any previous rows:

df[!duplicated(t(apply(df[1:2],1L,sort))),];
##   A B prob
## 1 1 2  0.1
## 2 1 3  0.2
## 3 1 4  0.3
## 5 2 3  0.1
## 6 2 4  0.4

Therefore, the first occurrence of every set of post-sort duplicates will be retained, and the remainder will be removed.
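
For example, here is a quick sketch showing that it really is row order that decides which of a pair survives; df_rev is a hypothetical reordering of df that puts the (2,1) row first, so (2,1) rather than (1,2) is retained:

df_rev <- df[c(4L,1L,2L,3L,5L,6L),]; ## hypothetical reordering: the (2,1) row now comes first
df_rev[!duplicated(t(apply(df_rev[1:2],1L,sort))),];
##   A B prob
## 4 2 1  0.3
## 2 1 3  0.2
## 3 1 4  0.3
## 5 2 3  0.1
## 6 2 4  0.4

So if you want the (1 2) orientation kept, make sure it occurs before (2 1), as it does in the original data.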


Excellent suggestion from @RichardScriven; we can replace the t() call with the MARGIN argument of duplicated(), which will likely be slightly faster:

df[!duplicated(apply(df[1:2],1L,sort),MARGIN=2L),];
##   A B prob
## 1 1 2  0.1
## 2 1 3  0.2
## 3 1 4  0.3
## 5 2 3  0.1
## 6 2 4  0.4
bgoldst
  • It has worked, thanks a lot. Can you please also explain what is happening at the backend? Which occurrence will it retain, the 1 2 one or the 2 1 one? I would want to retain the 1 2 one. – Mukul Jun 30 '16 at 08:30
  • Please have a look at my answer. Basically bgoldst's script works the same way in the backend, except for the `paste` part. Only the FIRST occurrence will be returned. – FlorianSchunke Jun 30 '16 at 08:35
  • You actually don't need to transpose. You can do `duplicated(apply(df[-3], 1, sort), MARGIN = 2)` to check duplication over the columns. But either way works. – Rich Scriven Jun 30 '16 at 08:39
  • Woah, thanks a lot Florian for the detailed explanation. I wonder when was the last time I had this clear a picture of the backend of R. And thanks bgoldst for this awesome snippet. Richard, you saved a little more code; thanks to you too. :) – Mukul Jun 30 '16 at 08:47

We can use data.table. Convert the 'data.frame' to 'data.table' (setDT(df1)) and group by pmin(A, B) and pmax(A, B); if the number of rows in a group is greater than 1, we take the first row, otherwise we return the rows as they are.

 library(data.table)
 setDT(df1)[, if(.N >1) head(.SD, 1) else .SD ,.(A=pmin(A, B), B= pmax(A, B))]
 #   A B prob
 #1: 1 2  0.1
 #2: 1 3  0.2
 #3: 1 4  0.3
 #4: 2 3  0.1
 #5: 2 4  0.4

Or we can simply use duplicated() on the pmax/pmin output to get a logical index and subset the data based on that.

 setDT(df1)[!duplicated(cbind(pmax(A, B), pmin(A, B)))]
 #   A B prob
 #1: 1 2  0.1
 #2: 1 3  0.2
 #3: 1 4  0.3
 #4: 2 3  0.1
 #5: 2 4  0.4
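
The same pmin()/pmax() idea also works without data.table; a quick base R sketch, assuming the df data.frame from the first answer:

 df[!duplicated(cbind(pmin(df$A, df$B), pmax(df$A, df$B))),]
 #   A B prob
 # 1 1 2  0.1
 # 2 1 3  0.2
 # 3 1 4  0.3
 # 5 2 3  0.1
 # 6 2 4  0.4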
akrun

This should work:

d <- data.frame(A=rep(1:2, each=4), B=rep(1:4, 2), p=rnorm(n=8))
> d
  A B           p
1 1 1 -1.26282557
2 1 2 -0.03627707
3 1 3  1.50063527
4 1 4 -0.30038114
5 2 1 -0.01509190
6 2 2  0.13634069
7 2 3 -0.39612927
8 2 4 -0.10895007
l <- 1:nrow(d) # Create an index vector
v <- apply(cbind(d$A, d$B), 1, sort) # Sort each pair so (2, 1) becomes (1, 2)
v <- paste(v[1,], v[2,]) # Create a key vector where (1, 2) and (2, 1) both look like '1 2'
fun <- function(x) return(x[1]) # Function for tapply to return only the first match for each pattern
i <- tapply(l, v, fun) # Get the relevant indices
res <- d[i, ] # Create the result data frame
> res
  A B          p
1 1 1 -0.6742351
2 1 2 -1.5895396
3 1 3 -1.5975784
4 1 4 -1.4764792
6 2 2 -0.1682946
7 2 3 -0.5799141
8 2 4  2.4104019

Be aware that this will use the FIRST occurrence of the pattern. (Note that the p values shown in res come from a different rnorm() draw than the printout of d above.)
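
Incidentally, since duplicated() also keeps only the first occurrence, the tapply() step can be collapsed into a single subsetting call; a sketch reusing the v key vector built above:

res <- d[!duplicated(v), ] # keeps the first row for each sorted key, same result as the tapply() version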

FlorianSchunke

Here is another solution using base R. The idea is to check, for each row in the first half of df (using sapply), whether its reversed (B, A) pair occurs anywhere in df; this gives the secondHalf vector. We then remove the mirrored rows from df. Note that this relies on the reversed duplicate of row i sitting at the mirrored position floor(n/2)+i, as it does in the sorted example data.

n <- nrow(df)
# For each row i in the first half, count rows whose (A, B) equals its reversed (B, A)
secondHalf <- sapply(seq(n/2), function(i) nrow(df[df$A==df[i,2] & df$B==df[i,1],]))
# [1] 1 0 0
toRemove <- ((floor(n/2)+1):n)*secondHalf
toRemove <- toRemove[toRemove > 0] # drop the zeros so nothing is removed when there are no duplicates
if (length(toRemove) > 0) df <- df[-toRemove,]

  # A B prob
# 1 1 2  0.1
# 2 1 3  0.2
# 3 1 4  0.3
# 5 2 3  0.1
# 6 2 4  0.4
989