We can independently sort() each row and then use !duplicated() to find which rows to preserve:
df[!duplicated(t(apply(df[1:2],1L,sort))),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
Data
df <- data.frame(A=c(1L,1L,1L,2L,2L,2L),B=c(2L,3L,4L,1L,3L,4L),prob=c(0.1,0.2,0.3,0.3,0.1,0.4));
Explanation
The first step is to extract just the two columns of interest:
df[1:2];
## A B
## 1 1 2
## 2 1 3
## 3 1 4
## 4 2 1
## 5 2 3
## 6 2 4
Then we independently sort each row with apply() and sort():
apply(df[1:2],1L,sort);
## [,1] [,2] [,3] [,4] [,5] [,6]
## [1,] 1 1 1 1 2 2
## [2,] 2 3 4 2 3 4
As you can see, apply() returns its results in a transposed orientation (each row's sorted pair becomes a column), so we have to fix it with t() to prepare for the upcoming duplicated() call:
t(apply(df[1:2],1L,sort));
## [,1] [,2]
## [1,] 1 2
## [2,] 1 3
## [3,] 1 4
## [4,] 1 2
## [5,] 2 3
## [6,] 2 4
Now we can use duplicated() to get a logical vector indicating which rows are duplicates of previous rows:
duplicated(t(apply(df[1:2],1L,sort)));
## [1] FALSE FALSE FALSE TRUE FALSE FALSE
We then invert the logical vector with negation to get just those rows that are not duplicates of any previous rows:
!duplicated(t(apply(df[1:2],1L,sort)));
## [1] TRUE TRUE TRUE FALSE TRUE TRUE
Finally, we can use the resulting logical vector to index out just those rows of df that are not duplicates of any previous rows:
df[!duplicated(t(apply(df[1:2],1L,sort))),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
Therefore, the first occurrence of every set of post-sort duplicates is retained, and the remainder are removed.
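The steps above can be consolidated into a small reusable helper; the function name dedupeUnordered and the cols parameter are our own additions, not part of the original answer:

```r
## minimal sketch of the technique above as a reusable function;
## cols names the two columns whose values form an unordered pair
dedupeUnordered <- function(df, cols = 1:2) {
    ## sort each row's pair, transpose back, and drop repeat rows
    df[!duplicated(t(apply(df[cols], 1L, sort))), ]
}

df <- data.frame(A=c(1L,1L,1L,2L,2L,2L),
                 B=c(2L,3L,4L,1L,3L,4L),
                 prob=c(0.1,0.2,0.3,0.3,0.1,0.4))
dedupeUnordered(df)
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
```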
Excellent suggestion from @RichardScriven: we can replace the t() call with the MARGIN argument of duplicated(), which will likely be slightly faster:
df[!duplicated(apply(df[1:2],1L,sort),MARGIN=2L),];
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
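As an aside not covered in the answer above: for exactly two columns, pmin() and pmax() can vectorize the per-row sort and avoid the apply() loop entirely. This is our own alternative sketch, equivalent for this two-column case:

```r
## pmin()/pmax() give the element-wise smaller and larger of A and B,
## which is the same as sorting each two-element row; duplicated()
## on the resulting data.frame then flags repeat pairs
df <- data.frame(A=c(1L,1L,1L,2L,2L,2L),
                 B=c(2L,3L,4L,1L,3L,4L),
                 prob=c(0.1,0.2,0.3,0.3,0.1,0.4))
df[!duplicated(data.frame(pmin(df$A, df$B), pmax(df$A, df$B))), ]
## A B prob
## 1 1 2 0.1
## 2 1 3 0.2
## 3 1 4 0.3
## 5 2 3 0.1
## 6 2 4 0.4
```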