
I have an input data frame

df <- data.frame(a = c(1,1,2,4,3,5,2,1,1,3), b = c(4,3,3,1,2,2,4,4,4,2), d = LETTERS[1:10])

I want to get

out <- data.frame(a = c(1,2,4,3,1,1,3), b = c(4,3,1,2,4,4,2), d = c("A","C","D","E","H","I","J"))

#   a b d
# 1 1 4 A
# 2 2 3 C
# 3 4 1 D
# 4 3 2 E
# 5 1 4 H
# 6 1 4 I
# 7 3 2 J

I want to extract all rows whose (a, b) pair is duplicated, treating reversed pairs such as (1, 4) and (4, 1) as duplicates of each other.

I tried df[duplicated(df[c("a")]) | duplicated(df[c("b")]) ,] but it does not work: it flags values duplicated within each column separately, not duplicated (a, b) pairs.

Any suggestions?

    I'm not sure if this is a duplicate, but it's pretty close to several other questions. The only difference from the more common case is that you want to keep the non-unique rows rather than remove them, which is just a matter of negation. E.g. [Extracting unique combination rows from a data frame](http://stackoverflow.com/questions/21194611/extracting-unique-combination-rows-from-a-data-frame-in-r) or [pair-wise duplicate removal from dataframe](http://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe). There are a variety of other questions in a similar vein. – Jota Sep 21 '16 at 23:44
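
A minimal sketch of the negation Jota describes, using a pasted pair key (the key construction here is illustrative, not taken from the linked posts):

key <- with(df, paste(pmin(a, b), pmax(a, b)))  # order-insensitive key per row
df[!(duplicated(key) | duplicated(key, fromLast = TRUE)), ]  # unique pairs (the linked questions)
df[  duplicated(key) | duplicated(key, fromLast = TRUE), ]   # repeated pairs (what is asked here)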

3 Answers


You can group the data frame by the row-wise sorted values of columns a and b (via pmin and pmax), and then keep only the rows whose group contains at least two members:

library(dplyr)
df %>% 
  group_by(pmin(a, b), pmax(a, b)) %>%  # group by the order-insensitive pair
  filter(n() >= 2) %>%                  # keep groups with at least two rows
  ungroup() %>% 
  select(a, b, d)                       # drop the helper grouping columns

# Source: local data frame [7 x 3]
# 
#       a     b      d
#   <dbl> <dbl> <fctr>
# 1     1     4      A
# 2     2     3      C
# 3     4     1      D
# 4     3     2      E
# 5     1     4      H
# 6     1     4      I
# 7     3     2      J
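
To see why grouping on pmin and pmax handles reversed pairs, here is a quick illustration: the two functions compare vectors element-wise, so every (a, b) pair maps to (smaller, larger) regardless of order.

a <- c(1, 4, 3)
b <- c(4, 1, 2)
pmin(a, b)  # 1 1 2 -> smaller element of each pair
pmax(a, b)  # 4 4 3 -> larger element of each pair
# (1, 4) and (4, 1) both map to the key (1, 4)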
Psidom
    And a base R equivalent (or at least similar approach) might be `dfDups <- with(df, cbind(pmin(a,b), pmax(a,b))); df[duplicated(dfDups) | duplicated(dfDups, fromLast = TRUE), ]` – Jota Sep 21 '16 at 23:33
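
Spelling that comment out with the steps annotated:

# build an order-insensitive key: (smaller value, larger value) per row
dfDups <- with(df, cbind(pmin(a, b), pmax(a, b)))
# keep rows whose key occurs more than once, scanning from both directions
df[duplicated(dfDups) | duplicated(dfDups, fromLast = TRUE), ]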

In base R you could use duplicated together with apply:

df[(duplicated(df[c("a", "b")]) | duplicated(df[c("a", "b")], fromLast = TRUE)) |
   apply(df, 1, function(l) sum(l[["a"]] == df$b & l[["b"]] == df$a) > 0), ]

   a b d
1  1 4 A
3  2 3 C
4  4 1 D
5  3 2 E
8  1 4 H
9  1 4 I
10 3 2 J
HubertL
    `base R` (i.e. your solution) appears to be faster! My microbenchmarks with the data provided show that your solution is almost 4 times faster than the `dplyr` solution. – Joseph Wood Sep 22 '16 at 00:24
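
A sketch of how such a timing comparison might be run (this assumes the microbenchmark package, which the thread itself does not show; exact numbers will vary by machine and data size):

library(dplyr)
library(microbenchmark)

microbenchmark(
  base = {
    dfDups <- with(df, cbind(pmin(a, b), pmax(a, b)))
    df[duplicated(dfDups) | duplicated(dfDups, fromLast = TRUE), ]
  },
  dplyr = df %>%
    group_by(pmin(a, b), pmax(a, b)) %>%
    filter(n() >= 2) %>%
    ungroup() %>%
    select(a, b, d)
)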

One can also use the pairwise max and min (pmax and pmin) to neutralize the column order, find the duplicated rows scanning both from the top and from the bottom, and merge the two results. Albeit a long solution, it might be of interest:

df <- data.frame(a = c(1,1,2,4,3,5,2,1,1,3), b = c(4,3,3,1,2,2,4,4,4,2), d = LETTERS[1:10])

out <- data.frame(a = c(1,2,4,3,1,1,3), b = c(4,3,1,2,4,4,2), d = c('A','C','D','E','H','I','J'))    


# order-insensitive representation of each (a, b) pair
mx <- with(df, pmax(a, b))
mn <- with(df, pmin(a, b))

df2 <- data.frame(mx, mn)

# duplicates scanning from the top and from the bottom
a <- df[duplicated(df2), ]
b <- df[duplicated(df2, fromLast = TRUE), ]

# union of the two sets, restored to the original row order via column d
res <- merge(a, b, all = TRUE)
res <- res[order(res$d), ]

res 
out

# check: 0 means res matches the desired output
sum(as.character(res$d) != as.character(out$d))
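
Note that merge(a, b, all = TRUE) is a full outer join on all shared columns, so here it simply forms the row-wise union of the two duplicate sets; Jota's one-liner above, df[duplicated(df2) | duplicated(df2, fromLast = TRUE), ], reaches the same result in a single subset.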
R.S.