
Hey guys, I definitely solved this problem before, but I lost my code. Here is a simplified version of what I have.

a1 <- c(1,2,4,3,5)
a2 <- c("a","b","b","c","f")
a3 <- c(3,4,"b",1,9)
a4 <- c("c","b",2,"a","d")
a <- cbind(a1,a2,a3,a4)

a1 and a2 form one pair, and a3 and a4 form another, so each row links two pairs.

I would like to remove the duplicates, that is, rows 3 and 4, which repeat the links already in rows 2 and 1. The data comes from a BLAST run showing links between genomes and is 34,000 rows long, so an efficient solution would be great.

Thank you so much! I would also be open to doing this in another language.

Hcorg
kradja

1 Answer


We can sort the values within each row of 'a', get the logical index of the rows that are not (!) duplicated after sorting, and use that index to filter the rows.

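# Sort the values within each row; apply() returns them column-wise, so transpose back
i1 <- !duplicated(t(apply(a, 1, sort)))
# Keep only the first occurrence of each link (note: this reuses the name a1)
a1 <- a[i1,]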

The indices of the rows that remain in the dataset are

which(i1)
#[1] 1 2 5
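
Sorting each row treats all four values as one bag, so it would also merge rows whose pairs differ, such as (1, a, 3, c) and (1, c, 3, a). If that matters, a pair-aware variant is possible. The following is only a minimal sketch, not part of the original answer: the names `pair_key`, `row_key`, `keep` and `a_unique` are hypothetical, and it assumes no value contains the `|` separator. It uses the vectorized `pmin()`/`pmax()` instead of a per-row `apply()`, which may help on 34,000 rows.

```r
# Toy data from the question (cbind() coerces everything to character).
a <- cbind(a1 = c(1, 2, 4, 3, 5),
           a2 = c("a", "b", "b", "c", "f"),
           a3 = c(3, 4, "b", 1, 9),
           a4 = c("c", "b", 2, "a", "d"))

# Hypothetical helper: combine two character vectors element-wise into a key
# that does not depend on which of the two values comes first.
# Assumption: no value contains the "|" separator.
pair_key <- function(x, y) paste(pmin(x, y), pmax(x, y), sep = "|")

left    <- pair_key(a[, 1], a[, 2])  # key for the (a1, a2) pair
right   <- pair_key(a[, 3], a[, 4])  # key for the (a3, a4) pair
row_key <- pair_key(left, right)     # key that ignores the order of the two pairs

keep     <- !duplicated(row_key)     # TRUE for the first occurrence of each link
a_unique <- a[keep, , drop = FALSE]  # drop = FALSE keeps a matrix even for one row
which(keep)
```

On the toy data this keeps the same rows as the sort-based approach; the two only differ when values are shuffled across pairs.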
akrun