
For example, I have a table as follows:

DT <- data.table(
  A = c(1,1,1,2,2,2,3,3,3), 
  B = c(1,2,3,1,2,3,1,2,3),
  key = "A"
)

I want to delete rows under a condition such as "A" == 2 and "B" == 1, since there is already a row with "A" == 1 and "B" == 2.

In short, I want to delete any row whose (A, B) pair already appears symmetrically in a previous row. How can I do this?
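To make the goal concrete, here is the example table and the result I am after (assuming (1, 2) and (2, 1) count as the same pair):

```r
library(data.table)
DT <- data.table(
  A = c(1,1,1,2,2,2,3,3,3),
  B = c(1,2,3,1,2,3,1,2,3),
  key = "A"
)
# desired result: keep only the first occurrence of each unordered (A, B) pair
#    A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
```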

郭新宇
  • @chinsoon12 i think we landed the same place, except I found the `duplicated.matrix` method – MichaelChirico Aug 24 '18 at 03:19
    If you are trying to remove duplicated rows irrespective of column positions, there are many duplicates for this. https://stackoverflow.com/questions/22980423/r-find-duplicated-rows-regardless-of-order or https://stackoverflow.com/questions/9028369/removing-duplicate-combinations-irrespective-of-order or https://stackoverflow.com/questions/25297812/pair-wise-duplicate-removal-from-dataframe – Ronak Shah Aug 24 '18 at 03:22

4 Answers


Maybe not the most efficient, but this leverages the `duplicated.matrix` method:

DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
#    A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
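To see why this works, here is a sketch of the intermediate step (using the question's example DT): `apply(cbind(A, B), 1L, sort)` sorts each (A, B) pair and returns the pairs as the *columns* of a matrix, so `duplicated(..., MARGIN = 2L)` flags the columns, i.e. the unordered pairs, that have already appeared:

```r
library(data.table)
DT <- data.table(A = c(1,1,1,2,2,2,3,3,3), B = c(1,2,3,1,2,3,1,2,3))

# each column is one row of DT with its (A, B) values sorted
m <- DT[, apply(cbind(A, B), 1L, sort)]
m
#      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
# [1,]    1    1    1    1    2    2    1    2    3
# [2,]    1    2    3    2    2    3    3    3    3

# columns 4, 7, 8 repeat columns 2, 3, 6
duplicated(m, MARGIN = 2L)
# [1] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE
```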
MichaelChirico

Another option:

DT[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]

   A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3

So ...

  1. make a grouping variable by pasting B and A together,
  2. flip it to A + B on the subset where A < B, so both orderings of a pair get the same string,
  3. dedupe on the grouping variable.

The last step could alternately be unique(DT, by="g").
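Spelled out, that `unique(DT, by="g")` variant might look like this (a sketch using the question's example DT):

```r
library(data.table)
DT <- data.table(A = c(1,1,1,2,2,2,3,3,3), B = c(1,2,3,1,2,3,1,2,3))

# canonical key: (A, B) and (B, A) map to the same string
DT[, g := paste(B, A, sep = "_")][A < B, g := paste(A, B, sep = "_")]

# keep the first row per key, then drop the helper column
unique(DT, by = "g")[, !"g"]
#    A B
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 2 2
# 5: 2 3
# 6: 3 3
```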

Frank

If you only have two columns, you could do:

 unique(do.call(function(A, B) data.table(A = pmin(A, B), B = pmax(A, B)), DT))
   A B
1: 1 1
2: 1 2
3: 1 3
4: 2 2
5: 2 3
6: 3 3
Onyambu

Another method for the case where there are only two columns, using an anti-join:

dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
ans <- unique(DT)[!dupes, on=.(A, B)]

Timing code:

library(data.table)
set.seed(0L)

nr <- 1e5
nElem <- 1e3
mat <- matrix(sample(nElem, nr*2, replace=TRUE), ncol=2)
DT <- as.data.table(mat)
setnames(DT, c("A", "B"))
DT2 <- copy(DT)

library(microbenchmark)
mtd1 <- function() unique(data.frame(A=pmin(mat[, 1], mat[, 2]), B=pmax(mat[, 1], mat[, 2])))
mtd2 <- function() DT[!duplicated(apply(cbind(A, B), 1L, sort), MARGIN = 2L)]
mtd3 <- function() DT2[, g := paste(B, A, sep="_")][A < B, g := paste(A, B, sep="_")][!duplicated(g), !"g"]
mtd4 <- function() {
    dupes <- unique(DT[B > A])[unique(DT[A < B]), on=c("A"="B", "B"="A")]
    ans <- unique(DT)[!dupes, on=.(A, B)]
}
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),times=3L)

Some timings:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd1()  118.62051  129.50581  153.77216  140.39111  171.34799  202.30487     3
 mtd2() 3500.47877 3552.80879 3732.67006 3605.13882 3848.76571 4092.39260     3
 mtd3()   89.22901   92.94830   97.22658   96.66759  101.22536  105.78313     3
 mtd4()   28.61628   32.37641   50.90126   36.13654   62.04375   87.95096     3

But the fastest is eddi's method, from "data.table with two string columns of set elements, extract unique rows with each row unsorted":

mtd5 <- function() DT[DT[, .I[1L], by=.(pmin(A, B), pmax(A, B))]$V1]
microbenchmark(mtd1(),mtd2(),mtd3(),mtd4(),mtd5(),times=3L)

Timings:

Unit: milliseconds
   expr        min         lq       mean     median         uq        max neval
 mtd1()  149.62224  150.70685  175.66394  151.79146  188.68479  225.57813     3
 mtd2() 4126.51014 4140.72876 4277.37907 4154.94738 4352.81353 4550.67968     3
 mtd3()  126.01679  131.26463  134.63642  136.51247  138.94624  141.38000     3
 mtd4()   39.24141   42.42815   45.65804   45.61489   48.86635   52.11781     3
 mtd5()   12.58396   16.68156   18.21613   20.77915   21.03221   21.28527     3
chinsoon12