0

Here is my data:

mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

Some rows are duplicated, and several other rows contain the same elements in a different order. I wish to remove all rows that contain the same elements, whether those elements are in the same order (duplicated rows) or in a different order. This will retain only the first row, c(3, 5).

I checked previous questions here and here. However, my requirement is that all such rows are removed rather than leaving one such row.

My question is also different from this one, which removes all duplicated rows: I am looking for rows that are not just duplicated, but that also contain the same set of elements in a different order. For example, rows c(6, 9) and c(9, 6) should both be removed since they both contain the same set of elements.

I am looking for solutions that avoid a for loop, since my real data is large and a for loop may be slow.

Note: My full data has 40k rows and 2 columns.

user438383
Patrick
  • Is your full data only two columns? Or many? – Harrison Jones Sep 24 '21 at 12:38
  • My full data has 40k rows and 2 columns. – Patrick Sep 24 '21 at 12:40
  • This only removes duplicated rows. I wish to remove rows as long as they contain the same elements, even if the rows are not duplicated. For example, rows `c(6, 9)` and `c(9, 6)` should both be removed since they both contain the same set of elements. – Patrick Sep 24 '21 at 12:51
  • Your data is a `matrix` yet you tagged [tag:dplyr] (which does not operate on matrices), is there something else here? – r2evans Sep 24 '21 at 12:53
  • I tagged dplyr because I wondered whether it has some handy functions for this. If so, I could simply convert my data from a matrix to a data frame. – Patrick Sep 24 '21 at 12:56
  • @mnist, I don't agree with the duplicate tag. The post linked to the duplicate tag doesn't address the ordering problem, which is why the OP posted a new problem. See "My question is also different from this one..." in the original post. – jblood94 Sep 24 '21 at 16:24
  • Thank you @jblood94. I totally agree with you. – Patrick Sep 24 '21 at 16:51
  • Are your data all integer values? – jblood94 Sep 24 '21 at 17:49
  • Yes, all are integers. – Patrick Sep 25 '21 at 04:21
  • @Patrick, I'd have posted this as an answer if the question weren't closed: `m <- rowSums(1/mymat)`; `mymatNoDup <- mymat[!(duplicated(m) | duplicated(m, fromLast = TRUE)),]` It's about 2 orders of magnitude faster than any posted answer. It works for positive integers, but if you have zeros/negatives, just add to `mymat` when inverting. – jblood94 Sep 25 '21 at 12:12
  • @Patrick, sorry, that should have been `m <- rowSums(mymat + 1/mymat)`. Any binary symmetric pairing function would work. See https://math.stackexchange.com/questions/3162166/what-function-symmetric-and-has-unique-solution for another example. – jblood94 Sep 25 '21 at 13:53
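Putting the corrected comment together, a minimal sketch on the sample data from the question might look like the following. The variable names `key` and `occurs_once` and the added `drop = FALSE` are illustrative choices, not part of the original comment, and the approach assumes positive integer values as stated there.

```r
# Sample data from the question: 7 rows, 2 columns
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

# Order-independent key per row: the sum over both columns of x + 1/x
key <- rowSums(mymat + 1/mymat)

# TRUE for keys that occur exactly once
occurs_once <- !(duplicated(key) | duplicated(key, fromLast = TRUE))

# drop = FALSE keeps the result a matrix even if only one row survives
mymatNoDup <- mymat[occurs_once, , drop = FALSE]
mymatNoDup
```

Because the key is built from sums, swapping the two columns of a row leaves it unchanged, so `c(6, 9)` and `c(9, 6)` receive the same key.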

5 Answers

2

You can sort each row and then use duplicated:

tmp <- t(apply(mymat, 1, sort))
tmp[!(duplicated(tmp) | duplicated(tmp, fromLast = TRUE)), , drop = FALSE]

#     [,1] [,2]
#[1,]    3    5
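Since the data has exactly two columns, the row-wise sort can probably also be expressed with the vectorised `pmin` and `pmax` instead of `apply`. The sketch below is a variation on the idea above, not the answer's own code: the names `lo`, `hi`, `sorted` and `keep` are illustrative, and it indexes the original matrix so that the element order within each surviving row is preserved.

```r
# Sample data from the question
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

# Row-wise minimum and maximum give each row in sorted order
lo <- pmin(mymat[, 1], mymat[, 2])
hi <- pmax(mymat[, 1], mymat[, 2])
sorted <- cbind(lo, hi)

# Keep rows whose sorted form occurs exactly once
keep <- !(duplicated(sorted) | duplicated(sorted, fromLast = TRUE))

# Index the original matrix, not the sorted copy
result <- mymat[keep, , drop = FALSE]
result
```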
Ronak Shah
1

I added a little data to show that the result stays a matrix:

mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 10, 12, 13, 14, 5, 9, 6, 6, 4, 1, 4, 11, 13, 12, 15), .Dim = c(11L, 2L))

dup <- duplicated(rbind(mymat, mymat[, c(2, 1)]))
dup_fromLast <- duplicated(rbind(mymat, mymat[, c(2, 1)]), fromLast = TRUE)

mymat_duprm <- mymat[!(dup_fromLast | dup)[1:(length(dup) / 2)], ]

mymat_duprm
Harrison Jones
  • This will drop unique rows with repeated values. Try, e.g., `mymat <- matrix(c(1,3,2,3), nrow = 2)` – jblood94 Sep 24 '21 at 17:40
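The case raised in the comment can be reproduced with a short sketch. Stacking the column-swapped copy under the original makes a row such as `c(3, 3)` a duplicate of itself; appending swapped copies only for rows whose two elements differ avoids that. The names `stacked`, `kept`, `swapped` and `kept_fixed` are illustrative.

```r
# Two distinct rows: c(1, 2) and c(3, 3)
m <- matrix(c(1, 3, 2, 3), nrow = 2)

# Original approach: stack the matrix on top of its column-swapped copy
stacked <- rbind(m, m[, c(2, 1)])
dup <- duplicated(stacked) | duplicated(stacked, fromLast = TRUE)
kept <- m[!dup[seq_len(nrow(m))], , drop = FALSE]
kept  # only c(1, 2); c(3, 3) matched its own swapped copy

# Variation: append swapped copies only where the two elements differ
swapped <- m[m[, 1] != m[, 2], c(2, 1), drop = FALSE]
stacked_fixed <- rbind(m, swapped)
dup_fixed <- duplicated(stacked_fixed) | duplicated(stacked_fixed, fromLast = TRUE)
kept_fixed <- m[!dup_fixed[seq_len(nrow(m))], , drop = FALSE]
kept_fixed  # both rows are retained
```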
1

As a matrix:

tmp <- apply(mymat, 1, function(z) toString(sort(z)))
mymat[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE]
#      [,1] [,2]
# [1,]    3    5

The drop = FALSE is required only because (at least with this sample data) the filtering leaves a single row, which R would otherwise simplify to a vector. While I doubt your real data (with 40k rows) would reduce to one row, I recommend keeping it anyway as defensive programming.
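A small sketch of what `drop = FALSE` changes, using a toy matrix `m` that is not from the question:

```r
# Toy 3 x 2 matrix, for illustration only
m <- matrix(1:6, nrow = 3)

# Selecting a single row simplifies to a plain vector by default
as_vector <- m[2, ]

# With drop = FALSE the result stays a 1 x 2 matrix
as_matrix <- m[2, , drop = FALSE]

is.matrix(as_vector)
is.matrix(as_matrix)
dim(as_matrix)
```

Code further downstream that calls `nrow()` or indexes with `[i, j]` keeps working on the matrix form, whereas `nrow()` returns `NULL` for a plain vector.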

r2evans
1

Benchmarking a couple of new solutions along with a few already posted:

library(Rfast)
library(microbenchmark)

mymat <- matrix(sample(100, 4000, replace = TRUE), nrow = 2000)

noDup <- function(m) {
  return(!(duplicated(m) | duplicated(m, fromLast = TRUE)))
}

combounique1 <- function(m) {
  return(m[noDup(rowSort(m)),])
}

combounique2 <- function(m) {
  msum <- rowsums(m)
  return(m[noDup(rowsums(m^2) + msum + (msum - 3)*abs(m[,1] - m[,2])),])
}

combounique3 <- function(m) {
  return(m[noDup(rowsums(m + 1/m)),])
}

combounique4 <- function(m) {
  # similar to Harrison Jones, but correct
  return(m[noDup(rbind(m, m[m[,1] != m[,2], 2:1]))[1:nrow(m)],])
}

combounique5 <- function(m) {
  # similar to Ronak Shah, but maintains ordering within rows
  tmp <- t(apply(m, 1, sort))
  return(m[noDup(tmp),])
}

r2evans <- function(m) {
  tmp <- apply(m, 1, function(z) toString(sort(z)))
  return(m[ave(tmp, tmp, FUN = length) == "1",, drop = FALSE])
}

microbenchmark(mymat1 <- combounique1(mymat),
               mymat2 <- combounique2(mymat),
               mymat3 <- combounique3(mymat),
               mymat4 <- combounique4(mymat),
               mymat5 <- combounique5(mymat),
               mymat6 <- r2evans(mymat))

                          expr     min       lq      mean   median       uq      max neval
 mymat1 <- combounique1(mymat)  7129.9  7642.30  9236.841  8205.45  9467.70  28363.7   100
 mymat2 <- combounique2(mymat)   171.0   197.30   219.341   215.75   225.45    385.5   100
 mymat3 <- combounique3(mymat)   144.2   166.95   187.340   182.50   192.30    306.7   100
 mymat4 <- combounique4(mymat) 14263.1 15343.90 17938.061 16417.30 19043.30  34884.9   100
 mymat5 <- combounique5(mymat) 48230.9 50773.75 57662.463 55041.90 60968.35 193804.2   100
      mymat6 <- r2evans(mymat) 66180.3 70835.30 78642.552 77299.85 81992.60 161034.5   100

> all(sapply(list(mymat1, mymat2, mymat3, mymat4, mymat5, mymat6), FUN = identical, mymat1))
[1] TRUE

Note that combounique2 and combounique3 are only strictly correct for integer values. The idea is to use a symmetric pairing function to get a unique value for each pair of integers, and then use duplicated on that value (see https://math.stackexchange.com/questions/3162166/what-function-symmetric-and-has-unique-solution).
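As one more illustration of the symmetric-pairing idea, here is a sketch in base R that uses a triangular-number key, `hi * (hi + 1) / 2 + lo`. It is not one of the benchmarked functions; `pair_key` is a hypothetical helper, and the sketch assumes non-negative integer values small enough that the key is represented exactly as a double.

```r
# Hypothetical helper: one number per unordered pair of non-negative integers.
# For 0 <= lo <= hi, hi * (hi + 1) / 2 + lo is distinct for every (lo, hi).
pair_key <- function(m) {
  lo <- pmin(m[, 1], m[, 2])
  hi <- pmax(m[, 1], m[, 2])
  hi * (hi + 1) / 2 + lo
}

# Sample data from the question
mymat <- structure(c(3, 6, 9, 9, 1, 4, 1, 5, 9, 6, 6, 4, 1, 4), .Dim = c(7L, 2L))

key <- pair_key(mymat)
keep <- !(duplicated(key) | duplicated(key, fromLast = TRUE))
result <- mymat[keep, , drop = FALSE]
result
```

The key involves only integer arithmetic, so it does not depend on floating-point reciprocals, but it has not been timed against the functions above.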

jblood94
0

You can just use the following line of code:

mymat <- mymat[!mymat[,1] %in% mymat[,2], , drop = FALSE]

output:

mymat
#>      [,1] [,2]
#> [1,]    3    5

Created on 2021-09-24 by the reprex package (v0.3.0)
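One thing worth checking before relying on this filter: it tests whether a value in the first column appears anywhere in the second column, which is not the same as two rows containing the same set of elements. A sketch with a toy matrix (not from the question) where the two conditions differ:

```r
# Two rows with different element sets: c(1, 2) and c(3, 1)
m <- matrix(c(1, 3, 2, 1), nrow = 2)

# Row 1 is dropped because its first element, 1, appears in column 2 of row 2
filtered <- m[!m[, 1] %in% m[, 2], , drop = FALSE]
filtered  # only c(3, 1) remains, although neither row repeats the other
```

On the sample data from the question the filter happens to return the expected row, because every first-column value except 3 also occurs in the second column.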

lovalery