I have a matrix like this:

M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN", "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN", "CD4"))

The 1st and 2nd rows are duplicates, as are the 3rd and 4th, and the 5th and 6th, since each pair contains the same elements (regardless of order).

I know that the following code can do it:

Msort <- t(apply(M, 1, sort))
duplicated(Msort)

I want to get this Logical vector:

> duplicated(Msort)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

But if the matrix is large, say 10,000 rows and 10,000 columns, how can I deal with this situation efficiently?

Thanks.

1 Answer

I have tried to do it using the matrix directly. Please try this once:

M[duplicated(M[c("V1", "V2")]),]
#     [,1]  [,2] 
#[1,] "CD8" "CD4"
#[2,] "CD8" "DN" 
#[3,] "DN"  "CD4"
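
For the two-column case in the question, a vectorized alternative is possible (a sketch, not benchmarked at the 10,000-row scale): `pmin()` and `pmax()` compute the row-wise smaller and larger element in one pass, so the per-row `sort()` inside `apply()` is avoided entirely.

```r
M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN",  "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN",  "CD4"))

# pmin/pmax are vectorized over whole columns; pasting the
# row-wise min and max gives one order-independent key per row.
key <- paste(pmin(M[, 1], M[, 2]), pmax(M[, 1], M[, 2]))
duplicated(key)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

This reduces the problem to `duplicated()` over a single character vector. Note that, as with `sort()` in the question, the string comparison order depends on the locale.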
Zico
  • data.frame is not the best approach here, since we are talking about a 10,000x10,000 matrix. As ike has suggested, data.table is the way to go – Akbar Feb 15 '17 at 16:30
  • have a look, please: http://stackoverflow.com/questions/15690688/r-checking-for-duplicates-is-painfully-slow-even-with-mclapply – Zico Feb 15 '17 at 16:34
  • @Zico Can you say more about your method? In addition, I would like to know which rows are duplicated, not just to get the unique rows. Thanks. – BioChemoinformatics Feb 15 '17 at 16:44
  • You do realize that we are dealing with more than one column at a time here, so your example does not apply. I agree that you could use data.frame with piping to achieve fast results, but your example is outdated. See [here](http://stackoverflow.com/questions/36930063/remove-duplicate-rows-of-a-matrix-or-dataframe?rq=1) – Akbar Feb 15 '17 at 16:53
  • @Akbar sorry if I have confused you. I just wanted to give a quick example. My next query would have been: are we matching 10,000 columns at a time or some x (< 10,000) number of columns? – Zico Feb 15 '17 at 17:08
  • @BioChemoinformatics can you please confirm the number of columns you are checking for duplicates at one time? – Zico Feb 15 '17 at 17:08
  • @BioChemoinformatics you can also check this link: http://stackoverflow.com/questions/15690688/r-checking-for-duplicates-is-painfully-slow-even-with-mclapply – Zico Feb 15 '17 at 17:09
  • @Zico For example, if the number of columns is 1000, then `M[duplicated(M[paste0("V", 1:1000)]), ]`? Why does it not work if I just use the individual operation such as `M[c("V1", "V2")]`? – BioChemoinformatics Feb 15 '17 at 17:15
  • you can try this: `a <- paste0("V", 1:1000); M[duplicated(M[c(paste(a))]), ]`; but when 1000 columns are matching (for big data), I guess MapReduce would be a good fit here, but I am not so sure. And for data.table, I think you can check the links from ike – Zico Feb 15 '17 at 17:28
  • @Zico Thanks. If I want to get the logical vector `duplicated(Msort)`, i.e. `[1] FALSE TRUE FALSE TRUE FALSE TRUE`, what method should I use? – BioChemoinformatics Feb 15 '17 at 19:24
  • @BioChemoinformatics what are you trying to achieve here? – Zico Feb 16 '17 at 14:17
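
The comments above point toward data.table; a minimal sketch of that route (assuming the data.table package is installed, and reusing the row-wise sort from the question so that `duplicated()` sees order-independent rows):

```r
library(data.table)

M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN",  "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN",  "CD4"))

# Sort within each row first, then let data.table's duplicated()
# compare whole rows across all columns at once.
DT <- as.data.table(t(apply(M, 1, sort)))
duplicated(DT)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

data.table's `duplicated()` method is radix-based and typically much faster than base `duplicated()` on large tables; the remaining per-row `apply()` sort is the same cost as in the question's own approach.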