I have a matrix like this:

M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN", "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN", "CD4"))

The 1st and 2nd rows are duplicates, as are the 3rd and 4th, and the 5th and 6th, since each pair contains the same elements (regardless of order).

I know that the following code can do it:

Msort <- t(apply(M, 1, sort))
duplicated(Msort)

I want to get this Logical vector:

> duplicated(Msort)
[1] FALSE  TRUE FALSE  TRUE FALSE  TRUE

But if the matrix is large, say 10,000 rows and 10,000 columns, how can I deal with this situation efficiently?

Thanks.

1 Answer

I have tried to do it using the matrix directly. Please try this once:

M[duplicated(M[c("V1", "V2")]),]
#     [,1]  [,2] 
#[1,] "CD8" "CD4"
#[2,] "CD8" "DN" 
#[3,] "DN"  "CD4"
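
For the two-column case in the question, a vectorized alternative is possible (a sketch, not benchmarked at the 10,000-row scale): `pmin()` and `pmax()` compute the row-wise smaller and larger element in one pass, so the per-row `sort()` inside `apply()` is avoided entirely.

```r
M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN",  "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN",  "CD4"))

# pmin/pmax are vectorized over whole columns; pasting the
# row-wise min and max gives one order-independent key per row.
key <- paste(pmin(M[, 1], M[, 2]), pmax(M[, 1], M[, 2]))
duplicated(key)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

This reduces the problem to `duplicated()` over a single character vector. Note that, as with `sort()` in the question, the string comparison order depends on the locale.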
Zico
  • data.frame is not the best approach here, since we are talking about a 10,000x10,000 matrix. As ike has suggested, data.table is the way to go – Akbar Feb 15 '17 at 16:30
  • have a look, please: http://stackoverflow.com/questions/15690688/r-checking-for-duplicates-is-painfully-slow-even-with-mclapply – Zico Feb 15 '17 at 16:34
  • @Zico Can you say more about your method? In addition, I would like to know which rows are duplicated, not just to get the unique rows. Thanks. – BioChemoinformatics Feb 15 '17 at 16:44
  • You do realize that we are dealing with more than one column at a time here, so your example does not apply. I agree that you could use data.frame with piping to achieve fast results, but your example is outdated. See [here](http://stackoverflow.com/questions/36930063/remove-duplicate-rows-of-a-matrix-or-dataframe?rq=1) – Akbar Feb 15 '17 at 16:53
  • @Akbar sorry if I have confused you. I just wanted to give a quick example. My next query would have been: are we matching 10,000 columns at a time or some x (< 10,000) number of columns? – Zico Feb 15 '17 at 17:08
  • @BioChemoinformatics can you please confirm the number of columns you are checking for duplicates at one time? – Zico Feb 15 '17 at 17:08
  • @BioChemoinformatics you can also check this link: http://stackoverflow.com/questions/15690688/r-checking-for-duplicates-is-painfully-slow-even-with-mclapply – Zico Feb 15 '17 at 17:09
  • @Zico For example, if the number of columns is 1000, then `M[duplicated(M[paste0("V", 1:1000)]), ]`? Why does it not work if I just use the individual operation such as `M[c("V1", "V2")]`? – BioChemoinformatics Feb 15 '17 at 17:15
  • you can try this: `a <- paste0("V", 1:1000); M[duplicated(M[c(paste(a))]), ]`; but when 1000 columns are matching (for big data), I guess MapReduce would be a good fit here, but I am not so sure. And for data.table, I think you can check the links from ike – Zico Feb 15 '17 at 17:28
  • @Zico Thanks. If I want to get the logical vector `duplicated(Msort)`, i.e. `[1] FALSE TRUE FALSE TRUE FALSE TRUE`, what method should I use? – BioChemoinformatics Feb 15 '17 at 19:24
  • @BioChemoinformatics what are you trying to achieve here? – Zico Feb 16 '17 at 14:17
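
The comments above point toward data.table; a minimal sketch of that route (assuming the data.table package is installed, and reusing the row-wise sort from the question so that `duplicated()` sees order-independent rows):

```r
library(data.table)

M <- rbind(c("CD4", "CD8"),
           c("CD8", "CD4"),
           c("DN",  "CD8"),
           c("CD8", "DN"),
           c("CD4", "DN"),
           c("DN",  "CD4"))

# Sort within each row first, then let data.table's duplicated()
# compare whole rows across all columns at once.
DT <- as.data.table(t(apply(M, 1, sort)))
duplicated(DT)
# [1] FALSE  TRUE FALSE  TRUE FALSE  TRUE
```

data.table's `duplicated()` method is radix-based and typically much faster than base `duplicated()` on large tables; the remaining per-row `apply()` sort is the same cost as in the question's own approach.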