Find duplicates between multiple list elements

Question

I have a list consisting of 800 elements. Each element is a character vector.

I want to go through this list and identify duplicate elements, of which there are many.

Is there a way of doing this?

For example:

mylist[[1]] = c('aaab','aaab','aaab', 'abcd')
mylist[[2]] = c('defg','defg','defg','abcd')
...
mylist[[80]] = c('ghgh','ghgh','ghgh','abcd')

In this case, I want to find that the 1st, 2nd and 80th elements share a duplicate entry ('abcd').

Does [this](https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html) help? — Nitish, Mar 11 '14 at 14:28
It is difficult to say more without seeing the data and knowing what output you would prefer, but you could try `lapply(list, duplicated)` — TWL, Mar 11 '14 at 14:30
unfortunately 'unique' and 'duplicated' don't work with this — user2846211, Mar 11 '14 at 14:32
Could you provide a sample dataset: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example? — TWL, Mar 11 '14 at 14:35
Still not totally clear. Are you only concerned about a duplicated item across different list indices? Or does "defg" also count as a duplicate because it appears three times in `mylist[[2]]`? What are you hoping to have returned? The position (index) where there are duplicated values? The value that is duplicated? — A5C1D2H2I1M1N2O1R2T1, Mar 11 '14 at 16:37

score 2 · Answer 1 · answered Mar 11 '14 at 16:52

I still find this question under-developed, but you can probably get towards where you're trying to be with a combination of stack + table + colSums or rowSums, depending on your need(s).

Some sample data:

mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"), 
               c("ghgh", "ghgh", "ghgh", "abcd"), 
               c("aaaa", "aaaa", "aaaa", "aaaa"))

stack puts this into a long data.frame with two columns, "values" and "ind". "ind" corresponds to the list index number, while "values" refers to the... value.

X <- stack(setNames(mylist, seq_along(mylist)))
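
As a quick check of what this step produces, here is a minimal sketch that rebuilds X from the sample data above and inspects its shape. It uses only base R; nothing beyond the answer's own objects is assumed.

```r
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"),
               c("aaaa", "aaaa", "aaaa", "aaaa"))

# Name the elements "1".."4" so stack() can record where each value came from.
X <- stack(setNames(mylist, seq_along(mylist)))

names(X)
# [1] "values" "ind"
nrow(X)
# [1] 16
levels(X$ind)
# [1] "1" "2" "3" "4"
```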

Using table gives us the frequency of each term by "ind".

table(X)
#       ind
# values 1 2 3 4
#   aaaa 0 0 0 4
#   aaab 3 0 0 0
#   abcd 1 1 1 0
#   defg 0 3 0 0
#   ghgh 0 0 3 0

colSums tells us how many distinct values each list item contains; a count above 1 means the item holds more than one distinct value.

colSums(table(X) > 0)
# 1 2 3 4 
# 2 2 2 1 
which(colSums(table(X) > 0) > 1)
# 1 2 3 
# 1 2 3
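
If the goal is instead to flag list items that repeat a value within themselves, one possible variation is to compare the counts against 1 rather than 0. This is a sketch on the same sample data, not part of the original answer; the names tab and within_dups are introduced here for illustration.

```r
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"),
               c("aaaa", "aaaa", "aaaa", "aaaa"))
X <- stack(setNames(mylist, seq_along(mylist)))
tab <- table(X)

# tab > 1 is TRUE where a value occurs more than once inside one list item,
# so the column sums count the repeated values per list item.
within_dups <- colSums(tab > 1)
within_dups
# 1 2 3 4
# 1 1 1 1

# List items that contain at least one repeated value.
names(which(within_dups > 0))
# [1] "1" "2" "3" "4"
```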

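rowSums tells us how many list items each value appears in; a count above 1 means the value is shared among list items. The last line then looks up which list items contain a given shared value.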

rowSums(table(X) > 0)
# aaaa aaab abcd defg ghgh 
#    1    1    3    1    1
which(rowSums(table(X) > 0) > 1)
# abcd 
#    3 
names(which(table(X)["abcd", ] >= 1))
# [1] "1" "2" "3"
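
The lookup for "abcd" can be generalised to every shared value at once. The following is a minimal sketch in base R on the same sample data; the names tab, shared and where are introduced here for illustration.

```r
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"),
               c("aaaa", "aaaa", "aaaa", "aaaa"))
X <- stack(setNames(mylist, seq_along(mylist)))
tab <- table(X)

# Values that appear in more than one list item.
shared <- names(which(rowSums(tab > 0) > 1))

# For each shared value, the list items (by index name) that contain it.
where <- lapply(setNames(shared, shared),
                function(v) names(which(tab[v, ] > 0)))
where
# $abcd
# [1] "1" "2" "3"
```

The indices come back as character strings because they are taken from the table's column names; as.integer() converts them if numeric positions are needed.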

Shouldn't that be `colSums(table(X) > 1)` to indicate duplicates? (And by the way, thanks for this clever answer, exactly what I was looking for :-) ) — Cath, Nov 24 '17 at 12:19

score 1 · Answer 2 (accepted) · James · 2014-03-11T15:02:39.040

Perhaps overkill, but the tm package provides support for tasks like this:

library(tm)
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"))

m <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(mylist))))
m[which(rowSums(!!m) > 1), , drop = FALSE]
#       Docs
# Terms  1 2 3
#   abcd 1 1 1
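
To see what the filtering line does without installing tm, here is a small sketch that applies it to a hand-built count matrix. The matrix m below is a stand-in typed in by hand to mimic a term-by-document matrix for the sample data; it is an assumption for illustration, not output produced by tm.

```r
# Hand-built stand-in for a term-document count matrix (rows: terms, cols: docs).
m <- matrix(c(3, 0, 0,
              1, 1, 1,
              0, 3, 0,
              0, 0, 3),
            nrow = 4, byrow = TRUE,
            dimnames = list(Terms = c("aaab", "abcd", "defg", "ghgh"),
                            Docs  = c("1", "2", "3")))

# !!m turns counts into TRUE/FALSE (present/absent), so rowSums(!!m)
# is the number of documents each term occurs in.
rowSums(!!m)
# aaab abcd defg ghgh
#    1    3    1    1

# Keep only the terms found in more than one document.
shared <- m[which(rowSums(!!m) > 1), , drop = FALSE]
shared
```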

edited Mar 11 '14 at 15:02

answered Mar 11 '14 at 14:45


score 0 · Answer 3 · answered Mar 11 '14 at 15:24

It seems like all your list elements have the same size. If that's the case you could make it a matrix:

mylist <- list(c('aaab','aaab','aaab','abcd'),
               c('defg','defg','defg','abcd'),
               c('defg','defg','defg','ghij'),
               c('ghgh','ghgh','ghgh','abcd'))

mat <- do.call(rbind, mylist)

apply(mat, 2, duplicated)
#      [,1]  [,2]  [,3]  [,4]
#[1,] FALSE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE  TRUE
#[3,]  TRUE  TRUE  TRUE FALSE
#[4,] FALSE FALSE FALSE  TRUE
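
Note that this compares values position by position: a TRUE in row i, column j means the value at position j of list element i already appeared at position j of an earlier element. As a sketch of how the flagged cells might be turned into a readable summary (the names dup, pos and the length check are additions for illustration, not part of the answer above):

```r
mylist <- list(c('aaab','aaab','aaab','abcd'),
               c('defg','defg','defg','abcd'),
               c('defg','defg','defg','ghij'),
               c('ghgh','ghgh','ghgh','abcd'))

# rbind() recycles shorter vectors, so confirm all lengths match first.
stopifnot(length(unique(lengths(mylist))) == 1)

mat <- do.call(rbind, mylist)
dup <- apply(mat, 2, duplicated)

# Row/column coordinates of every flagged cell, plus the value found there.
pos <- which(dup, arr.ind = TRUE)
data.frame(element  = pos[, "row"],
           position = pos[, "col"],
           value    = mat[pos])
```

Because duplicated marks only the second and later occurrences, the first element holding each repeated value is not listed.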
