
I have a list consisting of 800 elements. Each element is a character vector.

I want to go through this list and identify duplicate elements, of which there are many.

Is there a way of doing this?

For example:

mylist[[1]] = c('aaab','aaab','aaab', 'abcd')
mylist[[2]] = c('defg','defg','defg','abcd')
...
mylist[[80]] = c('ghgh','ghgh','ghgh','abcd')

In this case I want to find that the 1st, 2nd and 80th elements share a duplicate entry ('abcd').

user2846211
  • Does [this](https://stat.ethz.ch/R-manual/R-devel/library/base/html/unique.html) help? – Nitish Mar 11 '14 at 14:28
  • It's difficult to say more without seeing the data and knowing what output you would prefer, but you could try `lapply(list, duplicated)` – TWL Mar 11 '14 at 14:30
  • unfortunately 'unique' and 'duplicated' don't work with this – user2846211 Mar 11 '14 at 14:32
  • Could you provide a sample dataset: http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example? – TWL Mar 11 '14 at 14:35
  • I've added an example of what I mean above - thanks!! – user2846211 Mar 11 '14 at 14:42
  • Still not totally clear. Are you only concerned about a duplicated item across different list indices? Or does "defg" also count as a duplicate because it appears three times in `mylist[[2]]`? What are you hoping to have returned? The position (index) where there are duplicated values? The value that is duplicated? – A5C1D2H2I1M1N2O1R2T1 Mar 11 '14 at 16:37

3 Answers


I still find this question under-developed, but you can probably get close to what you want with a combination of `stack` + `table` + `colSums` or `rowSums`, depending on your needs.

Some sample data:

mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"), 
               c("ghgh", "ghgh", "ghgh", "abcd"), 
               c("aaaa", "aaaa", "aaaa", "aaaa"))

`stack` puts this into a long data.frame with two columns, "values" and "ind". "ind" corresponds to the list index number, while "values" holds the individual strings.

X <- stack(setNames(mylist, seq_along(mylist)))

Using `table` gives us the frequency of each value by "ind".

table(X)
#       ind
# values 1 2 3 4
#   aaaa 0 0 0 4
#   aaab 3 0 0 0
#   abcd 1 1 1 0
#   defg 0 3 0 0
#   ghgh 0 0 3 0

`colSums(table(X) > 0)` counts the distinct values in each list item, so the `which` call picks out the list items that hold more than one distinct value.

colSums(table(X) > 0)
# 1 2 3 4 
# 2 2 2 1 
which(colSums(table(X) > 0) > 1)
# 1 2 3 
# 1 2 3 
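
Note that `table(X) > 0` only marks presence. If the goal is to flag list items that contain the same value more than once, a minimal sketch (not part of the original answer) is to test for counts above 1 instead. The names `within_dups` and `has_dup` are made up for illustration, and `anyDuplicated` is used as an independent cross-check.

```r
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"),
               c("aaaa", "aaaa", "aaaa", "aaaa"))
X <- stack(setNames(mylist, seq_along(mylist)))

# For each list item, the number of values that occur more than once in it
within_dups <- colSums(table(X) > 1)

# Cross-check without the table: anyDuplicated() returns 0 when nothing repeats
has_dup <- vapply(mylist, function(x) anyDuplicated(x) > 0, logical(1))
```

With this sample data every list item repeats exactly one value, so `within_dups` is 1 for all four items and `has_dup` is `TRUE` throughout.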

`rowSums` tells us which values are shared across list items, that is, which values appear in more than one of them.

rowSums(table(X) > 0)
# aaaa aaab abcd defg ghgh 
#    1    1    3    1    1
which(rowSums(table(X) > 0) > 1)
# abcd 
#    3 
names(which(table(X)["abcd", ] >= 1))
# [1] "1" "2" "3"
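
The same cross-item lookup can probably be done without `stack`. The following is only a sketch under the assumption that a value should count once per list item; `counts`, `shared` and `where` are hypothetical names, not part of the answer above.

```r
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
               c("defg", "defg", "defg", "abcd"),
               c("ghgh", "ghgh", "ghgh", "abcd"),
               c("aaaa", "aaaa", "aaaa", "aaaa"))

# Count each value once per list item, then tabulate across items
counts <- table(unlist(lapply(mylist, unique)))
shared <- names(counts)[as.vector(counts > 1)]

# For each shared value, the indices of the list items that contain it
where <- lapply(setNames(shared, shared), function(v) {
  which(vapply(mylist, function(x) v %in% x, logical(1)))
})
```

Here `shared` is `"abcd"` and `where$abcd` holds the indices 1, 2 and 3. Calling `unique` first is what keeps the three copies of "aaab" inside one item from being mistaken for a value shared between items.
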
A5C1D2H2I1M1N2O1R2T1
  • Isn't that `colSums(table(X) > 1)` to have duplicate indication? (and btw thanks for this clever answer, exactly what I was looking for :-) ) – Cath Nov 24 '17 at 12:19

Perhaps overkill, but the `tm` package provides support for tasks like this:

library(tm)
mylist <- list(c("aaab", "aaab", "aaab", "abcd"),
          c("defg", "defg", "defg", "abcd"), c("ghgh", "ghgh", "ghgh", "abcd"))

m <- as.matrix(TermDocumentMatrix(Corpus(VectorSource(mylist))))
m[which(rowSums(!!m)>1),,drop=FALSE]
#       Docs
# Terms  1 2 3
#   abcd 1 1 1
James

It seems like all your list elements have the same size. If that's the case you could make it a matrix:

mylist <- list(c('aaab','aaab','aaab','abcd'),
               c('defg','defg','defg','abcd'),
               c('defg','defg','defg','ghij'),
               c('ghgh','ghgh','ghgh','abcd'))

mat <- do.call(rbind, mylist)

apply(mat, 2, duplicated)
#      [,1]  [,2]  [,3]  [,4]
#[1,] FALSE FALSE FALSE FALSE
#[2,] FALSE FALSE FALSE  TRUE
#[3,]  TRUE  TRUE  TRUE FALSE
#[4,] FALSE FALSE FALSE  TRUE
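
To turn the logical matrix into something easier to read, one option is `which(..., arr.ind = TRUE)`. This is a sketch rather than part of the answer above, and `dup` and `hits` are illustrative names. Keep in mind that `duplicated` marks only the second and later occurrences, and that applying it by column compares values at the same position in each vector only.

```r
mylist <- list(c('aaab','aaab','aaab','abcd'),
               c('defg','defg','defg','abcd'),
               c('defg','defg','defg','ghij'),
               c('ghgh','ghgh','ghgh','abcd'))
mat <- do.call(rbind, mylist)
dup <- apply(mat, 2, duplicated)

# row = list index, col = position within the character vector
hits <- which(dup, arr.ind = TRUE)

# Pair each flagged cell with the value found there
data.frame(hits, value = mat[hits])
```

For this data the flagged cells are row 3 in columns 1 to 3 ("defg") and rows 2 and 4 in column 4 ("abcd").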
Roland