Assuming that you want this to work for an arbitrary number of sequences, a direct (though likely not the most efficient — the `others` set could be built incrementally from the previous iteration) way to solve this would be:
def deep_unique_set(*seqs):
    """Yield, for each input sequence, the items absent from every other sequence.

    Items repeated *within* a single sequence are kept (and repeated), as long
    as they do not also occur in one of the other sequences.
    """
    for idx in range(len(seqs)):
        # Everything that appears in any sequence other than the current one.
        rest = {item for j, other in enumerate(seqs) if j != idx for item in other}
        yield [item for item in seqs[idx] if item not in rest]
or the slightly faster but less memory efficient and otherwise identical:
def deep_unique_preset(*seqs):
    """Same contract as deep_unique_set, but flattens all input up-front.

    Slightly faster, at the cost of holding one flat copy of every item.
    """
    pile = [item for seq in seqs for item in seq]
    offset = 0
    for seq in seqs:
        end = offset + len(seq)
        # All items outside the current sequence's slice of the pile.
        others = set(pile[:offset]) | set(pile[end:])
        yield [item for item in seq if item not in others]
        offset = end
Testing it with the provided input:
# Exercise both variants on the input from the question.
demo = (['a', 'b', 'c'], ['a', 'potato', 'd'], ['a', 'b', 'h'])
print(list(deep_unique_set(*demo)))
# [['c'], ['potato', 'd'], ['h']]
print(list(deep_unique_preset(*demo)))
# [['c'], ['potato', 'd'], ['h']]
Note that if the input contains duplicates within one of the lists, they are not removed, i.e.:
# Duplicates inside a single list survive both set-based variants.
with_dups = (['a', 'b', 'c', 'c'], ['a', 'potato', 'd'], ['a', 'b', 'h'])
print(list(deep_unique_set(*with_dups)))
# [['c', 'c'], ['potato', 'd'], ['h']]
print(list(deep_unique_preset(*with_dups)))
# [['c', 'c'], ['potato', 'd'], ['h']]
If all duplicates should be removed, a better approach is to count the values. The method of choice for this is collections.Counter
, as proposed in @Kasramvd's answer:
def deep_unique_counter(*seqs):
    """Yield, for each sequence, only the items that occur exactly once overall.

    Unlike the set-based variants, this also drops items duplicated within
    a single sequence.
    """
    # One global tally over every item of every sequence.
    occurrences = collections.Counter(item for seq in seqs for item in seq)
    for seq in seqs:
        yield [item for item in seq if occurrences[item] == 1]
# With counting, the within-list duplicate 'c' is gone as well.
sample = (['a', 'b', 'c', 'c'], ['a', 'potato', 'd'], ['a', 'b', 'h'])
print(list(deep_unique_counter(*sample)))
# [[], ['potato', 'd'], ['h']]
Alternatively, one could keep track of repeats, e.g.:
def deep_unique_repeat(*seqs):
    """Yield, for each sequence, the items that appear exactly once across all.

    Tracks repeats explicitly instead of keeping per-item counts.
    """
    seen = set()
    repeated = set()
    for seq in seqs:
        for item in seq:
            if item in seen:
                # Second (or later) sighting anywhere marks it as repeated.
                repeated.add(item)
            else:
                seen.add(item)
    for seq in seqs:
        yield [item for item in seq if item not in repeated]
which will have the same behavior as the collections.Counter
-based approach:
# Same result as the Counter-based approach on the duplicate-bearing input.
sample = (['a', 'b', 'c', 'c'], ['a', 'potato', 'd'], ['a', 'b', 'h'])
print(list(deep_unique_repeat(*sample)))
# [[], ['potato', 'd'], ['h']]
but is slightly faster, since it does not need to keep track of unused counts.
Another, highly inefficient, approach makes use of list.count()
for counting instead of a global counter:
def deep_unique_count(*seqs):
    """Quadratic variant: re-scans the whole flattened pile for every item.

    Same output as deep_unique_counter / deep_unique_repeat, shown only to
    illustrate why list.count() in a loop is slow.
    """
    flattened = []
    for seq in seqs:
        flattened.extend(seq)
    for seq in seqs:
        yield [item for item in seq if flattened.count(item) == 1]
These last two approaches are also proposed in @AlainT.'s answer.
Some timings for these are provided below:
# Benchmark setup: m lists of n random integers drawn from a range wide
# enough (10 * n * m) that most values occur only once.
n = 100
m = 100
s = tuple([random.randint(0, 10 * n * m) for _ in range(n)] for _ in range(m))
# NOTE(review): `funcs` is presumably an iterable of the five functions
# defined above — it is not defined in this snippet; confirm before running.
# `%timeit` is IPython magic, so this loop only runs in an IPython/Jupyter
# session, not in plain Python.
for func in funcs:
    print(func.__name__)
    %timeit list(func(*s))
    print()
# deep_unique_set
# 10 loops, best of 3: 86.2 ms per loop
# deep_unique_preset
# 10 loops, best of 3: 57.3 ms per loop
# deep_unique_count
# 1 loop, best of 3: 1.76 s per loop
# deep_unique_repeat
# 1000 loops, best of 3: 1.87 ms per loop
# deep_unique_counter
# 100 loops, best of 3: 2.32 ms per loop