Identify duplicate lists in list of lists?

Question

How can I compare a list of lists with itself in python in order to:

identify identical sublists with the same items (not necessarily in the same item order)
delete these duplicate sublists

Example:

list = [ [1, 3, 5, 6], [7, 8], [10, 12], [9], [3, 1, 5, 6], [12, 10] ]

clean_list = [ [1, 3, 5, 6], [7, 8], [10, 12], [9] ]

Any help is greatly appreciated. I can't seem to figure this out.

you should avoid built-ins names for your objects like `list` — Azat Ibrakov, May 30 '17 at 14:57
Check out this question https://stackoverflow.com/questions/2213923/python-removing-duplicates-from-a-list-of-lists — Brandon Deo, May 30 '17 at 15:03
@AzatIbrakov Thanks for pointing this out, although as stated above it's only a generic example, not an extract of the original script, with obscure variable names that no one would have understood out of context. — St4rb0y, May 31 '17 at 15:58
@2ps I'm a Python beginner and didn't know where to begin to tackle the problem. I'm glad that so many people provided helpful answers that I can learn from. Thanks! — St4rb0y, May 31 '17 at 15:58

Jean-François Fabre · Accepted Answer · 2017-05-30T15:27:04.800

2

I would rebuild the "clean_list" in a list comprehension, checking that the sorted version of the sublist isn't already in the previous elements

the_list = [ [1, 3, 5, 6], [7, 8], [10, 12], [9], [3, 1, 5, 6], [12, 10] ]

clean_list = [l for i,l in enumerate(the_list) if all(sorted(l)!=sorted(the_list[j]) for j in range(0,i))]

print(clean_list)

of course, sorting the items for each iteration is time consuming, so you could prepare a sorted list of sublists:

the_sorted_list = [sorted(l) for l in the_list]

and use it:

clean_list = [the_list[i] for i,l in enumerate(the_sorted_list) if all(l!=the_sorted_list[j] for j in range(0,i))]

result (in both cases):

[[1, 3, 5, 6], [7, 8], [10, 12], [9]]

As many suggested, maybe a simple for loop (no list comprehension there) storing the already seen items in a set would be more performant for the lookup of the duplicates. That alternate solution could be necessary if the input list is really big to avoid the O(n) lookup of all.

An example of implementation could be:

test_set = set()
clean_list = []

for l in the_list:
    sl = sorted(l)
    tsl = tuple(sl)
    if not tsl in test_set:
        test_set.add(tsl)  # note it down to avoid inserting it next time
        clean_list.append(sl)

edited May 30 '17 at 15:27

answered May 30 '17 at 15:02

Jean-François Fabre

137,073
23
153
219

can someone explain the downvote? (apart from writing some code for the user) – Jean-François Fabre May 30 '17 at 15:03
1

Not my down vote, but isn't this a case where a humble `for` loop would be better, checking for the `tuple(sorted(sublist))` in a `set()` of previously seen items? – Chris_Rands May 30 '17 at 15:08
1

SO Things... Sometimes you do your best to help people and a couple of haters downvote your answer anyways. – Gabriel Belini May 30 '17 at 15:14
@Chris_Rands good point. At least this works :) I don't want to steal it from you, I think you could answer this. – Jean-François Fabre May 30 '17 at 15:15
1

@gabrielbelini in that case I think the guy downvoted the answer _and_ the question. – Jean-François Fabre May 30 '17 at 15:15
@Jean-FrançoisFabre Thank you, your answer was very helpful. Don't mind the haters. No down vote deserved here!!! – St4rb0y May 31 '17 at 16:01

score 0 · Answer 2 · answered May 30 '17 at 15:10

0

Create a set. Then for each list in the list, sort it, transform into tuple, then insert into set.

setOfLists = set()
for list in listOfLists:
    list.sort()
    setOfLists.add(tuple(list))

print setOfLists

You can retransform the tuples in the set into lists again.

answered May 30 '17 at 15:10

Eric

730
4
16

the only issue here: the `setOfLists` won't preserve the order of the sublists. otherwise, OK answer (same as what Chris Rands suggested, except that Chris suggested to keep the set as a marker, but still store the sublists in a list to preserve order) – Jean-François Fabre May 30 '17 at 15:17
also `list.sort() setOfLists.add(tuple(list))` => `setOfLists.add(tuple(sorted(list)))` – Jean-François Fabre May 30 '17 at 15:18
also: sorting the sublists may not be what the OP wants. Using an aux set with sorted items would be a good solution. – Jean-François Fabre May 30 '17 at 15:24

score 0 · Answer 3 · answered May 30 '17 at 15:29

0

Simple for loops will work, but if your dataset is small, e.g. 1k or less, you can use this :

b = []
[b.append(i) for i in a if len([j for j in b if set(j) == set(i)])==0 ]

print b

answered May 30 '17 at 15:29

WoooHaaaa

19,732
32
90
138

score 0 · Answer 4 · answered May 30 '17 at 16:11

So heres my take on this.

I def a function that sorts each sublist and appends to a temp list. then I check if the sublist in temp_my_list is 'not' in temp_clean_list and if not then append to new list. this should work for any 2 sets of list. I added some extra list to show some kind of result other than an empty string.

my_list = [[1, 3, 5, 6], [7, 8], [10, 12], [9], [3, 1, 5, 6], [12, 10],[16]]
clean_list = [ [1, 3, 5, 6], [7, 8], [10, 12], [9],[18]]
new_list = []

def getNewList():
    temp_my_list = []
    temp_clean_list = []
    for sublist in my_list:
        sublist.sort()
        temp_my_list.append(msublist)
    for sublist in clean_list:
        sublist.sort()
        temp_clean_list.append(sublist)
    for sublist in temp_my_list:
        if sublist not in temp_clean_list:
            new_list.append(sublist)

getNewList()
print (new_list)

Resulit:

[[16]]

Identify duplicate lists in list of lists?

4 Answers4