Quickest way to dedupe list in dict

Question

I have a dict containing lists and need a fast way to dedupe the lists.

I know how to dedupe a list in isolation using the set() function, but in this case I want a fast way of iterating through the dict, deduping each list on the way.

hello = {'test1':[2,3,4,2,2,5,6], 'test2':[5,5,8,4,3,3,8,9]}

I'd like it to appear like;

hello = {'test1':[2,3,4,5,6], 'test2':[5,8,4,3,9]}

Though I don't necessarily need to have the original order of the lists preserved.

I've tried using a set like this, but it's not quite correct (it's not iterating properly and I'm losing the first key)

for key, value in hello.items(): goodbye = {key: set(value)}
>>> goodbye
{'test2': set([8, 9, 3, 4, 5])}

EDIT: Following PM 2Ring's comment below, I'm now populating the dict differently to avoid duplicates in the first place. Previously I was using lists, but using sets prevents dupes to be appended by default;

>>> my_numbers = {}
>>> my_numbers['first'] = [1,2,2,2,6,5]
>>> from collections import defaultdict
>>> final_list = defaultdict(set)
>>> for n in my_numbers['first']: final_list['test_first'].add(n)
... 
>>> final_list['test_first']
set([1, 2, 5, 6])

As you can see, the final output is a deduped set, as required.

Not interested in preserving order, and also my lists are part of a dict. — John Honan, Jul 17 '15 at 15:33
If you don't need to preserve order, then go for a set-based approach: it's more efficient than list-based approaches, but you may not notice much of a speed difference if your lists are small or if they have few dupes. And consider storing sets rather than lists as your dict's values. OTOH, for _very_ small lists, the set-based way may be _slower_. Also, sets use a little more memory than lists. — PM 2Ring, Jul 17 '15 at 15:39
@PM2Ring I went with using sets in the end, so avoided dupes in the first place. defaultdict(set) and .add(n) to append to the set did the trick. Q updated. — John Honan, Jul 18 '15 at 22:08

score 5 · Answer 1 · edited May 23 '17 at 10:26

5

You can use a list comprehension with a deduplicate function that preserves the order:

def deduplicate(seq):
    seen = set()
    seen_add = seen.add
    return [ x for x in seq if not (x in seen or seen_add(x))]

{key: deduplicate(value) for key, value in hello.items()}

edited May 23 '17 at 10:26

Community

1
1

answered Jul 17 '15 at 15:23

enrico.bacis

30,497
10
86
115

Just as a note: dict comprehensions don't work with older Python versions like 2.7. – Daniel Jul 17 '15 at 15:25
1

@Daniel it works in Python 2.7 , it does not work in any version older than that. – Anand S Kumar Jul 17 '15 at 15:25
@AnandSKumar You're right. I messed that up. – Daniel Jul 17 '15 at 15:29
I'm on 2.7.10 and this approach works great. Thanks.The main advantage over assigning the values as set() (in the other answer) is that this keeps them as a list, and also retains the original order, is that correct? – John Honan Jul 17 '15 at 15:29

score 5 · Accepted Answer · edited Jul 17 '15 at 16:13

5

It's not iterating wrong, you're just assigning goodbye as a new dict each time. You need to assign as an empty dict then assign the values to keys in each iteration.

goodbye = {}
for key, value in hello.items(): goodbye[key] = set(value)
>>> goodbye
{'test1': set([2, 3, 4, 5, 6]), 'test2': set([8, 9, 3, 4, 5])}

Also since sets don't preserve order, if you do want to preserve order it's best to make a simple iterating function that will return a new list that skips over already added values.

def uniqueList(li):
    newList = []
    for x in li:
        if x not in newList:
            newList.append(x)
    return newList


goodbye = {}
for key, value in hello.items(): goodbye[key] = uniqueList(value)
>>> goodbye
{'test1': [2, 3, 4, 5, 6], 'test2': [5, 8, 4, 3, 9]}

edited Jul 17 '15 at 16:13

Khez

10,172
2
31
51

answered Jul 17 '15 at 15:23

SuperBiasedMan

9,814
10
45
73

1

I guess that's one of the reasons OP said that his solution is "not quite correct", and order is preserved in the lists (!) in the expected output. – tobias_k Jul 17 '15 at 15:27
OP here. Actually, preserving the order doesn't matter. So this approach works fine for my purposes as well. My solution wasn't quite correct because I was assigning a new value instead of appending, leaving myself with only one key. – John Honan Jul 17 '15 at 15:31
Added a second order preserving solution anyway, for the sake of completeness. – SuperBiasedMan Jul 17 '15 at 15:35

score 3 · Answer 3 · answered Jul 17 '15 at 15:38

3

>>>hello = {'test1':[2,3,4,2,2,5,6], 'test2':[5,5,8,4,3,3,8,9]}    
>>>for key,value in hello.iteritems():
       hello[key] = list(set(value))
>>>hello
{'test1': [2, 3, 4, 5, 6], 'test2': [8, 9, 3, 4, 5]}

answered Jul 17 '15 at 15:38

Kousik

21,485
7
36
59

1

Ah cool, that turns the 'set' back into a list. – John Honan Jul 17 '15 at 15:53

score 0 · Answer 4 · answered Jul 17 '15 at 15:28

This is a more verbose way of doing it, which preserves order and works in all Python versions:

for key in hello:
    s = set()
    l = []
    for subval in hello[key]:
        if subval not in s:
            l.append(subval)
            s.add(subval)
    hello[key] = l

score 0 · Answer 5 · answered Jul 17 '15 at 15:29

0

my_list = [1,2,2,2,3,4,5,6,7,7,7,7,7,8,9,10]
seen = set()
print list(filter(lambda x:x not in seen and not seen.add(x),my_list))

answered Jul 17 '15 at 15:29

Joran Beasley

110,522
12
160
179

Quickest way to dedupe list in dict

5 Answers5