How to efficiently build a list of dicts without duplicates?

Question

In the code below, the 3rd and 4th elements are considered the same, because 'start' and 'end' are just switched:

{'start': '222',  'end': '333', 'type':'c'},
{'start': '333',  'end': '222', 'type':'c'}

I need to build a relations list or set that doesn't contain duplicates like the pair above. Suppose the input comes from list_of_dicts; this is the code I use to achieve that:

relations = []
list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

duplicate_keys = set()
for my_dict in list_of_dicts:
    duplicate_key = ''.join(sorted(my_dict['start'] + my_dict['end'] + my_dict['type']))
    if duplicate_key not in duplicate_keys:
        relations.append(my_dict)
        duplicate_keys.add(duplicate_key)

print(relations)

This seems to work. My list_of_dicts is expected to be large, for example 100 million entries. Is this the fastest way to do it? Also, list_of_dicts here is only for illustration; the real 'relations' list is built from similar input.
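One caveat about the key built in the loop above: `sorted()` applied to the concatenated string sorts individual characters, so two different relations whose fields happen to contain the same characters get the same key. Below is a minimal sketch of that collision and of a tuple-based key that avoids it. The helper names `char_sorted_key` and `canonical_key` are hypothetical, and the sketch assumes that 'type' must match for two dicts to count as duplicates.

```python
def char_sorted_key(d):
    # The key from the question: sorts the individual characters.
    return ''.join(sorted(d['start'] + d['end'] + d['type']))


def canonical_key(d):
    # Order start/end as whole values and keep type as its own field.
    low, high = sorted((d['start'], d['end']))
    return (low, high, d['type'])


a = {'start': '12', 'end': '34', 'type': 'c'}
b = {'start': '13', 'end': '24', 'type': 'c'}  # a different relation

print(char_sorted_key(a) == char_sorted_key(b))  # True: a false duplicate
print(canonical_key(a) == canonical_key(b))      # False
```

Tuples of strings are hashable, so `canonical_key(d)` can go straight into a set in place of the joined string.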

If you want to be able to add them to sets, or use them as dict keys, see [`frozendict`](https://pypi.org/project/frozendict/). — Charles Duffy, Nov 09 '21 at 23:24
No need to sort. Just loop through and add both permutations as dict keys. — user2263572, Nov 09 '21 at 23:24
@user2263572, ...eh? No. They're _the same_. The ordering is not part of the dict itself. `{'a': 1, 'b': 2} == {'b': 2, 'a': 1}` is True. There's no reason whatsoever to add them as two separate items; and adding permutations is going to greatly increase memory usage. — Charles Duffy, Nov 09 '21 at 23:25
@CharlesDuffy He wants to treat `start: 1, end: 2` as equivalent to `start: 2, end: 1`. It has nothing to do with dictionary ordering. — Barmar, Nov 09 '21 at 23:26
_Ahhh_. In that case we can just sort the values and ignore the keys (for which ordering is independent). Make both `start: 1, end:2` and `start: 2, end: 1` evaluate to `lower: 1, higher: 2` and you're done. — Charles Duffy, Nov 09 '21 at 23:27
@CharlesDuffy But only start and end are considered, not type. — Barmar, Nov 09 '21 at 23:27
@Barmar, ...right, hence the "for which ordering is independent" caveat. — Charles Duffy, Nov 09 '21 at 23:28
Also, the list_of_dicts don't exist. Each dict in 'list_of_dicts' are produced in a loop and add to the relations list/ — marlon, Nov 09 '21 at 23:30

score 2 · Accepted Answer · answered Nov 10 '21 at 00:00

I think it is better to transform those dicts into a specialized class, add a hash function to that class, and let a set, a dict, or a similar container take care of the duplicates:

>>> class MyObject:
        def __init__(self,start,end,type):
            self.data = (*sorted((start,end)),type)
        def __hash__(self):
            return hash(self.data)
        def __repr__(self):
            return f"{self.__class__.__name__}{self.data}"
        def __eq__(self,other):
            if isinstance(other,self.__class__):
                return self.data == other.data
            return False

    
>>> list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]
>>> new=[MyObject(**x) for x in list_of_dicts]
>>> new
[MyObject('123', '456', 'a'), MyObject('111', '122', 'b'), MyObject('222', '333', 'c'), MyObject('222', '333', 'c')]
>>> set(new)
{MyObject('123', '456', 'a'), MyObject('222', '333', 'c'), MyObject('111', '122', 'b')}
>>>
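The same idea can probably be written more compactly with a named tuple, which provides `__hash__`, `__eq__`, and `__repr__` without extra code. This is an alternative sketch, not the answerer's code; the class name `Relation`, its field names, and the `from_dict` constructor are assumptions made for illustration.

```python
from typing import NamedTuple


class Relation(NamedTuple):
    low: str
    high: str
    type: str

    @classmethod
    def from_dict(cls, d):
        # Store start/end in sorted order so swapped pairs compare equal.
        low, high = sorted((d['start'], d['end']))
        return cls(low, high, d['type'])


list_of_dicts = [{'start': '123', 'end': '456', 'type': 'a'},
                 {'start': '111', 'end': '122', 'type': 'b'},
                 {'start': '222', 'end': '333', 'type': 'c'},
                 {'start': '333', 'end': '222', 'type': 'c'},
                 ]

unique = {Relation.from_dict(d) for d in list_of_dicts}
print(len(unique))  # 3
```

Note that, as with `set(new)` above, a set does not preserve the input order.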

A.N. · Answer 2 · 2021-11-10T00:25:20.933

score 0

Try using a dict with tuples as keys:

relations = {}
list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

for my_dict in list_of_dicts:
    # sorting the values makes swapped start/end produce the same key
    k = tuple(sorted(my_dict.values()))

    # keep only the first dict seen for each key
    if k not in relations:
        relations[k] = my_dict

print(list(relations.values()))

And a one-liner with slightly different behavior:

list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]

relations = list({tuple(sorted(my_dict.values())): my_dict for my_dict in list_of_dicts}.values())

print(relations)

In this one-liner, the last dict seen for each key is the one that is kept. It is the full version without the `if k not in relations:` condition, so later duplicates overwrite earlier ones.

Result of the one-liner:

[{'start': '123', 'end': '456', 'type': 'a'}, {'start': '111', 'end': '122', 'type': 'b'}, {'start': '333', 'end': '222', 'type': 'c'}]
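The difference between the two variants can be sketched side by side. This is an illustrative comparison under the same sample data; the helper name `key` and the variable names `first_wins` and `last_wins` are hypothetical.

```python
list_of_dicts = [{'start': '123', 'end': '456', 'type': 'a'},
                 {'start': '111', 'end': '122', 'type': 'b'},
                 {'start': '222', 'end': '333', 'type': 'c'},
                 {'start': '333', 'end': '222', 'type': 'c'},
                 ]


def key(d):
    # Same key as in the answer: all values, sorted.
    return tuple(sorted(d.values()))


# Loop version: the first dict seen for a key is kept.
first_wins = {}
for d in list_of_dicts:
    first_wins.setdefault(key(d), d)

# Comprehension version: later dicts overwrite earlier ones.
last_wins = {key(d): d for d in list_of_dicts}

print(list(first_wins.values())[-1])  # {'start': '222', 'end': '333', 'type': 'c'}
print(list(last_wins.values())[-1])   # {'start': '333', 'end': '222', 'type': 'c'}
```

Both variants return three dicts; they differ only in which member of a duplicate pair survives.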

edited Nov 10 '21 at 00:25

answered Nov 09 '21 at 23:26

A.N.

What are d1 and d2? There's just a list of dictionaries. – Barmar Nov 09 '21 at 23:29
How does this treat the dictionaries with `start: x, end: y` as equivalent to `start: y, end: x` and remove the duplicates? – Barmar Nov 09 '21 at 23:30
d1, d2, ..., d_n - are your dicts. – A.N. Nov 09 '21 at 23:33
Can you show how this works with a list of dicts? I still don't see how it swaps `start` and `end` when comparing. – Barmar Nov 09 '21 at 23:35
It remove duplicates, like `start: x` and make one dictionary without duplicated key/value pairs. – A.N. Nov 09 '21 at 23:36
Can you test your code and show the results? – Barmar Nov 09 '21 at 23:37
I think you don't really understand the question. – Barmar Nov 09 '21 at 23:39
Look, if you have two dicts (for the list it's similar), you'll have: `{'type': 'c', 'end': '333', 'start': '222'}` – A.N. Nov 09 '21 at 23:42
The result is supposed to be a new list of dictionaries. You're just creating one dictionary. – Barmar Nov 09 '21 at 23:44
Ah, yes, sorry, I didn't understand, what do you need as a result. – A.N. Nov 09 '21 at 23:44
I don't need anything, I didn't write the question. – Barmar Nov 09 '21 at 23:45
The result is supposed to be a list of dictionaries created from the original list, but without pairs of dictionaries that have the same start/end values in either order. – Barmar Nov 09 '21 at 23:46
Corrected. Thank you for the comments, Barmar. – A.N. Nov 10 '21 at 00:07

Charles Duffy · Answer 3 · 2021-11-09T23:37:45.463

To generate a sorting key that's order-independent only for specific, named keys (as initially defined by the sortable_keys default parameter):

def order_independent_key(input_dict, sortable_keys=('start', 'end')):
    local_dict = dict(input_dict)  # copy, so the caller's dict is not modified
    sortable_vals = []
    for key in sortable_keys:
        sortable_vals.append(local_dict.pop(key))
    # sort the named values so that swapped start/end produce the same key
    return tuple(sorted(sortable_vals)) + tuple(sorted(local_dict.items()))

Thus:

list_of_dicts = [{'start': '123',  'end': '456', 'type':'a'},
                  {'start': '111',  'end': '122', 'type':'b'},
                  {'start': '222',  'end': '333', 'type':'c'},
                  {'start': '333',  'end': '222', 'type':'c'},
                  ]
deduplicated_dicts = {}
for item in list_of_dicts:
    deduplicated_dicts[order_independent_key(item)] = item

...will generate a structure like:

{('111', '122', ('type', 'b')): {'end': '122', 'start': '111', 'type': 'b'},
 ('123', '456', ('type', 'a')): {'end': '456', 'start': '123', 'type': 'a'},
 ('222', '333', ('type', 'c')): {'end': '222', 'start': '333', 'type': 'c'}}

...for which doing a lookup by order_independent_key(another_dict) is an operation with performance characteristics that scale only with the number of keys in another_dict.
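A minimal sketch of such a lookup, assuming the key function sorts the values of the named keys so that swapped start/end values map to the same key. The dicts `another_dict` and `unrelated_dict` are made-up examples.

```python
def order_independent_key(input_dict, sortable_keys=('start', 'end')):
    local_dict = dict(input_dict)  # copy, so the caller's dict is untouched
    sortable_vals = [local_dict.pop(key) for key in sortable_keys]
    # Sorted named values first, then the remaining items in key order.
    return tuple(sorted(sortable_vals)) + tuple(sorted(local_dict.items()))


list_of_dicts = [{'start': '123', 'end': '456', 'type': 'a'},
                 {'start': '111', 'end': '122', 'type': 'b'},
                 {'start': '222', 'end': '333', 'type': 'c'},
                 {'start': '333', 'end': '222', 'type': 'c'},
                 ]

deduplicated_dicts = {}
for item in list_of_dicts:
    deduplicated_dicts[order_independent_key(item)] = item

# start/end are swapped relative to the first entry, type is the same.
another_dict = {'start': '456', 'end': '123', 'type': 'a'}
# Same start/end as the first entry, but a different type.
unrelated_dict = {'start': '123', 'end': '456', 'type': 'z'}

print(order_independent_key(another_dict) in deduplicated_dicts)    # True
print(order_independent_key(unrelated_dict) in deduplicated_dicts)  # False
```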

user2263572 · Answer 4 · 2021-11-09T23:45:46.413

score 0

Something like this would be quite efficient for looping through a list of dicts and determining which are dupes: anything considered "seen" is added to a set. It requires a single pass over the dicts, and checking whether a key has already been seen is a constant-time lookup.

seen = set()

for my_dict in list_of_dicts:
    if (my_dict['start'], my_dict['end']) in seen:
        # dupe
        continue
    # not dupe
    seen.add((my_dict['start'], my_dict['end']))
    seen.add((my_dict['end'], my_dict['start']))
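A possible variant of the loop above stores one canonical key per relation instead of both permutations, which should keep the `seen` set at one entry per unique relation. It is written as a generator so that dicts produced one at a time in a loop can be filtered as they arrive. This is a sketch under assumptions: the names `unique_relations` and `produce` are hypothetical, `produce` merely stands in for the real input loop, and 'type' is assumed to be part of what makes two relations equal.

```python
def unique_relations(dicts):
    """Yield each relation the first time it is seen, in input order."""
    seen = set()
    for d in dicts:
        # One canonical key: start/end in sorted order, plus the type.
        low, high = sorted((d['start'], d['end']))
        key = (low, high, d['type'])
        if key in seen:
            continue  # dupe
        seen.add(key)
        yield d


def produce():
    # Stand-in for the loop that produces the dicts one by one.
    yield {'start': '123', 'end': '456', 'type': 'a'}
    yield {'start': '111', 'end': '122', 'type': 'b'}
    yield {'start': '222', 'end': '333', 'type': 'c'}
    yield {'start': '333', 'end': '222', 'type': 'c'}


relations = list(unique_relations(produce()))
print(len(relations))  # 3
```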

edited Nov 09 '21 at 23:45

answered Nov 09 '21 at 23:33

user2263572

Yes, define 'seen' as a set should be fine. – marlon Nov 09 '21 at 23:37
Is this going to create a huge seen set? double of the original list_of_dicts? – marlon Nov 09 '21 at 23:40
@Barmar the dict seemed a little more explicit, but agree a set is more Pythonic. My understanding is that performance should be similar with either. – user2263572 Nov 09 '21 at 23:41
Yes. A set is essentially the same as a dict with hidden values. – Barmar Nov 09 '21 at 23:42
@marlon depends how many dupes there are...couldn't tell ya – user2263572 Nov 09 '21 at 23:43
If I define 'seen' as an instance variable, then self.seen.add(...) would be problematic in a multiprocessing program? I have 10 processes to use this function. – marlon Nov 09 '21 at 23:54
@marlon there are better resources available than anything I'd write in a comment. https://stackoverflow.com/questions/6832554/multiprocessing-how-do-i-share-a-dict-among-multiple-processes – user2263572 Nov 10 '21 at 00:00
