5

I have a list of dictionaries, but some of them are duplicates that I want to remove.

The key of each dict is a sequential number.

An example is the following:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
{3: {a:[1,2,3], b: 4}},
.....,
{1000: {a:[2,5,1], b: 99}},
]

Considering the previous example I would like to obtain:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99}},
]

In fact the dictionaries with keys 1 and 3 are identical in their values.

I tried with a set, but since dict is not a hashable type I am not able to do so.

How can I fix the problem?

EDIT

In my case the number of items inside the dict is not fixed, so I can have:

[{1: {a:[1,2,3], b: 4}},
{2: {a:[4,5,6], d: 5}},
.....,
{1000: {a:[2,5,1], b: 99, c:["a","v"]}},
]

where the dict with key 1000 has three elements inside instead of two as the others shown.

A1010
  • One way would be to recursively convert each `dict` to an immutable equivalent, e.g. convert `list`s to `tuple`s, `set`s to `frozenset`s, and `dict`s to *sorted* `tuple`s of key-value pairs. – meowgoesthedog May 09 '19 at 15:54
  • Possible duplicate of [How do you remove duplicates from a list whilst preserving order?](https://stackoverflow.com/questions/480214/how-do-you-remove-duplicates-from-a-list-whilst-preserving-order) – a_guest May 09 '19 at 15:55
  • You should use `sorted(d.items())` or whatever is your criterion of equality when adding to the set. – a_guest May 09 '19 at 15:56
  • @a_guest the time complexity of most answers in that post is O(n^2) which is massively inefficient. – meowgoesthedog May 09 '19 at 15:56
  • @meowgoesthedog [This answer](https://stackoverflow.com/a/480227/3767239) has O(n) time complexity and it is the most upvoted. – a_guest May 09 '19 at 15:58
  • @a_guest you can't add `dict`s to a `set` as they are mutable, which is the premise of OP's question. – meowgoesthedog May 09 '19 at 15:58
  • @meowgoesthedog That's why we both suggested converting to immutable types, I suppose. – a_guest May 09 '19 at 16:00
  • @a_guest the answer you linked does not mention mutability; maybe you meant another one on the same page? – meowgoesthedog May 09 '19 at 16:02
  • @meowgoesthedog Sure it does require a bit of extra work, similar to [this one](https://stackoverflow.com/a/1600806/3767239). – a_guest May 09 '19 at 16:07
  • 1
    @meowgoesthedog But converting mutable types to their immutable equivalents would mean that there would be no way to distinguish the two, so `[1, 2]` would be considered a duplicate to `(1, 2)`, which may not be desirable. – blhsing May 09 '19 at 16:18
  • @meowgoesthedog That will not solve the problem of `[1, 2]` being treated as a duplicate to `(1, 2)`, since `[1, 2]` is converted to `(1, 2)` as a key. – blhsing May 09 '19 at 16:21
  • 1
    @blhsing sorry I realised what you meant afterwards, yes you are right. Tagging the objects with their original types could work as a resolution, although it would further complicate things. – meowgoesthedog May 09 '19 at 16:22

5 Answers

5

To get around the limitation of @jdehesa's solution, where [1, 2] would be treated as a duplicate of (1, 2), you can preserve the data types by using pprint.pformat instead to serialize the data structure. Since pprint.pformat sorts dicts by keys and sets by items, {1: 2, 3: 4} is properly considered the same as {3: 4, 1: 2}, but [1, 2] is not considered a duplicate of (1, 2):

from pprint import pformat

lst = [
    {1: {'a': [1, 2, 3], 'b': 4}},
    {2: {'a': [4, 5, 6], 'd': 5}},
    {3: {'b': 4, 'a': [1, 2, 3]}},
    {4: {'a': (4, 5, 6), 'd': 5}},
]
seen = set()
output = []
for d in lst:
    for k, v in d.items():
        # pformat produces a canonical string (dict keys sorted), which is
        # hashable and can therefore be stored in the set of seen values
        signature = pformat(v)
        if signature not in seen:
            seen.add(signature)
            output.append({k: v})

output becomes:

[{1: {'a': [1, 2, 3], 'b': 4}},
 {2: {'a': [4, 5, 6], 'd': 5}},
 {4: {'a': (4, 5, 6), 'd': 5}}]
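
As a quick sanity check of this behaviour (a small sketch, not part of the original code): pformat sorts dict keys by default, so key order does not change the signature, while the list and tuple reprs stay distinct:

print(pformat({1: 2, 3: 4}) == pformat({3: 4, 1: 2}))  # True: dict keys are sorted
print(pformat([1, 2]) == pformat((1, 2)))              # False: different reprs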
blhsing
  • Using `signature = pformat(v, indent=0, width=int(1e9), compact=True)` is going to give better performance especially for deep and long objects. Also this solution is likely to face issues with custom types since `pformat` just uses the object's `__repr__` in that case (that might lead to false duplicates as well as omissions depending on whether the type implements that method or not). If you're only dealing with built-in types however, it's a decent solution. – a_guest May 09 '19 at 18:45
3

You can maybe use a function like this to turn your objects into something hashable:

def make_hashable(o):
    # Recursively replace unhashable containers (dict, list, set) with
    # hashable equivalents (frozenset of pairs, tuple, frozenset)
    if isinstance(o, dict):
        return frozenset((k, make_hashable(v)) for k, v in o.items())
    elif isinstance(o, list):
        return tuple(make_hashable(elem) for elem in o)
    elif isinstance(o, set):
        return frozenset(make_hashable(elem) for elem in o)
    else:
        return o

Then you maintain a set of seen objects and, for each dictionary, keep only the keys whose values you have not seen before:

lst = [
    {1: {'a':[1,2,3], 'b': 4}},
    {2: {'a':[4,5,6], 'd': 5}},
    {3: {'a':[1,2,3], 'b': 4}},
]

seen = set()
result_keys = []
for elem in lst:
    keep_keys = []
    for k, v in elem.items():
        v_hashable = make_hashable(v)
        if v_hashable not in seen:
            seen.add(v_hashable)
            keep_keys.append(k)
    result_keys.append(keep_keys)
result = [{k: elem[k] for k in keys} for elem, keys in zip(lst, result_keys) if keys]
print(result)
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}]

Note that, as blhsing notes in the comments, this has some limitations, such as considering (1, 2) and [1, 2] equal, as well as {1: 2} and {(1, 2)}. Also, some types may not be convertible to an equivalent hashable type.

EDIT: As a_guest suggests, you can work around the type ambiguity by returning the type itself along with the hashable object in make_hashable:

def make_hashable(o):
    t = type(o)
    if isinstance(o, dict):
        o = frozenset((k, make_hashable(v)) for k, v in o.items())
    elif isinstance(o, list):
        o = tuple(make_hashable(elem) for elem in o)
    elif isinstance(o, set):
        o = frozenset(make_hashable(elem) for elem in o)
    return t, o

If you don't need to look into the hashable object, this will easily provide strict type comparison. Note in this case even things like {1, 2} and frozenset({1, 2}) will be different.
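
For instance, a quick check (a sketch using the function above) shows how the type tag keeps otherwise-equal containers apart:

print(make_hashable([1, 2]) == make_hashable((1, 2)))              # False: list vs. tuple
print(make_hashable({1, 2}) == make_hashable(frozenset({1, 2})))   # False: set vs. frozenset
print(make_hashable({1: [2]}) == make_hashable({1: [2]}))          # True: same structure and types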

jdehesa
  • 1
    In `make_hashable` you can return tuples `type(o), ...` in order to avoid that type ambiguity. – a_guest May 09 '19 at 16:36
  • 1
    @a_guest Thanks, you're right, that's a good workaround, I added it in the answer. – jdehesa May 09 '19 at 17:08
  • I edited the question since the number of items inside each dict is not fixed, so a loop that assumes two items inside each dict will not work properly. – A1010 May 10 '19 at 09:53
  • @AndreaM In the code I posted I don't assume any particular dictionary structure (in fact it would work even if each dictionary contained more than one key). Is this not working for you as it is now? – jdehesa May 10 '19 at 09:57
  • When you use `for k, v in elem.items():` aren't you assuming that the number of items inside the dict is two? If I try with dicts of different sizes it doesn't return the right result (it seems to split the keys inside the nested dictionaries, such as _a_ or _b_) – A1010 May 10 '19 at 10:04
  • @AndreaM `elem.items()` iterates over every key/value pair in the dict. In this case, `elem` here is the top level dict in the list (e.g. `{1: {'a':[1,2,3], 'b': 4}}`), and in the example all of these have just one entry (key `1`, key `2`, key `3`). The "internal" dicts are transformed recursively by `make_hashable`. Can you give an example input where it does not work? – jdehesa May 10 '19 at 10:26
  • you are right, your solution works! I only had to fix the main _for_ loop (in the "real case" the data structure that I showed in the example is inside another dictionary). Thanks! – A1010 May 10 '19 at 12:35
2

This is the simplest solution I've been able to come up with, assuming a nested dictionary like

{1: {'a': [1, 2, 3, 5, 79], 'b': 234, ...}}

As long as the only container inside the dictionary is a list like {'a': [1, 2, 3, ...]}, this will work. Or you can just add a simple check, as the function below shows.


def serialize(dct):  # this is the sub {'a': [1,2,3]} dictionary
    tmp = []
    for value in dct.values():
        if type(value) == list:
            tmp.append(tuple(value))
        else:
            tmp.append(value)
    return tuple(tmp)

def clean_up(lst):
    seen = set()
    clean = []
    for dct in lst:
        # grabs the 1..1000 key inside the primary dictionary
        # assuming there is only 1 key being the "id" or the counter etc...
        key = list(dct.keys())[0] 
        serialized = serialize(dct[key])
        if serialized not in seen:
            seen.add(serialized)
            clean.append(dct)
    return clean

So the function serialize takes the nested dictionary and creates a simple tuple from its contents. This tuple is then checked against the set "seen" to verify it is unique.
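
For example, applied to data shaped like the question's (a small sketch reusing the functions above):

lst = [
    {1: {'a': [1, 2, 3], 'b': 4}},
    {2: {'a': [4, 5, 6], 'd': 5}},
    {3: {'a': [1, 2, 3], 'b': 4}},
]
print(clean_up(lst))
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}]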

benchmarks

Generate a data set using some random values, just because:

import random
import string

lst = []
for i in range(1, 1000):
    dct = {
        i: {
            random.choice(string.ascii_letters): [n for n in range(random.randint(0, i))],
            random.choice(string.ascii_letters): random.randint(0, i)
        }
    }
    lst.append(dct)

Running the benchmarks:


%timeit clean_up(lst)
3.25 ms ± 17.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit jdhesa(lst)
126 ms ± 606 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

As seen, the function clean_up is significantly faster, though simpler (not necessarily a good thing) in the checks its implementation performs.

William Bright
1

You can define a custom hash of your dictionaries by subclassing dict:

class MyData(dict):

    def __hash__(self):
        # hash a frozenset of (key, repr(value)) pairs so that key order does
        # not matter and unhashable values contribute through their repr
        return hash(frozenset((k, repr(v)) for k, v in self.items()))

l = [
    {1: {'a': [1, 2, 3], 'b': 4}},
    {2: {'a': [4, 5, 6], 'd': 5}},
    {3: {'b': 4, 'a': [1, 2, 3]}},
    {4: {'a': (4, 5, 6), 'd': 5}},
]

s = set([MyData(*d.values()) for d in l])

This is assuming that all the dictionaries in the list have only one key-value pair.
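
If you also need the deduplicated list back with its original numeric keys, a possible follow-up sketch (not spelled out above) is to use MyData only for the membership test:

seen = set()
result = []
for d in l:
    key, value = next(iter(d.items()))  # each outer dict holds a single pair
    marker = MyData(value)
    if marker not in seen:
        seen.add(marker)
        result.append({key: value})

print(result)
# [{1: {'a': [1, 2, 3], 'b': 4}}, {2: {'a': [4, 5, 6], 'd': 5}}, {4: {'a': (4, 5, 6), 'd': 5}}]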

Jacques Gaudin
  • 1
    Assuming he's perfectly aware of all the keys at all times though! – William Bright May 09 '19 at 18:18
  • Unfortunately that is not my case. I edited the question to point out this problem. I don't know the keys of the dict beforehand, and it could be very, very long, so I would like to avoid a solution where I have to manually code all the possible keys (that I may know) – A1010 May 10 '19 at 10:10
0

I don't know how big your list is or how many duplicates are in it, but, just in case, here is a basic solution.
It might not be efficient, but you don't have to worry about the types of the elements:

import datetime as dt

data = [
    {1: {"b": 4, "a":[1,2,3]}},
    {2: {"a":[4,5,6], "d": 5}},
    {3: {"a":[1,2,3], "b": 4}},
    {4: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
    {5: {'a': dt.datetime(2019, 5, 10), 'd': set([4])}},
    {6: {"a":[2,5,1], "b": 99}},
    {7: {"a":[5,2,1], "b": 99}},
    {8: {"a":(5,2,1), "b": 99}}
]



seen = []
output = []
for d in data:
    for k, v in d.items():
        if v not in seen:
            seen.append(v)
            output.append({k:v})

>>> print(output)
[{1: {'a': [1, 2, 3], 'b': 4}},
 {2: {'a': [4, 5, 6], 'd': 5}},
 {4: {'a': datetime.datetime(2019, 5, 10, 0, 0), 'd': {4}}},
 {6: {'a': [2, 5, 1], 'b': 99}},
 {7: {'a': [5, 2, 1], 'b': 99}},
 {8: {'a': (5, 2, 1), 'b': 99}}]
Lante Dellarovere