6

I have nested dictionaries that may contain other dictionaries or lists. I need to be able to compare a list (or set, really) of these dictionaries to show that they are equal.

The order of the list is not uniform. Typically, I would turn the list into a set, but it is not possible since there are values that are also dictionaries.

a = {'color': 'red'}
b = {'shape': 'triangle'}
c = {'children': [{'color': 'red'}, {'age': 8},]}

test_a = [a, b, c] 
test_b = [b, c, a]

print(test_a == test_b)  # False
print(set(test_a) == set(test_b))  # TypeError: unhashable type: 'dict'

Is there a good way to approach this to show that test_a has the same contents as test_b?

Abe Clark
  • 166
  • 1
  • 10
  • You're going to need to write a custom function to traverse your data structures and determine equality. There is no built-in capability for this. – skrrgwasme Aug 04 '16 at 18:19
  • are the inner lists uniform? Or do their elements can be in arbitrary order as well? Do they nest mora than two levels (i.e., the conatined elements and lists can contain other lists and dictionaries themselves) - or are all elements on the lists you are comparing shallow? – jsbueno Aug 04 '16 at 21:59
  • @uoɥʇʎPʎzɐɹC: Less trivial than you might think, if you want it to be efficient and not fragile. – user2357112 Aug 04 '16 at 22:14
  • 1
    How did you end up with such different formats? – Padraic Cunningham Aug 04 '16 at 22:16
  • @jsbueno yes, elements can be in arbitrary order, yes they can contain other lists or dictionaries – Abe Clark Aug 04 '16 at 23:51
  • @PadraicCunningham comparing two xml documents where order is not significant as part of a test suite – Abe Clark Aug 04 '16 at 23:52
  • see my updated answer - two solutions, test their speed against other solutions **for real data**, not toy data. Asymptotically, please. – noɥʇʎԀʎzɐɹƆ Aug 05 '16 at 00:47
  • you should plot the speeds relative to the size of the data and see the growth rate of each – noɥʇʎԀʎzɐɹƆ Aug 05 '16 at 00:52

5 Answers5

4

You can use a simple loop to check if each of one list is in the other:

def areEqual(a, b):
    if len(a) != len(b):
        return False

    for d in a:
        if d not in b:
            return False
    return True
jsbueno
  • 99,910
  • 10
  • 151
  • 209
m_callens
  • 6,100
  • 8
  • 32
  • 54
  • 1
    You should check the other direction too – Ohad Eytan Aug 04 '16 at 18:21
  • comsider `a=[1]` and `b=[1,2]`, your function return `True` – Ohad Eytan Aug 04 '16 at 18:23
  • test it with [a, b, c] and [a, b, c, d] you will see :) – Cabu Aug 04 '16 at 18:24
  • Thank you, this does make sense for the example I provided. In my actual scenario, there may be additional nested arrays deeper that would keep this from working (since order matters) – Abe Clark Aug 04 '16 at 21:04
  • you are aware this function cam be quite inefficient for large lists, don't you? – jsbueno Aug 04 '16 at 21:47
  • Anyway, I was thinking on the proposed solution with "sorted" - but those don't work because (1) the lists have heteerogenous elements, and (2) one can't easily sort "dicts" themselves. So, yes, this is a good answer that will work! – jsbueno Aug 04 '16 at 21:56
  • Because the "is" is not the key here - the O(n²) comparison is. :-) We can fix the `is` – jsbueno Aug 04 '16 at 22:10
  • Anyway, I had upvoted, and had skipped the "is" part. As I don't think it is that important to the case, I just edited it to the "one obvious way to do it". – jsbueno Aug 04 '16 at 22:13
  • you have a PEP 8 violation. you will be terminated. – noɥʇʎԀʎzɐɹƆ Aug 05 '16 at 15:12
2

I suggest writing a function that turns any Python object into something orderable, with its contents, if it has any, in sorted order. If we call it canonicalize, we can compare nested objects with:

canonicalize(test_a) == canonicalize(test_b)

Here's my attempt at writing a canonicalize function:

def canonicalize(x):
    if isinstance(x, dict):
        x = sorted((canonicalize(k), canonicalize(v)) for k, v in x.items())
    elif isinstance(x, collections.abc.Iterable) and not isinstance(x, str):
        x = sorted(map(canonicalize, x))
    else:
        try:
            bool(x < x) # test for unorderable types like complex
        except TypeError:
            x = repr(x) # replace with something orderable
    return x

This should work for most Python objects. It won't work for lists of heterogeneous items, containers that contain themselves (which will cause the function to hit the recursion limit), nor float('nan') (which has bizarre comparison behavior, and so may mess up the sorting of any container it's in).

It's possible that this code will do the wrong thing for non-iterable, unorderable objects, if they don't have a repr function that describes all the data that makes up their value (e.g. what is tested by ==). I picked repr as it will work on any kind of object and might get it right (it works for complex, for example). It should also work as desired for classes that have a repr that looks like a constructor call. For classes that have inherited object.__repr__ and so have repr output like <Foo object at 0xXXXXXXXX> it at least won't crash, though the objects will be compared by identity rather than value. I don't think there's any truly universal solution, and you can add some special cases for classes you expect to find in your data if they don't work with repr.

Blckknght
  • 100,903
  • 11
  • 120
  • 169
0

In this case they are the same dicts so you can compare ids (docs). Note that if you introduced a new dict whose values were identical it would still be treated differently. I.e. d = {'color': 'red'} would be treated as not equal to a.

sorted(map(id, test_a)) == sorted(map(id, test_b))

As @jsbueno points out, you can do this with the kwarg key.

sorted(test_a, key=id) == sorted(test_b, key=id)
Alex
  • 18,484
  • 8
  • 60
  • 80
  • 1
    thank you. Unfortunately in the real case I am not comparing the same dicts, but rather two dicts with the same contents – Abe Clark Aug 04 '16 at 19:36
  • Besides not resolving the problem, this usage of "sorted" is incorrect unless one really wants the IDs - you should be using the `key` argument to sorted, instead of using `map` on the source lists. – jsbueno Aug 04 '16 at 22:08
0

If the elements in both lists are shallow, the idea of sorting them, and then comparing with equality can work. The problem with @Alex's solution is that he is only using "id" - but if instead of id, one uses a function that will sort dictionaries properly, things shuld just work:

def sortkey(element):
   if isinstance(element, dict):
         element = sorted(element.items())
   return repr(element)

sorted(test_a, key=sortkey) == sorted(test_b, key=sotrkey) 

(I use an repr to wrap the key because it will cast all elements to string before comparison, which will avoid typerror if different elements are of unorderable types - which would almost certainly happen if you are using Python 3.x)

Just to be clear, if your dictionaries and lists have nested dictionaries themselves, you should use the answer by @m_callens. If your inner lists are also unorderd, you can fix this to work, jsut sorting them inside the key function as well.

jsbueno
  • 99,910
  • 10
  • 151
  • 209
-2

An elegant and relatively fast solution:

class QuasiUnorderedList(list):
    def __eq__(self, other):
        """This method isn't as ineffiecient as you think! It runs in O(1 + 2 + 3 + ... + n) time, 
        possibly better than recursively freezing/checking all the elements."""
        for item in self:
            for otheritem in other:
                if otheritem == item:
                    break
            else:
                # no break was reached, item not found.
                return False
        return True

This runs in O(1 + 2 + 3 + ... + n) flat. While slow for dictionaries of low depth, this is faster for dictionaries of high depth.

Here's a considerably longer snippet which is faster for dictionaries where depth is low and length is high.

class FrozenDict(collections.Mapping, collections.Hashable):  # collections.Hashable = portability
    """Adapated from http://stackoverflow.com/a/2704866/1459669"""

    def __init__(self, *args, **kwargs):
        self._d = dict(*args, **kwargs)
        self._hash = None

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)

    def __getitem__(self, key):
        return self._d[key]

    def __hash__(self):
        # It would have been simpler and maybe more obvious to
        # use hash(tuple(sorted(self._d.iteritems()))) from this discussion
        # so far, but this solution is O(n). I don't know what kind of
        # n we are going to run into, but sometimes it's hard to resist the
        # urge to optimize when it will gain improved algorithmic performance.
        # Now thread safe by CrazyPython
        if self._hash is None:
            _hash = 0
            for pair in self.iteritems():
                _hash ^= hash(pair)
        self._hash = _hash
        return _hash


def freeze(obj):
    if type(obj) in (str, int, ...):  # other immutable atoms you store in your data structure
        return obj
    elif issubclass(type(obj), list):  # ugly but needed
        return set(freeze(item) for item in obj)
    elif issubclass(type(obj), dict):  # for defaultdict, etc.
        return FrozenDict({key: freeze(value) for key, value in obj.items()})
    else:
        raise NotImplementedError("freeze() doesn't know how to freeze " + type(obj).__name__ + " objects!")


class FreezableList(list, collections.Hashable):
    _stored_freeze = None
    _hashed_self = None

    def __eq__(self, other):
        if self._stored_freeze and (self._hashed_self == self):
            frozen = self._stored_freeze
        else:
            frozen = freeze(self)
        if frozen is not self._stored_freeze:
            self._stored_hash = frozen
        return frozen == freeze(other)

    def __hash__(self):
        if self._stored_freeze and (self._hashed_self == self):
            frozen = self._stored_freeze
        else:
            frozen = freeze(self)
        if frozen is not self._stored_freeze:
            self._stored_hash = frozen
        return hash(frozen)


class UncachedFreezableList(list, collections.Hashable):
    def __eq__(self, other):
        """No caching version of __eq__. May be faster.
        Don't forget to get rid of the declarations at the top of the class!
        Considerably more elegant."""
        return freeze(self) == freeze(other)

    def __hash__(self):
        """No caching version of __hash__. See the notes in the docstring of __eq__2"""
        return hash(freeze(self))

Test all three (QuasiUnorderedList, FreezableList, and UncachedFreezableList) and see which one is faster in your situation. I'll betcha it's faster than the other solutions.

noɥʇʎԀʎzɐɹƆ
  • 9,967
  • 2
  • 50
  • 67
  • `O(1+2+3+...+n)` time is `O(n**2)` time. You also have `QuasiUnorderedList([1, 2]) == [1, 2, 3]`, and your implementation is generally slower than the one m_callens posted earlier. – user2357112 Aug 05 '16 at 06:20
  • @user2357112 Yes, worst case `O(n**2)` time. But in practice, is `1+2+3` equal to `3**2`? – noɥʇʎԀʎzɐɹƆ Aug 05 '16 at 14:39
  • `1+2+3>(3**2)/2`, and in general, `1+2+3+...+n==(n**2+n)/2`. If you want to argue constant factors, you still have a worse constant factor than what m_callens posted. – user2357112 Aug 05 '16 at 15:06