4

I'm trying to port some code from Python 2 to Python 3. It's ugly stuff but I'm trying to get the Python 3 results to be as identical to the Python 2 results as possible. I have code similar to this:

import json

# Read a list of json dictionaries by line from file.

objs = []
with open('data.txt') as fptr:
    for line in fptr:
        objs.append(json.loads(line))

# Give the dictionaries a reliable order.

objs = sorted(objs)

# Do something externally visible with each dictionary:

for obj in objs:
    do_stuff(obj)

When I port this code from Python 2 to Python 3, I get an error:

TypeError: unorderable types: dict() < dict()

So I changed the sorted line to this:

objs = sorted(objs, key=id)

But the ordering of the dictionaries still changed between Python 2 and Python 3.

Is there a way to replicate the Python 2 comparison logic in Python 3? Is it simply that id was used before and is not reliable between Python versions?

Amy Forbes
  • 98
  • 1
  • 6
  • Do you have to replicate Python 2's logic in 3, or do you simply need *some* ordering which is consistent between the two? – DSM Sep 04 '14 at 21:44
  • `sorted(objs, key=sorted)` works if you only care about key comparisons. (and implement it in both 2 and 3, there is really no way to match 2's arbitrary ordering in 3) – roippi Sep 04 '14 at 21:49
  • I'd like to replicate the ordering. – Amy Forbes Sep 04 '14 at 21:49
  • @roippi: Not if the dicts can themselves contain dicts. You need to pass `key=lambda x: sorted(x, key=lambda y: sorted(y, …`. – abarnert Sep 04 '14 at 21:53
  • @abarnert right, my brain somehow glossed over that possibility. – roippi Sep 04 '14 at 21:54
  • The real question here is: why do you want to replicate an arbitrary order in the first place? What advantage do you get from that? Some tests that used to only fail if you run on a different machine or under more load or whatever now fail more consistently… is that a bad thing? – abarnert Sep 04 '14 at 22:14
  • If you read the same json data, then wouldn't the dictionaries be appended to the `objs` list in the same order regardless of which verion of Python you were using? In other words why do you need to sort them into some arbitrary order? – martineau Sep 04 '14 at 22:20
  • Unfortunately, it's not as easy as updating a few tests. It's part of a larger system and the ordering has undue influence. Though the spec says the ordering is arbitrary, clearly CPython 2.7 chose something. I just don't know how to replicate that in CPython 3 (if it's even possible). Seems like this is a case where having the `cmp` keyword for `sorted` in Python 3 would help. – Amy Forbes Sep 04 '14 at 22:22
  • @martineau the order of the lines in the file varies but the observable order from `do_stuff` shouldn't. Hence the need for a "total order" on `dict` objects. – Amy Forbes Sep 04 '14 at 22:25

5 Answers5

4

If you want the same behavior as earlier versions of Python 2.x in both 2.7 (which uses an arbitrary sort order instead) and 3.x (which refuses to sort dicts), Ned Batchelder's answer to a question about how sorting dicts works gets you part of the way there, but not all the way.


First, it gives you an old-style cmp function, not a new-style key function. Fortunately, both 2.7 and 3.x have functools.cmp_to_key to solve that. (You could of course instead rewrite the code as a key function, but that may make it harder to see any differences between the posted code and your code…)


More importantly, it not only doesn't do the same thing in 2.7 and 3.x, it doesn't even work in 2.7 and 3.x. To understand why, look at the code:

def smallest_diff_key(A, B):
    """return the smallest key adiff in A such that A[adiff] != B[bdiff]"""
    diff_keys = [k for k in A if A.get(k) != B.get(k)]
    return min(diff_keys)

def dict_cmp(A, B):
    if len(A) != len(B):
        return cmp(len(A), len(B))
    adiff = smallest_diff_key(A, B)
    bdiff = smallest_diff_key(B, A)
    if adiff != bdiff:
        return cmp(adiff, bdiff)
    return cmp(A[adiff], b[bdiff])

Notice that it's calling cmp on the mismatched values.

If the dicts can contain other dicts, that's relying on the fact that cmp(d1, d2) is going to end up calling this function… which is obviously not true in newer Python.

On top of that, in 3.x cmp doesn't even exist anymore.

Also, this relies on the fact that any value can be compared with any other value—you might get back arbitrary results, but you won't get an exception. That was true (except in a few rare cases) in 2.x, but it's not true in 3.x. That may not be a problem for you if you don't want to compare dicts with non-comparable values (e.g., if it's OK for {1: 2} < {1: 'b'} to raise an exception), but otherwise, it is.

And of course if you don't want arbitrary results for dict comparison, do you really want arbitrary results for value comparisons?

The solution to all three problems is simple: you have to replace cmp, instead of calling it. So, something like this:

def mycmp(A, B):
    if isinstance(A, dict) and isinstance(B, dict):
        return dict_cmp(A, B)
    try:
        return A < B
    except TypeError:
        # what goes here depends on how far you want to go for consistency

If you want the exact rules for comparison of objects of different types that 2.7 used, they're documented, so you can implement them. But if you don't need that much detail, you can write something simpler here (or maybe even just not trap the TypeError, if the exception mentioned above is acceptable).

Community
  • 1
  • 1
abarnert
  • 354,177
  • 51
  • 601
  • 671
0

Is there a way to replicate the Python 2 comparison logic in Python 3? Is it simply that id was used before and is not reliable between Python versions?

id is never "reliable". The id that you get for any given object is a completely arbitrary value; it can be different from one run to the next, even on the same machine and Python version.

Python 2.x doesn't actually document that it sorts by id. All it says is:

Outcomes other than equality are resolved consistently, but are not otherwise defined.

But that just makes the point even better: the order is explicitly defined to be arbitary (except for being consistent during any given run). Which is exactly the same guarantee you get by sorting with key=id in Python 3.x, whether or not it actually works the same way.*

So you're doing the same thing in 3.x. The fact that the two arbitrary orders are different just means that arbitrary is arbitrary.


If you want some kind of repeatable ordering for a dict based on what it contains, you just have to decide what that order is, and then you can build it. For example, you can sort the items in order then compare them (recursively passing down the same key function in case the items are or contain dicts).**

And, having designed and implemented some kind of sensible, non-arbitrary ordering, it will of course work the same way in 2.7 and 3.x.


* Note that it's not equivalent for identity comparisons, only for ordering comparisons. If you're only using it for sorted, this has the consequence that your sort will no longer be stable. But since it's in arbitrary order anyway, that hardly matters.

** Note that Python 2.x used to use a rule similar to this. From the footnote to the above: "Earlier versions of Python used lexicographic comparison of the sorted (key, value) lists, but this was very expensive for the common case of comparing for equality." So, that tells you that it's a reasonable rule—as long as it's actually the rule you want, and you don't mind the performance cost.

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • You lost me at "recursively passing down the same key function" How do I do that? – Amy Forbes Sep 04 '14 at 21:56
  • @AmyForbes: Your key function has to transform all of the values in some way appropriate to the value; for values that are themselves dicts, the key function itself is the way to do that. – abarnert Sep 04 '14 at 22:08
0

The logic in CPython2.x is somewhat complicated since the behavior is dictated by dict.__cmp__. A python implementation can be found here.

However, if you really want a reliable ordering, you'll need to sort on a better key than id. You could use functools.cmp_to_key to transform the comparison function from the linked answer to a key function, but really, it's not a good ordering anyway as it is completely arbitrary.

Your best bet is to sort all of the dictionaries by a field's value (or multiple fields). operator.itemgetter can be used for this purpose quite nicely. Using this as a key function should give you consistent results for any somewhat modern implementation and version of python.

Community
  • 1
  • 1
mgilson
  • 300,191
  • 65
  • 633
  • 696
  • I tried using the Python implementation that you link to but `cmp` is not defined. And what happens in the recursive case where a value is a dictionary that abarnert mentions? – Amy Forbes Sep 04 '14 at 21:55
  • @AmyForbes: `cmp` no longer exists in 3.x, but you can write it as a one-liner based on the [2.x docs](https://docs.python.org/2/library/functions.html#cmp): `return 0 if x==y else (-1 if x – abarnert Sep 04 '14 at 21:58
  • @AmyForbes: But that code will break with the recursive case… – abarnert Sep 04 '14 at 21:58
0

If you simply need an order that is consistent across multiple runs of Python on potentially different platforms but don't actually care about the actual order then a simple solution is to dump the dicts to JSON before sorting them:

import json

def sort_as_json(dicts):
    return sorted(dicts, key=lambda d: json.dumps(d, sort_keys=True))

print(list(sort_as_json([{'foo': 'bar'}, {1: 2}])))
# Prints [{1: 2}, {'foo': 'bar'}]

Obviously this only works if your dicts are JSON-representable, but since you load them from JSON anyway that should be no problem. In your case, you could achieve the same result by simply sorting the file you're loading the objects from before deserializing the JSON.

Florian Brucker
  • 9,621
  • 3
  • 48
  • 81
  • Are you sure that 2 dictionaries with the same key-values always have the same json? Since dictionaries are unordered, wouldn't json.dumps produce a different string depending on which order the keys are stored internally? – zimkies Apr 22 '19 at 22:41
  • You're correct, @zimkies, my answer was missing the `sort_keys` option for `json.dumps`. I've updated the code accordingly. Thanks for pointing this out! – Florian Brucker Apr 23 '19 at 09:46
0

You can compare .items()

d1 = {"key1": "value1"}
d2 = {"key1": "value1", "key2": "value2"}
d1.items() <= d2.items()
True

But this is not recursive

d1 = {"key1": "value1", "key2": {"key11": "value11"}}
d2 = {"key1": "value1", "key2": {"key11": "value11", "key12": "value12"}}
d1.items() <= d2.items()
False
Belegnar
  • 721
  • 10
  • 24