0

The following unittest fails when running it (with pytest), but when I debug it, it passes:

def test():
    assert list(set(['B', 'A'])) == ['A', 'B']

I know that sets have no order, but I don't understand how to determine the result of list(s) if s is a set, which is crucial when writing unittests. A workaround mentioned here would be to change the code to sorted(s), but I want to understand what happens when list(s) is run.

Qaswed
  • 3,649
  • 7
  • 27
  • 47
  • 3
    Since sets have no order, it is possible that `list(set(['B', 'A']))` will return `['B', 'A']`. Sometimes though – 12944qwerty May 27 '21 at 14:28
  • This test is in error - the iteration order of a set is undefined so you cannot rely on the ordering of the resulting list. `sorted` as noted is what you'd want to use - I'd call it a bugfix rather than a workaround, though that's just semantics. – Peter DeGlopper May 27 '21 at 14:35
  • @PeterDeGlopper: Minor quibble: I'd call it "arbitrary" rather than "undefined" to avoid confusion with the concept from languages like C where undefined behavior can, by the spec, include stuff like nasal demons and melting your computer to slag. Python `set`s have an order, it's just arbitrary; the only thing you can count on is that for a given run of the program, for a given `set` that is not modified between one iteration and the next, the iteration order will be the same (e.g. `for x, y in zip(myset, map(func, myset)):` is guaranteed to pair elements with their `func` mapped version). – ShadowRanger May 27 '21 at 14:45
  • @ShadowRanger - equally minor quibble but I think even that behavior, of stability in iteration if there are no modifications, is an implementation detail that isn't formally part of the language spec. It's certainly the only reasonable choice but I don't think an implementation that made an unreasonable choice could truly be said to be not to spec. Just making poor decisions. It's the same distinction you're making in the last comment on this answer - https://stackoverflow.com/a/3812600/2337736 - assuming that hasn't changed in the last 5 years! – Peter DeGlopper May 27 '21 at 15:41
  • (Not stalking you, I just found that answer while trying to verify whether it's strictly required or just almost certainly always the case but _technically_ breakable and noticed the username in the comments!) – Peter DeGlopper May 27 '21 at 15:42
  • @PeterDeGlopper: You got me there. :-) I suppose an aggressively optimized sort of `set` *might* perform rehashings in a background thread or something, when the `set` is not otherwise in use, to avoid a blocking rehash, so something like `list(s) == list(s)` could fail if `s` rehashed between `list` calls. I'll admit I can't find anything that says "iterating the same collection (or `set` specifically) twice without changing it should get the elements in the same order", so technically, you'd have to rely on `itertools.tee` to make that `zip` guaranteed to work under the language spec. Eew. – ShadowRanger May 27 '21 at 16:07

1 Answers1

6

list(s) is just iterating the set in whatever order the set produces its output. The problem is that set order is repeatable only within a given run of Python, and only if the set is constructed in exactly the same way, because:

  1. CPython (and likely others) use an open-addressing based hash table with dummy entries, and iteration is just scanning the whole table for entries in order; if you build the set in different ways, even in the same process, colliding hash codes occurring in different orders produce different orderings (the first one to grab the bucket keeps it, the next is displaced to a different location that may be nowhere near the original bucket).
  2. String or bytes-like data (as well as some esoteric stuff like datetime objects) uses a cryptographic perturbation of the hash with a per-process seed, so in different runs of Python, the exact same set constructed the exact same way may have the entries in an entirely different order.

In short, if you want to check if the set is equal, you have three options:

  1. Keep the data as a set, e.g.:

    def test():
        assert set(['B', 'A']) == {'A', 'B'}
    

    which won't care about order, or

  2. (CPython/PyPy 3.6, any Python 3.7+) Use a dict to simulate a set but with guaranteed insertion ordering, e.g.:

    assert list(dict.fromkeys(['B', 'A'])) == ['B', 'A']  # Passes
    assert list(dict.fromkeys(['B', 'A'])) == ['A', 'B']  # Fails
    

    On older Python, collections.OrderedDict can be used to get the guaranteed ordering instead, though it's slower and more memory-hungry. Unlike even the insertion-ordered dict, OrderedDict's == is order-sensitive, so you could directly compare OrderedDicts instead of converting back to list if you wanted while still requiring a specific key ordering.

  3. For types with a total ordering, use sorted as mentioned in your linked answer to get predictable order

ShadowRanger
  • 143,180
  • 12
  • 188
  • 271