38

I expected the following two tuples

>>> x = tuple(set([1, "a", "b", "c", "z", "f"]))
>>> y = tuple(set(["a", "b", "c", "z", "f", 1]))

to compare unequal, but they don't:

>>> x == y
>>> True

Why is that?

Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
Ashish Anand
  • 522
  • 6
  • 11
  • 23
    why do you expect it to be False - the contents of the two sets are the same ! – Tony Suffolk 66 Sep 30 '14 at 08:08
  • [Relevant.](http://stackoverflow.com/a/26099671/1763356) As Zero Piraeus says, though, hash randomization makes this untrue **for strings, bytes and dates** as `hash(a_string)` will change every time you instantiate the interpreter. – Veedrac Sep 30 '14 at 09:20

4 Answers4

69

At first glance, it appears that x should always equal y, because two sets constructed from the same elements are always equal:

>>> x = set([1, "a", "b", "c", "z", "f"])
>>> y = set(["a", "b", "c", "z", "f", 1])
>>> x
{1, 'z', 'a', 'b', 'c', 'f'}
>>> y
{1, 'z', 'a', 'b', 'c', 'f'}
>>> x == y
True

However, it is not always the case that tuples (or other ordered collections) constructed from two equal sets are equal.

In fact, the result of your comparison is sometimes True and sometimes False, at least in Python >= 3.3. Testing the following code:

# compare.py
x = tuple(set([1, "a", "b", "c", "z", "f"]))
y = tuple(set(["a", "b", "c", "z", "f", 1]))
print(x == y)

... a thousand times:

$ for x in {1..1000}
> do
>   python3.3 compare.py
> done | sort | uniq -c
147 False
853 True

This is because, since Python 3.3, the hash values of strings, bytes and datetimes are randomized as a result of a security fix. Depending on what the hashes are, "collisions" may occur, which will mean that the order items are stored in the underlying array (and therefore the iteration order) depends on the insertion order.

Here's the relevant bit from the docs:

Security improvements:

  • Hash randomization is switched on by default.

https://docs.python.org/3/whatsnew/3.3.html

EDIT: Since it's mentioned in the comments that the True/False ratio above is superficially surprising ...

Sets, like dictionaries, are implemented as hash tables - so if there's a collision, the order of items in the table (and so the order of iteration) will depend both on which item was added first (different in x and y in this case) and the seed used for hashing (different across Python invocations since 3.3). Since collisions are rare by design, and the examples in this question are smallish sets, the issue doesn't arise as often as one might initially suppose.

For a thorough explanation of Python's implementation of dictionaries and sets, see The Mighty Dictionary.

vaultah
  • 44,105
  • 12
  • 114
  • 143
Zero Piraeus
  • 56,143
  • 27
  • 150
  • 160
  • Tried with Python 2.7.3, `for x in {1..5000}`, output is `5000 True`. But anyway, it is true, that we shouldn't rely on that – stalk Sep 30 '14 at 08:46
  • 3
    @stalk As I said, **since Python 3.3**, the behaviour has changed. See also [`object.__hash__`](https://docs.python.org/3/reference/datamodel.html#object.__hash__): "Changing hash values affects the iteration order of dicts, sets and other mappings. **Python has never made guarantees about this ordering** (and it typically varies between 32-bit and 64-bit builds)." (my emphasis) – Zero Piraeus Sep 30 '14 at 08:49
  • Wow. May be they should backport this security fix to 2.7 also. @RobertGrant - considering there are 6 items in the set, the hit ratio of 853 out of 1000 is definitely weird. – Ashish Anand Sep 30 '14 at 09:26
  • @AshishAnand You can enable it on 2.7 with `-R`, although [it's broken before version 3.4](http://bugs.python.org/issue14621). – Veedrac Sep 30 '14 at 09:30
  • @AshishAnand unfortunately, because the fix significantly changes behaviour that might be relied on (although it shouldn't have been), it would likely break working code, which isn't supposed to happen in micro releases (and [there will be no Python 2.8](http://legacy.python.org/dev/peps/pep-0404/)). – Zero Piraeus Sep 30 '14 at 09:33
  • 1
    The hashes may well still be entirely random, but the 853/1000 is not a measure of the collision rate, it is a measure that the hashes are in the same order between the two different sets. Depending on exactly how the hash is calculated - hash("a") < hash("b") could well be true very often even if the randomness is high (measured across multiple calls of hash(item)) – Tony Suffolk 66 Sep 30 '14 at 14:39
  • 1
    @TonySuffolk66 no, it *is* a measure of collisions - what matters is the order in the underlying array, which will only be different between `x` and `y` if the (truncated) hashes of two items are the same (so that the second item inserted has to go somewhere other than the first choice based on its hash). – Zero Piraeus Sep 30 '14 at 19:12
  • The frequency is consistent with a 1 in 6 rate: with an expected value of 167 / 1000, the standard deviation is `sqrt(1000*(1/6)*(5/6) = 11.8`, and `167 - 147 = 20`, less than two standard deviations. I can't quite figure out why the hash would be different just one out of six times. There is one number and five characters? I would test, but I am still on Python 2.7... – Floris Sep 30 '14 at 19:41
  • 3
    @Floris The number is 5/32. See: http://stackoverflow.com/questions/26136894/why-does-tupleset1-a-b-c-z-f-tupleseta-b-c-z-f-1/26136895#26136895 – Veedrac Oct 01 '14 at 08:10
  • 2
    @TonySuffolk66 It's exactly the measure of the collision rate. See: http://stackoverflow.com/questions/26136894/why-does-tupleset1-a-b-c-z-f-tupleseta-b-c-z-f-1/26136895#26136895 – Veedrac Oct 01 '14 at 08:10
12

There are two things at play here.

  1. Sets are unordered. set([1, "a", "b", "c", "z", "f"])) == set(["a", "b", "c", "z", "f", 1])

  2. When you convert a set to a tuple via the tuple constructor it essentially iterates over the set and adds each element returned by the iteration .

The constructor syntax for tuples is

tuple(iterable) -> tuple initialized from iterable's items

Calling tuple(set([1, "a", "b", "c", "z", "f"])) is the same as calling tuple([i for i in set([1, "a", "b", "c", "z", "f"])])

The values for

[i for i in set([1, "a", "b", "c", "z", "f"])]

and

[i for i in set(["a", "b", "c", "z", "f", 1])]

are the same as it iterates over the same set.

EDIT thanks to @ZeroPiraeus (check his answer ). This is not guaranteed. The value of the iteration will not always be the same even for the same set.

The tuple constructor doesn't know the order in which the set is constructed.

Community
  • 1
  • 1
srj
  • 9,591
  • 2
  • 23
  • 27
  • 1
    So is it guaranteed, that if values in two `set`s are the same, they will be iterated in same order? – stalk Sep 30 '14 at 08:30
  • 3
    @stalk **no**, it isn't: see [my answer](http://stackoverflow.com/a/26116307/1014938). – Zero Piraeus Sep 30 '14 at 08:40
  • 2
    You should update your answer as per @ZeroPiraeus's answer - "The values for _ and _ are the same as it iterates over the same set" is not always true. – icedtrees Sep 30 '14 at 08:52
6

Sets are not ordered and are defined only by their membership.

For example, set([1, 2]) == set([2, 1])

Tuples are equal if their members at each position are equal, but since the collections the tuples were created from iterate equally (in increasing order), tuples end up being equal as well.

Mdev
  • 2,440
  • 2
  • 17
  • 25
Amadan
  • 191,408
  • 23
  • 240
  • 301
5

so you have two lists - which have the same content but in different orders, you convert them into sets - which will be equal, as they have the same content.

When you convert those sets into tuples, they will be converted in the same order, as they are the same set, so the tuples will be the same.

This is true in Python2.7 - but from 3.3 onwards when hashes are randomised you will not be able to guarantee this - as the two sets, although equal in content wont neccessarily iterate in the same order.

Tony Suffolk 66
  • 9,358
  • 3
  • 30
  • 33
  • I didn't downvote, but your answer is incorrect, regardless of hash randomization: for example, in every Python since 2.4 (when `set` became a builtin), `tuple(set([0, 8])) != tuple(set([8, 0]))` (and the same applies to `sets.Set` in Python 2.3). – Zero Piraeus Oct 01 '14 at 09:06
  • so a set doesn't do what set should do - i.e. be equal when the contents are equal - very weird. – Tony Suffolk 66 Oct 01 '14 at 11:55
  • The fact that two objects are equal doesn't necessarily mean the result of the same operation on them should be. In this case the result occurs because two equal sets with different iteration order are used to construct an object which cares about order, but there are other examples, e.g. `8.0 == 8` but `str(8.0) != str(8)`. – Zero Piraeus Oct 01 '14 at 18:08