67

I understand that sets in Python are unordered, but I'm curious about the 'order' they're displayed in, as it seems to be consistent. They seem to be out-of-order in the same way every time:

>>> set_1 = set([5, 2, 7, 2, 1, 88])
>>> set_2 = set([5, 2, 7, 2, 1, 88])
>>> set_1
set([88, 1, 2, 5, 7])
>>> set_2
set([88, 1, 2, 5, 7])

...and another example:

>>> set_3 = set('abracadabra')
>>> set_4 = set('abracadabra')
>>> set_3
set(['a', 'r', 'b', 'c', 'd'])
>>>> set_4
set(['a', 'r', 'b', 'c', 'd'])

I'm just curious why this would be. Any help?

jonrsharpe
  • 115,751
  • 26
  • 228
  • 437
ivan
  • 6,032
  • 9
  • 42
  • 65
  • 4
    The same data running on the same version of Python will be put into the same hash buckets each time in the same order, so you can be sure they'll be the same under those particular circumstances. – Ry- Aug 28 '12 at 18:22
  • 2
    Related: [Why is python ordering my dictionary like so?](http://stackoverflow.com/questions/526125/why-is-python-ordering-my-dictionary-like-so). Sets are implemented much like dictionaries. – David Robinson Aug 28 '12 at 18:23
  • @minitech Is that by accident or by design? I mean, is there anything in the spec that says that list(set(...)) will return the same each time? Unless so, safe coding would dictate that you treat any given ordering as perfectly random and unlikely to repeat. – Kirk Strauser Aug 28 '12 at 18:49
  • @KirkStrauser: Nope, but it's just common sense. Why would there be a random number generator involved in the hash algorithm? But yes, good code doesn't rely on the order of this. – Ry- Aug 28 '12 at 19:11
  • @minitech Believe it or not, that's a security feature of Perl (see http://perldoc.perl.org/perlsec.html#Algorithmic-Complexity-Attacks, and http://www.cs.rice.edu/~scrosby/hash/CrosbyWallach_UsenixSec2003.pdf for justification). It was enabled by default on at least one Perl release; I don't know what its current status is. – Kirk Strauser Aug 28 '12 at 19:55
  • @KirkStrauser: Python has the same thing. See my comment on mgilson's answer. – John Y Aug 28 '12 at 20:14
  • @RyanO'Hara -- That's not true in more recent versions of python. Some types have hash randomization enabled. Multiple runs with the same strings can (and usually _do_) end up with different orders due to a random seeding of their `__hash__` function. – mgilson Jul 14 '16 at 23:07
  • @DavidRobinson Beginning with Python 3.6, sets are no longer implemented like dictionaries because dictionaries are now ordered: https://docs.python.org/3.6/whatsnew/3.6.html#other-language-changes – dshefman Mar 05 '20 at 20:30

5 Answers5

61

You should watch this video (although it is CPython1 specific and about dictionaries -- but I assume it applies to sets as well).

Basically, python hashes the elements and takes the last N bits (where N is determined by the size of the set) and uses those bits as array indices to place the object in memory. The objects are then yielded in the order they exist in memory. Of course, the picture gets a little more complicated when you need to resolve collisions between hashes, but that's the gist of it.

Also note that the order that they are printed out is determined by the order that you put them in (due to collisions). So, if you reorder the list you pass to set_2, you might get a different order out if there are key collisions.

For example:

list1 = [8,16,24]
set(list1)        #set([8, 16, 24])
list2 = [24,16,8]
set(list2)        #set([24, 16, 8])

Note the fact that the order is preserved in these sets is "coincidence" and has to do with collision resolution (which I don't know anything about). The point is that the last 3 bits of hash(8), hash(16) and hash(24) are the same. Because they are the same, collision resolution takes over and puts the elements in "backup" memory locations instead of the first (best) choice and so whether 8 occupies a location or 16 is determined by which one arrived at the party first and took the "best seat".

If we repeat the example with 1, 2 and 3, you will get a consistent order no matter what order they have in the input list:

list1 = [1,2,3]
set(list1)      # set([1, 2, 3])
list2 = [3,2,1]
set(list2)      # set([1, 2, 3])

since the last 3 bits of hash(1), hash(2) and hash(3) are unique.


1Note The implementation described here applies to CPython dict and set. I think that the general description is valid for all modern versions of CPython up to 3.6. However, starting with CPython3.6, there is an additional implementation detail that actually preserves the insertion order for iteration for dict. It appears that set still do not have this property. The data structure is described by this blog post by the pypy folks (who started using this before the CPython folks). The original idea (at least for the python ecosystem) is archived on the python-dev mailing list.

mgilson
  • 300,191
  • 65
  • 633
  • 696
  • Nice example. Unfortunately it implies that the order out will be the same as the order in, and I don't think that will always be the case. – Mark Ransom Aug 28 '12 at 18:30
  • @MarkRansom -- That isn't always the case, but I had to come up with an example where I knew for a fact there would be hash collisions ... I'll add a disclaimer. – mgilson Aug 28 '12 at 18:31
  • @MarkRansom -- Added another (counter) example where the hashes are all unique demonstrating that the order of the set is consistent in that scenario. Thanks for pointing out how that could be potentially confusing -- hopefully the answer is better now. – mgilson Aug 28 '12 at 18:41
  • 1
    Not that it affects the basic ideas presented here, but starting with Python 3.3, [hash randomization](http://www.gossamer-threads.com/lists/python/dev/979881) of strings and dates is turned on by default (and is available in the latest updates of 2.6, 2.7, 3.1, and 3.2). Your examples should not be affected, but the OP's string example could be. – John Y Aug 28 '12 at 19:02
  • @mgilson, the link for the video is timing out, this is the youtube link if you want to change it https://www.youtube.com/watch?v=C4Kc8xzcA68 – Padraic Cunningham Dec 16 '15 at 23:49
  • 1
    @PadraicCunningham -- Thanks. Youtube's probably more reliable than whatever that other site was. – mgilson Dec 16 '15 at 23:53
  • 4
    The note is rather misleading, `dict` implementation has changed in CPython 3.6, but `set`s remain un-ordered https://stackoverflow.com/questions/45581901/are-sets-ordered-like-dicts-in-python3-6/ – Chris_Rands Sep 05 '17 at 11:25
  • @Chris_Rands -- good find. I've tried to make the note a little more clear -- In the past, `set` and `dict` were very similar in implementation (at least, as far as I understand it). Now it appears that they've diverged (which I didn't realize). – mgilson Sep 05 '17 at 12:36
  • 2
    @mgilson Yes they diverged, but I'm no authority on CPython implementation. Anyway, I would recommend you just remove the note all together, it doesn't seem relevant to mention `dict`s here and may confuse readers IMO – Chris_Rands Sep 05 '17 at 15:51
6

The reason of such behavior is than Python use hash tables for dictionary implementation: https://en.wikipedia.org/wiki/Hash_table#Open_addressing

Position of the key is defined by it's memory address. If you know Python reuse memory for some objects:

>>> a = 'Hello world'
>>> id(a)
140058096568768
>>> a = 'Hello world'
>>> id(a)
140058096568480

You can see that object a has different address every time it's init.

But for small integers it isn't change:

>>> a = 1
>>> id(a)
40060856
>>> a = 1
>>> id(a)
40060856

Even if we create second object with different name it would be the same:

>>> b = 1
>>> id(b)
40060856

This approach allow to save memory which Python interpreter consume.

Eugene Soldatov
  • 9,755
  • 2
  • 35
  • 43
3

AFAIK Python sets are implemented using a hash table. The order in which the items appear depends on the hash function used. Within the same run of the program, the hash function probably does not change, hence you get the same order.

But there are no guarantees that it will always use the same function, and the order will change across runs - or within the same run if you insert a lot of elements and the hash table has to resize.

MAK
  • 26,140
  • 11
  • 55
  • 86
3

One key thing that's hinted at mgilson's great answer, but isn't mentioned explicitly in any of the existing answers:

Small integers hash to themselves:

>>> [hash(x) for x in (1, 2, 3, 88)]
[1, 2, 3, 88]

Strings hash to values that are unpredictable. In fact, from 3.3 on, by default, they're built off a seed that's randomized at startup. So, you'll get different results for each new Python interpreter session, but:

>>> [hash(x) for x in 'abcz']
[6014072853767888837,
 8680706751544317651,
 -7529624133683586553,
 -1982255696180680242]

So, consider the simplest possible hash table implementation: just an array of N elements, where inserting a value means putting it in hash(value) % N (assuming no collisions). And you can make a rough guess at how large N will be—it's going to be a little bigger than the number of elements in it. When creating a set from a sequence of 6 elements, N could easily be, say, 8.

What happens when you store those 5 numbers with N=8? Well, hash(1) % 8, hash(2) % 8, etc. are just the numbers themselves, but hash(88) % 8 is 0. So, the hash table's array ends up holding 88, 1, 2, NULL, NULL, 5, NULL, 7. So it should be easy to figure out why iterating the set might give you 88, 1, 2, 5, 7.

Of course Python doesn't guarantee that you'll get this order every time. A small change to the way it guesses at the right value for N could mean 88 ends up somewhere different (or ends up colliding with one of the other values). And, in fact, running CPython 3.7 on my Mac, I get 1, 2, 5, 7, 88.0

Meanwhile, when you build a hash from a sequence of size 11 and then insert randomized hashes into it, what happens? Even assuming the simplest implementation, and assuming there are no collisions, you still have no idea what order you're going to get. It will be consistent within a single run of the Python interpreter, but different the next time you start it up. (Unless you set PYTHONHASHSEED to 0, or to some other int value.) Which is exactly what you see.


Of course it's worth looking at the way sets are actually implemented rather than guessing. But what you'd guess based on the assumption of the simplest hash table implementation is (barring collisions and barring expansion of the hash table) exactly what happens.

abarnert
  • 354,177
  • 51
  • 601
  • 671
2

Sets are based on a hash table. The hash of a value should be consistent so the order will be also - unless two elements hash to the same code, in which case the order of insertion will change the output order.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622