
The question arose while answering another SO question (there).

When I iterate several times over a Python set (without changing it between calls), can I assume it will always return its elements in the same order? And if not, what is the rationale for changing the order? Is it deterministic, random, or implementation-defined?

And when I run the same Python program repeatedly (it is not random and does not depend on its input), will I get the same ordering for sets?

The underlying question is whether Python set iteration order depends only on the algorithm used to implement sets, or also on the execution context.

kriss
  • I think the consensus here is that no sane language would provide a data structure whose order changes spontaneously, but no sane programmer would make such an assumption without being explicitly told it. *Clearly* the answer is immutability by default. – Josh Lee Sep 28 '10 at 13:09
  • @JoshLee: Go's map iteration is deliberately randomized to catch bugs caused by ordering assumptions. – user2357112 May 17 '20 at 03:22
  • I would use the term "deterministic" instead of "stable" – OrenIshShalom Aug 07 '22 at 08:19

8 Answers


There's no formal guarantee about the stability of sets. However, in the CPython implementation, as long as nothing changes the set, the items will be produced in the same order. Sets are implemented as open-addressing hash tables (with a pseudo-random probe sequence), so inserting or removing items can completely change the order (in particular, when that triggers a resize, which reorganizes how the items are laid out in memory). You can also have two identical sets that nonetheless produce their items in different orders, for example:

>>> s1 = {-1, -2}
>>> s2 = {-2, -1}
>>> s1 == s2
True
>>> list(s1), list(s2)
([-1, -2], [-2, -1])

Unless you're very certain you have the same set and nothing touched it in between the two iterations, it's best not to rely on the order staying the same. Making seemingly irrelevant changes to, say, functions you call in between could produce very hard-to-find bugs.
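For example (a CPython-specific sketch; whether the two orders actually differ depends on the per-run string hashes), growing a set and then removing the added items again leaves exactly the same elements, but possibly in a different order, because the intermediate resize is never undone:

s = {'a', 'b', 'c', 'd', 'e'}
before = list(s)
for i in range(1000):       # force at least one resize of the hash table
    s.add(i)
for i in range(1000):       # remove the extra items again; the table never shrinks back
    s.discard(i)
after = list(s)
print(before == after)      # often False, though it can come out True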

Thomas Wouters
  • I would say that the stability of dict at least is guaranteed. The docs say: "If items(), keys(), values(), iteritems(), iterkeys(), and itervalues() are called with no intervening modifications to the dictionary, the lists will directly correspond." This implies that calling any of those methods repeatedly will return the same sequence if the dict is not modified. It also says that iter(dict) is a shortcut for dict.iterkeys() – Dave Kirby Sep 28 '10 at 12:35
  • I said "no *formal* guarantee". The dict docs can change (and such details have indeed changed in the past, not to mention differed between implementations); the "formal" (but rather terse) language specification at http://docs.python.org/ref doesn't mention it either way. – Thomas Wouters Sep 28 '10 at 12:39
  • @ThomasWouters: The language spec is not concerned with types except insofar as they affect the syntax and mentioning the built-ins that happen to implement the high-level concepts. The docs for `dict` are considered binding; sure, the implementation has changed over time, but the docs for `dict` guarantee very little. They specifically mention when it's a CPython implementation detail, and the repeatability of iteration ordering (given no intervening modifications) is not an implementation detail. The Python Standard Library is normative, not just the Python Language Reference. – ShadowRanger Oct 13 '16 at 19:32
  • For reference, [the `dict` requirements for `keys`/`items`/`values` since as early as 2.0 mention this repeatability guarantee](https://docs.python.org/2.0/lib/typesmapping.html) (see footnote 2). No such guarantee has ever been made for `set` (it shares algorithms with `dict` in many versions, but it's not 100% consistent, and the guarantee isn't as useful as it is for `dict`, so there is little benefit in making that guarantee and binding implementations to it). – ShadowRanger Oct 13 '16 at 19:37
  • If using a `frozenset` instead of `set`, would it be a sufficiently explicit way of iterating in a consistent order? – PlasmaBinturong Mar 16 '23 at 13:43

A set or frozenset is inherently an unordered collection. Internally, sets are based on a hash table, and the order of keys depends both on the insertion order and on the hash algorithm. In CPython (aka standard Python), integers that fit in a machine word (32 or 64 bits) hash to themselves, but text strings, byte strings, and datetime objects hash to integers that vary randomly from run to run; you can control that by setting the PYTHONHASHSEED environment variable.

From the __hash__ docs:

Note

By default, the __hash__() values of str, bytes and datetime objects are “salted” with an unpredictable random value. Although they remain constant within an individual Python process, they are not predictable between repeated invocations of Python.

This is intended to provide protection against a denial-of-service caused by carefully-chosen inputs that exploit the worst case performance of a dict insertion, O(n^2) complexity. See http://www.ocert.org/advisories/ocert-2011-003.html for details.

Changing hash values affects the iteration order of dicts, sets and other mappings. Python has never made guarantees about this ordering (and it typically varies between 32-bit and 64-bit builds).

See also PYTHONHASHSEED.

The results of hashing objects of other classes depend on the details of the class's __hash__ method.

The upshot of all this is that you can have two sets containing identical strings, but when you convert them to lists they can compare unequal. Or they may not. ;) Here's some code that demonstrates this. On some runs it will just loop without printing anything, but on other runs it will quickly find a set that uses a different order from the original.

from random import seed, shuffle

seed(42)

data = list('abcdefgh')
a = frozenset(data)
la = list(a)
print(''.join(la), a)

while True:
    shuffle(data)
    lb = list(frozenset(data))
    if lb != la:
        print(''.join(data), ''.join(lb))
        break    

Typical output:

dachbgef frozenset({'d', 'a', 'c', 'h', 'b', 'g', 'e', 'f'})
deghcfab dahcbgef
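As mentioned above, PYTHONHASHSEED pins the hash randomization from outside the program, which is what you need for repeatable set ordering of strings across separate runs. A small illustration (the particular order you see is still a CPython implementation detail, and the script name is just a placeholder):

# Run as, for example:
#   PYTHONHASHSEED=0 python this_script.py
# With hash randomization disabled, str/bytes hashes are the same in every
# run, so the set below prints in the same order on each invocation.
# Without it, the order can change from one invocation to the next.
import sys

print(sys.flags.hash_randomization)   # 0 when PYTHONHASHSEED=0 is set
print(list(frozenset('abcdefgh')))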
PM 2Ring

And when I call the same python program repeatedly (not random, not input dependent), will I get the same ordering for sets?

I can answer this part of the question now after a quick experiment. Using the following code:

class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]   # Python 2 print statement

I can trigger the behaviour that I was asking about in the other question. If I run this repeatedly then the output changes, but not on every run. It seems to be "weakly random" in that it changes slowly. This is certainly implementation-dependent, so I should say that I'm running the MacPorts Python 2.6 on Snow Leopard. While the program will output the same answer for long runs of time, doing something that affects the system entropy pool (writing to the disk mostly works) will sometimes kick it into a different output.

The class Foo is just a simple int wrapper, as experiments show that this doesn't happen with sets of ints. I think the problem is caused by the lack of __eq__ and __hash__ members on the object, although I would dearly love to know the underlying explanation / ways to avoid it. Also useful would be some way to reproduce / repeat a "bad" run. Does anyone know what seed it uses, or how I could set that seed?
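One way to avoid the run-to-run variation (a sketch based on the explanation in the comments below, assuming CPython): give Foo content-based __eq__ and __hash__ so it no longer hashes by id(); with int payloads, which hash to themselves, the output is then the same on every run:

class Foo(object):
    def __init__(self, val):
        self.val = val

    def __repr__(self):
        return str(self.val)

    def __eq__(self, other):
        return isinstance(other, Foo) and self.val == other.val

    def __hash__(self):
        return hash(self.val)   # ints hash to themselves in CPython

x = set()
for y in range(500):
    x.add(Foo(y))
print list(x)[-10:]   # the same ten objects, in the same order, on every run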

Andrew
  • This is terribly easy to explain: because of the lack of `__eq__` and `__hash__`, your objects hash based on `id()`, and the id for the objects changes between runs. You aren't repeatedly printing `list()[-10:]` of the *same* set, just one that was created the same way. – Thomas Wouters Sep 28 '10 at 21:41
  • Why do the default implementations of __eq__ and __hash__ rely on the random module... It seems as if they use id() + something else. If I add methods that use id() explicitly then the behaviour changes. – Andrew Sep 29 '10 at 07:48
  • They don't rely on the `random` module at all. They only use the `id` of the object. The `id` of an object is the memory address, so *between runs* it will differ because of all manner of reasons. If I implement `__hash__` and `__eq__` in terms of `id()`, the behaviour is the same. I'm not sure what you did, but I guess you did something different; you'll have to put up the code to tell. Or, you can stop worrying about the behaviour *between runs*, because there's no kind of guarantee about the order of sets or dicts in that case anyway. – Thomas Wouters Sep 29 '10 at 12:43
  • Thanks for the comments. I'll wrap up some test code and ask a separate question. – Andrew Sep 29 '10 at 13:10

It’s definitely implementation-defined. The specification of a set says only that

Being an unordered collection, sets do not record element position or order of insertion.

Why not use OrderedDict to create your own OrderedSet class?
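A minimal sketch of that idea (the class and method names are illustrative, not a standard-library API), keeping the members as keys of an OrderedDict so iteration always follows insertion order:

from collections import OrderedDict

class OrderedSet(object):
    """A tiny insertion-ordered set built on OrderedDict (illustrative only)."""
    def __init__(self, iterable=()):
        self._data = OrderedDict((item, None) for item in iterable)

    def add(self, item):
        self._data.setdefault(item, None)

    def discard(self, item):
        self._data.pop(item, None)

    def __contains__(self, item):
        return item in self._data

    def __iter__(self):
        return iter(self._data)   # always insertion order

    def __len__(self):
        return len(self._data)

print(list(OrderedSet('hello')))   # ['h', 'e', 'l', 'o'], the same every time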

Josh Lee
  • I'm not saying I will use that behavior, just wondering where the bug seen by another poster could be coming from. Also, there is a very similar property for dicts that *is* guaranteed by the Python documentation (see http://stackoverflow.com/questions/3666237/are-order-of-keys-and-values-in-python-dictionary-guaranteed-to-be-the-same). That there should be such a difference between sets and dicts is quite surprising. – kriss Sep 28 '10 at 12:42

The answer is simply NO.

Python set ordering is NOT stable between runs.

I did a simple experiment to show this.

The code:

import random
random.seed(1)

x = []

class aaa(object):
    # No __eq__ or __hash__, so instances hash by id(), which differs between runs.
    def __init__(self, a, b):
        self.a = a
        self.b = b

for i in range(5):
    x.append(aaa(random.choice('asf'), random.randint(1, 4000)))

for j in x:
    print(j.a, j.b)

print('====')
for j in set(x):
    print(j.a, j.b)

Run it twice and you will get results like these:

First time result:

a 2332
a 1045
a 2030
s 1935
f 1555
====
a 2030
a 2332
f 1555
a 1045
s 1935

Process finished with exit code 0

Second time result:

a 2332
a 1045
a 2030
s 1935
f 1555
====
s 1935
a 2332
a 1045
f 1555
a 2030

Process finished with exit code 0

The reason is explained in the comments on this answer.

However, there are ways to make the order stable; one is sketched below.
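For instance (an illustration, not from the original answer; it assumes the fields are comparable), iterate in an explicitly defined, content-based order instead of relying on the hash table's layout:

import random
random.seed(1)

class aaa(object):
    def __init__(self, a, b):
        self.a = a
        self.b = b

x = [aaa(random.choice('asf'), random.randint(1, 4000)) for _ in range(5)]

# Sort on the objects' contents instead of relying on hash-table layout;
# this prints the same sequence on every run.
for j in sorted(set(x), key=lambda o: (o.a, o.b)):
    print(j.a, j.b)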

Statham
  • True, but this is not what I asked. Your answer is about running the same process twice; my question was about iterating over the same set twice in the same process. OK, this is the second half of my question. – kriss May 17 '20 at 15:24

As pointed out, this is strictly an implementation detail.

But as long as you don’t change the structure between calls, there should be no reason for a read-only operation (= iteration) to change with time: no sane implementation does that. Even randomized (= non-deterministic) data structures that can be used to implement sets (e.g. skip lists) don’t change the reading order when no changes occur.

So, being rational, you can safely rely on this behaviour.

(I’m aware that certain GCs may reorder memory in a background thread, but even this reordering will not be noticeable at the level of the data structure, unless a bug occurs.)
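In other words, the assumption being relied on is just this (a trivial check; it holds in CPython as long as the set is not modified between the two iterations):

s = {'spam', 'eggs', 'ham'}
assert list(s) == list(s)                # two passes over an unmodified set agree
assert list(s) == [item for item in s]   # any form of read-only iteration agrees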

Konrad Rudolph
  • Being rational, we would also try to capture this assumption in a unit test so the program does not break in mysterious ways at a later date. :) – Josh Lee Sep 28 '10 at 12:26
  • @jleedev: True, but unfortunately I can easily see such a unit test fail to flag the error: if the behaviour is indeed nondeterministc, writing a reliable unit test for this behaviour will be incredibly hard. For example, I had a unit test suite on a parallel program that would fail only about once out of a hundred calls due to a race condition. In 99% of the cases, it would run through, even though it was a *very* thorough test suite. – Konrad Rudolph Sep 28 '10 at 12:30
  • But do you know for a fact that sets don't change their order from day to day? That would be insidious, e.g. for doctest printouts. – ankostis Apr 30 '23 at 05:08

By definition, a set is an unordered collection of unique elements ("Unordered collections of unique elements"). You should care only about the interface, not the implementation. If you want an ordered enumeration, you should probably put the elements into a list and sort them.

There are many different implementations of Python. Don't rely on undocumented behaviour, as your code could break on different Python implementations.
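For example, if what you actually need is a well-defined order:

s = {3, 1, 2}
print(sorted(s))   # [1, 2, 3] -- an explicit, documented order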

Joe

I had the same question and tried the following:

import random
for _ in range(10**4):
    a = set([random.random(), random.random(), random.random(), random.random()])
    b = list(a)
    for _ in range(10**3):
        for i, val in enumerate(a):
            if b[i] != val:
                print('not same')

Nothing was printed, so I guess the order does not change while merely reading the set. As others have pointed out, the order may change when additional elements are added to it. I don't know what else may cause a re-ordering, so I will make sure to always use lists when the order is important.

KaPy3141