2

I'm confused by the behavior of Python's set() in this example:

random_number_list = [randint(1, 10) for _ in range(10)]
# This will be sorted!
unique_numbers = set(random_number_list)

print(
    f"random_number_list/unique_numbers with same upper bound for randint() and range():\n{random_number_list=}\n{unique_numbers=}\n"
)

random_number_list = [randint(1, 100) for _ in range(10)]
# This will not be sorted.
unique_numbers = set(random_number_list)

print(
    f"random_number_list/unique_numbers with different upper bound for randint() and range():\n{random_number_list=}\n{unique_numbers=}\n"
)

It seems like set() is sorting the random_number_list if the length of the list and the upper bound of randint() are the same:

➜  ch-2 python --version
Python 3.10.0
➜  ch-2 python find_k_smallest.py 
random_number_list/unique_numbers with same upper bound for randint() and range():
random_number_list=[10, 1, 2, 5, 5, 7, 8, 8, 2, 8]
unique_numbers={1, 2, 5, 7, 8, 10}

random_number_list/unique_numbers with different upper bound for randint() and range():
random_number_list=[35, 1, 17, 26, 17, 74, 26, 11, 44, 13]
unique_numbers={1, 35, 74, 11, 44, 13, 17, 26}

The docs state:

A set object is an unordered collection of distinct hashable objects.

What's going on here? Why is set() sorting the random_number_list in certain cases and not others?

Edit Neither of these questions addresses my issue:

djs
  • 3,947
  • 3
  • 14
  • 28
  • 1
    Why is this a problem? – pjs Jan 02 '22 at 17:04
  • I don't think it's a problem, I just don't understand why it's happening and was hoping someone could enlighten me. – djs Jan 02 '22 at 17:05
  • 3
    If it happens "in certain cases and not others", then the docs are telling you the truth and you shouldn't count on ordering even though it may occur. – pjs Jan 02 '22 at 17:11
  • Related: https://stackoverflow.com/a/61100205/12416453 – Ch3steR Jan 02 '22 at 17:14
  • 3
    Sets are generally unordered(doesn't mean it's strictly unsorted). The ordering may change from implementation to implementation. – Ch3steR Jan 02 '22 at 17:15
  • Interesting. I only have a few Python binaries installed and all of them exhibit the same behavior, but I can see that there are differences from the link @Ch3steR posted. I guess I'll chalk this up to an oddity of the implementation! I didn't know if this was something Pythonistas know about but newcomers don't, as I'm still pretty new to the language. – djs Jan 02 '22 at 17:21
  • The bottom-line is to not write code that counts on the ordering of members of a set being in any particular order — so it will work regardless of the implementation. – martineau Jan 02 '22 at 17:28
  • Related: https://stackoverflow.com/questions/63845073/my-python-multiprocesses-are-apparently-not-independent/64866191#64866191 – Peter O. Jan 02 '22 at 17:46
  • @PeterO. neither of those answer my question. I'm curious as to why `set()` is ordering its members sometimes when the size of the `list` is related to the bounds of `randint()`. I don't care about an ordered `set` implementation, though. Thanks for the links, though! – djs Jan 02 '22 at 19:38

3 Answers3

3

To actually answer your question. Many implementations of sets use an implementation similar to hashtables. Items are hashed and placed into an "array" based on that hash value.

Notice that for small integers, hash(x) == x. So 1 will go into slot 1, 2 will go into slot 2, 3 into slot 3, etc. Then when the elements are read, you'll literally get the elements sorted.

However if your integers are bigger than the array size, then their placement in the array will be modulo the size of the array. They will no longer be sorted.

Again, I have not actually looked at the Python implementation. This is simply a possible explanation of what could be happening.

Frank Yellin
  • 9,127
  • 1
  • 12
  • 22
  • 1
    There isn't a single "Python implementation", there are several, so looking at any one of them won't tell you anything that can be generalized. – martineau Jan 02 '22 at 17:33
  • This is very enlightening, thank you! It lines up with what I'm seeing as well. – djs Jan 02 '22 at 19:41
  • @martineau. I agree that there is no single "Python implementation", but I think we can agree that CPython is the most commonly used implementation. Artifacts of CPython's implementation (e.g. ordered dictionaries) sometimes become part of the standard. And yes, the implementation of sets in CPython appears to be pretty much what I expected. And no, OP shouldn't rely on the set being sorted. – Frank Yellin Jan 03 '22 at 21:49
2

"Unordered" does not mean "not sorted". It means no attempt is made to provide any particular order; the order that falls out from the implementation may or may not be a sorted order.

chepner
  • 497,756
  • 71
  • 530
  • 681
  • Right. But it *is* ordered every time when the size of the `list` that I convert to a `set` is related to the upper bound of the `randint()` that populates the `list`. That's the part that is confusing to me. There must be something ordering the `set` for this condition, I just wasn't sure if that was a known property or something that seems to be a weird implementation quirk. – djs Jan 02 '22 at 19:39
  • It's an implication quirk, the particular quirk being that sets are based on hash tables, the hash value is the integer itself, and that you aren't using any integers larger than the size of the hash table. At least on my machine, it took about 2 seconds to find a set that *isn't* sorted: `set([1000000000000000000, 2])`. – chepner Jan 02 '22 at 19:42
  • Correct, not all sets are sorted. I also demonstrated that in my example. But, certain sets are *always* sorted. The hash table explanation makes sense, I'm researching it now. I don't have a CS background but I'm currently studying CS topics so this has been very enlightening! – djs Jan 02 '22 at 19:46
0

You wrote in a comment:

I'm curious as to why set() is ordering its members sometimes when the size of the list is related to the bounds of randint().

That is an implementation detail that an application should not concern itself with, as even in Python 3.7 (and 3.10), sets are documented as "unordered collection[s]". You can look up, for example, the source code of CPython to find out how sets are implemented in CPython.

See also:

Peter O.
  • 32,158
  • 14
  • 82
  • 96
  • I understand that it's not something to rely on, I was merely curious why it was happening. Thanks for the links, this seems to be something I can dig into: https://github.com/python/cpython/blob/main/Objects/setobject.c. – djs Jan 02 '22 at 23:33