1

I've just observed an interesting behavior of a Set in Python3 and am wondering why.

Given the class:

class Tab:

    @staticmethod
    def set(size):
        return set(map(lambda label: Tab(label), range(1, size + 1)));

    def __init__(self, label):
        self.label = label
        self.up = True

    def __eq__(self, other):
        if not isinstance(other, Tab):
            return NotImplemented
        return self.label == other.label

    def __hash__(self):
        return hash(self.label)

    def __str__(self):
        return str(self.label)

When I invoke Tab.set(9) I get a set of tabs that, when represented as a string via:

"|%s|" % "|".join([str(tab) for tab in self.tabs])

Generates:

|1|2|3|4|5|6|7|8|9|

However, if I only modify __eq__ and __hash__ to incorporate the up attribute:

def __eq__(self, other):
    if not isinstance(other, Tab):
        return NotImplemented
    return self.label == other.label and self.up == other.up

def __hash__(self):
    return hash((self.label, self.up))

The implicit ordering of the set changes, and the string representation becomes:

|9|8|7|6|5|4|3|2|1|

I understand that sets are not ordered. But why did the implicit ordering change, when the static set method is unchanged, creating each element in the set from 1 through 9, just as before?

And, what might I do to preserve the implicit ordering so my set appears to be in order, just as before? (Note that the change is caused by the change in __hash__ and not __eq__.)

MSeifert
  • 145,886
  • 38
  • 333
  • 352
ybakos
  • 8,152
  • 7
  • 46
  • 74

2 Answers2

3

why did the implicit ordering change

Because set is implemented as hashtable in CPython. So:

  • The slot where an item is stored in the set depends on the (truncated) hash of the item (and to some extend the insertion order because the slot may be already occupied).
  • The hash of the item depends on the __hash__ method of that item.

And if you iterate over the set you iterate over the hashtable (excluding missing entries) slot-by-slot. So by changing the hashes you may change the order.

And in your case the hashes are different because you changed the __hash__ method implementation, so it's to be expected that the order is different:

>>> [hash(tab) for tab in Tab.set(9)]  # first code
[1, 2, 3, 4, 5, 6, 7, 8, 9]

>>> [hash(tab) for tab in Tab.set(9)]  # second code
[3713072971709512581, 3713088127104978631, 3713071889183430056, 3713087044578896106, 3713083796991988331, 3713082714465905806, 3713085962048483481, 3713084879522400956, 3713081631935493181]

what might I do to preserve the implicit ordering so my set appears to be in order, just as before

If you want them to be ordered my suggestion would be to not use a set - use an ordered collection, just to name one example: list. There are also ways to efficiently remove duplicates from a list and preserving the order.

But if you want to keep the set and want them to be sorted by the label property you could also use sorted:

sorted(tab.set(9), key=lambda t: t.label)

>>> [str(t) for t in sorted(Tab.set(9), key=lambda t: t.label)]
['1', '2', '3', '4', '5', '6', '7', '8', '9']

Inspecting the CPython 3.7.4 hashtable using Cython

Note: This is inspecting implementation details which may change across versions. The Cython code may not even work for different CPython versions. Don't take them literally and never ever depend on them.

In case you're interested how the actual implementation details of CPython look like, you can use this Cython helper for Jupyter I've written in this answer to "Updating a set while iterating over its elements":

%load_ext Cython
%%cython

from cpython cimport PyObject, PyTypeObject
cimport cython

cdef extern from "Python.h":
    ctypedef Py_ssize_t Py_hash_t

    struct setentry:
        PyObject *key
        Py_hash_t hash

    ctypedef struct PySetObject:
        Py_ssize_t ob_refcnt
        PyTypeObject *ob_type
        Py_ssize_t fill
        Py_ssize_t used
        Py_ssize_t mask
        setentry *table
        Py_hash_t hash
        Py_ssize_t finger

        setentry smalltable[8]
        PyObject *weakreflist

cpdef print_set_table(set inp):
    cdef PySetObject* innerset = <PySetObject *>inp
    for idx in range(innerset.mask+1):
        if (innerset.table[idx].key == NULL):
            print(idx, '<EMPTY>')
        else:
            print(idx, innerset.table[idx].hash, <object>innerset.table[idx].key)

This will print the internal hashtable, each line contains the slot number, the hash and the stored item.

In the first case:

>>> print_set_table(Tab.set(9))
0 <EMPTY>
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 6 6
7 7 7
8 8 8
9 9 9
10 <EMPTY>
11 <EMPTY>
[...]
30 <EMPTY>
31 <EMPTY>

Second case:

>>> print_set_table(Tab.set(9))
0 <EMPTY>
[...]
4 <EMPTY>
5 3713072971709512581 9
6 <EMPTY>
7 3713088127104978631 7
8 3713071889183430056 8
9 <EMPTY>
10 3713087044578896106 6
11 3713083796991988331 3
12 <EMPTY>
13 <EMPTY>
14 3713082714465905806 2
15 <EMPTY>
[...]
24 <EMPTY>
25 3713085962048483481 5
26 <EMPTY>
27 <EMPTY>
28 3713084879522400956 4
29 3713081631935493181 1
30 <EMPTY>
31 <EMPTY>
MSeifert
  • 145,886
  • 38
  • 333
  • 352
  • So in other words, when hashing on just the int, which matches my own conceptual ordering of ints, I've gotten "lucky" with the implicit ordering. – ybakos Sep 17 '19 at 17:00
  • 1
    @ybakos To some extend yes, it's because the hash of integers is just the integer (except for -1) and you also inserted them in-order. In case you're really interested in the implementation details (which may change across versions, so think of them as just-for-fun) I've added Cython code to introspect the set-internals. – MSeifert Sep 17 '19 at 17:08
  • 1
    Super awesome answer @MSeifert! – ybakos Sep 18 '19 at 20:41
2

why did the implicit ordering change, when the static set method is unchanged, creating each element in the set from 1 through 9, just as before?

This method is unchanged, but calls built-in set with Tab objects.

And since the __hash__ method has changed, set may change the internal order which you should not rely on anyway.

What would work in your case would be to sort when printing:

"|%s|" % "|".join([str(tab) for tab in sorted(self.tabs,lambda t:(t.label, t.up)])

or without lambda, define a __lt__ method so sort can compare objects

Jean-François Fabre
  • 137,073
  • 23
  • 153
  • 219