4

As we know, Some of Python's data structures use hash tables for storing items like set or dictionary. So there is no order in these objects. But it seems that, for some sequences of numbers that's not true.

For example consider the following examples :

>>> set([7,2,5,3,6])
set([2, 3, 5, 6, 7])

>>> set([4,5,3,0,1,2])
set([0, 1, 2, 3, 4, 5])

But it isn't sorted if we make a small change :

>>> set([8,2,5,3,6])
set([8, 2, 3, 5, 6])

So the question is: How does Python's hash function work on integer sequences?

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • 4
    "Not ordered" does not mean "will never appear ordered"; it just means there is no particular *guaranteed* order. – chepner May 29 '15 at 19:24
  • 2
    fun fact: the `go` authors decided to actively randomize the iteration over these data structures to remind users there is no guaranty over its order. http://blog.golang.org/go-maps-in-action (under "Iteration order") – RickyA May 29 '15 at 20:49

1 Answers1

9

Although there are a lot of questions in SO about hash and its order,but no one of them explains the algorithm of hash function.

So all you need here is know that how python calculate the indices in hash table.

If you go through the hashtable.c file in CPython source code you'll see the following lines in _Py_hashtable_set function which shows the way python calculate the index of hash table keys :

key_hash = ht->hash_func(key);
index = key_hash & (ht->num_buckets - 1);

So as the hash value of integers is the integer itself * (except for -1) the index is based on the number and the length of your data structure (ht->num_buckets - 1) and it calculated with Bitwise-and between (ht->num_buckets - 1) and the number.

Now consider the following example with set that use hash-table :

>>> set([0,1919,2000,3,45,33,333,5])
set([0, 33, 3, 5, 45, 333, 2000, 1919])

For number 33 we have :

33 & (ht->num_buckets - 1) = 1

That actually it's :

'0b100001' & '0b111'= '0b1' # 1 the index of 33

Note in this case (ht->num_buckets - 1) is 8-1=7 or 0b111.

And for 1919 :

'0b11101111111' & '0b111' = '0b111' # 7 the index of 1919

And for 333 :

'0b101001101' & '0b111' = '0b101' # 5 the index of 333

And as well as for the preceding examples in question :

>>> set([8,2,5,3,6])
set([8, 2, 3, 5, 6])

'0b1000' & '0b100'='0b0' # for 8
'0b110' & '0b100'='0b100' # for 8

* The hash function for class int :

class int:
    def __hash__(self):
        value = self
        if value == -1:
            value = -2
        return value

Mazdak
  • 105,000
  • 18
  • 159
  • 188
  • There's a small error in your code snippet at the end: `value == -2` would not assign `-2` to `value`, and would also always produce `False` since we check if `value == -1` on the previous line. Since the edit is less than 6 characters, I cannot make it myself. – Alex Huszagh Jan 27 '16 at 21:14
  • 2
    In the examples you give at the end, you seem to be assuming that `ht->num_buckets` is equal to the number of items in the set. That's not true: the number of buckets is a power of 2, and is usually significantly larger than the number of items in the set (indeed, it's bad for hash collisions for all or nearly all of the buckets to be filled; the heuristic that Python uses is to enlarge the hash table once it becomes 2/3 full). – Mark Dickinson Feb 13 '16 at 20:17
  • @MarkDickinson Yes I was thought like that. It's interesting , but what's the problem with being as size of items (for buckets)? And can you pleas explain that how Cpython calculate the number of buckets, so that I could update my answer with correct informations? – Mazdak Feb 13 '16 at 20:33