How does Python's hash table algorithm handle empty buckets?

Question

From Fluent Python...

To fetch the value at my_dict[search_key] , Python calls hash(search_key) to obtain the hash value of search_key and uses the least significant bits of that number as an offset to look up a bucket in the hash table (the number of bits used depends on the current size of the table). If the found bucket is empty, KeyError is raised.

If only the least significant bits of the hash value are used, is it possible that an empty bucket and non-empty bucket share the same least significant bits and a KeyError is mistakenly raised because the empty bucket was encountered first?
What does using something as an "offset" mean in this context? Please provide an example.

Taking some number (say 8) bits from the hash gives you a number in a fixed range (say 0-255). You use that number to index a fixed-length array of buckets. That index is the "offset". — Blorgbeard, Jun 12 '19 at 17:58
Those least-significant bits are the sole identifier of the buckets - I would have used "index" rather than "offset" to describe this. It's not possible for multiple buckets to share the same bits. — jasonharper, Jun 12 '19 at 17:58

user4815162342 · Answer 1 · 2019-06-13T05:07:12.083

is it possible that an empty bucket and non-empty bucket share the same least significant bits

Buckets are identified by their position, i.e. each bucket has a distinct position in the table. That's why you can't have two buckets sharing the least significant bits - those bits uniquely determine the bucket.

What you can have is two keys hashing to the same value - or to two different values sharing the same bottom bits - a situation known as collision. Python resolves a collision at insert by finding a different bucket, and at lookup by comparing the key stored in the bucket to the one being looked for. The goal of a good hash table is to minimize collisions, which is achieved by using a good hash function and by keeping the table sparse, so that buckets are somewhat spread out.

What does using something as an "offset" mean in this context?

Offset is another word for bucket position. It is simply the index of a particular bucket in the array that holds them.

"That's why you can't have two buckets sharing the least significant bits - those bits uniquely determine the bucket" is false. — lalebarde, Jan 26 '22 at 17:22

score 0 · Answer 2 · answered Jan 26 '22 at 17:48

Python dictionary is a hash table with open addressing (cf 1, 2). The size of the hash table increases when a load threshold is reached. Then the number of empty buckets is minimized.

Open addressing means not only buckets corresponding to the hashes of the inserted elements are used since collisions may occur (same hash for two different elements). Each new element entering in collision is put in the first empty bucket available, jumping from the collided bucket by an offset modulo the table size, with as many jumps as necessary.

To answer your first question, if the found bucket is empty in a fetch operation, it means that the table is broken, it should not happen, then the KeyError is raised.

Your second question is answered above: the offset is used to find an empty bucket when inserting and to find the right key when fetching, both from the bucket pointed by the hash.

Note : in CPython, a pseudo-random number is used instead of the least significant bits to determine the offset.

How does Python's hash table algorithm handle empty buckets?

2 Answers2