The unordered_set should provide O(1) lookup time given a good hash function.
Each bucket contains items with the same hash value.
Let's assume that our hash function is ideal and gives no collisions at all.
Hash values can vary from 0 to max(std::size_t).
How can an unordered_set be organized without allocating a contiguous memory interval for the buckets while still providing O(1) lookup?
We cannot allocate a contiguous interval, because we would then be using a lot of memory for just a few hash values - 0 and 1000000, for example.
The values in between are not used at all, yet we would have allocated memory for them.
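A quick way to probe the scenario described above (a minimal sketch; the `SpreadHash` functor is purely illustrative and not part of the question): force two elements to hash to 0 and 1000000 and ask the container how many buckets it actually keeps.

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

// Illustrative hash: the two stored values hash to 0 and 1000000,
// mimicking the widely spaced hash values mentioned in the question.
struct SpreadHash {
    std::size_t operator()(int v) const noexcept {
        return v == 0 ? 0u : 1000000u;
    }
};

int main() {
    std::unordered_set<int, SpreadHash> s{0, 42};
    // bucket_count() reports how many buckets really exist; it stays small,
    // so the container clearly does not allocate 1000001 slots.
    std::cout << "size: " << s.size()
              << ", buckets: " << s.bucket_count() << '\n';
}
```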

- The complexity of [looking up a key](https://en.cppreference.com/w/cpp/container/unordered_set/find) is "[c]onstant on average, worst case linear in the size of the container.", not O(1). – Some programmer dude Sep 22 '20 at 09:45
- `Each bucket contains items with the same hash value.` `Let's assume that our hash function is ideal and gives no collisions at all.` These two statements contradict each other. If your hash is ideal then you will have as many buckets as there are elements, so your lookup will be O(1). – Roy2511 Sep 22 '20 at 09:48
- Yep, it will be. But how do we allocate those buckets? If we allocate them contiguously then we use a lot of memory but provide fast lookup. – Vladimir Tsyshnatiy Sep 22 '20 at 09:50
- You are overthinking this. `unordered_set` is implemented in such a way that the number of buckets doesn't exceed the number of items. IMO this question suffers from the [XY problem](http://xyproblem.info/). – Marek R Sep 22 '20 at 10:03
- >> in such a way that the number of buckets doesn't exceed the number of items. -- Yes, it is. But how exactly can this be done? Assume we have two hash values, 0 and 1000. Hence we have two buckets that correspond to hash values 0 and 1000. We do not spend any additional memory on buckets. Hence we need to use a linked list to store the buckets (GCC does so as far as I know). Hence lookup is slow, because we need to iterate over the buckets until we find the one that corresponds to the particular hash value. That looks linear for a good hash function. – Vladimir Tsyshnatiy Sep 22 '20 at 10:18
- The hash value isn't usually the bucket index. See for instance [Wikipedia on hash tables](https://en.wikipedia.org/wiki/Hash_table). – molbdnilo Sep 22 '20 at 10:50
1 Answer
> Each bucket contains items with the same hash value.

Wrong. Each bucket "contains" items with the same value of `hash(key) % current number of buckets`.
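A small sketch of that mapping using the container's own introspection (the keys and the final column are illustrative): on libstdc++ the reported bucket index matches `hash(key) % bucket_count()` because bucket counts are prime, whereas Visual C++ masks into a power-of-two bucket count, so the last column may differ there.

```cpp
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s{1, 42, 1000000};

    for (int key : {1, 42, 1000000}) {
        // bucket() reports the bucket the key would be found in;
        // bucket_count() is the current number of buckets.
        std::cout << "key " << key
                  << " -> bucket " << s.bucket(key)
                  << " of " << s.bucket_count()
                  << " (hash % buckets = "
                  << std::hash<int>{}(key) % s.bucket_count() << ")\n";
    }
}
```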
> Let's assume that our hash function is ideal and gives no collisions at all.

Not having collisions in pre-mod space isn't necessarily ideal: what matters is collisions after modding (or masking, if there's a power-of-2 bucket count, e.g. Visual C++) into the number of buckets.

And even then, having no collisions is not (normally) the design goal for a hash function used with a hash table. That's known as perfect hashing, and it's usually only practical when the keys are all known up front, pre-inserted, and fast lookup is then wanted for the rest of the application's lifetime. In more common scenarios, the aim in the general case is that - when inserting a new value - you're roughly as likely to collide with already-inserted values as you would be if you picked a bucket at random. That's because a high-quality hash function effectively produces a random-but-repeatable value. Hash functions work well because it's acceptable to have a certain level of collisions, which is statistically related to the current load factor (and not the `size()` of the container).
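To make the load-factor point concrete, here's a minimal sketch (not from the answer itself) showing that what governs expected collisions is elements per bucket, i.e. `load_factor() == size() / bucket_count()`, not the raw element count:

```cpp
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    for (int i = 0; i < 1000; ++i) s.insert(i);

    // load_factor() is size() / bucket_count(): the average number of
    // elements per bucket, which stays bounded by max_load_factor().
    std::cout << "size: " << s.size()
              << ", buckets: " << s.bucket_count()
              << ", load factor: " << s.load_factor() << '\n';
}
```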
> Hash values can vary from 0 to max(std::size_t). How can an unordered_set be organized without allocating a contiguous memory interval for the buckets while still providing O(1) lookup?

You typically have somewhere between 1 and ~2 buckets per value/element at the point where the table had its largest `size()`. You can customise this somewhat by calling `max_load_factor(float)` to determine when the table resizes, but you can't customise by how much it resizes - that's left to the implementation. GCC will usually pick a prime a bit larger than twice the current size; Visual C++ will usually double the number of buckets.
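A hedged way to watch that growth policy on whichever implementation you build with (the exact sequence of bucket counts is implementation-defined: primes roughly doubling on GCC, powers of two on Visual C++):

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

int main() {
    std::unordered_set<int> s;
    s.max_load_factor(1.0f);  // rehash once elements per bucket would exceed 1

    std::size_t buckets = s.bucket_count();
    for (int i = 0; i < 10000; ++i) {
        s.insert(i);
        if (s.bucket_count() != buckets) {  // a rehash just happened
            buckets = s.bucket_count();
            std::cout << "after " << s.size() << " inserts: "
                      << buckets << " buckets\n";
        }
    }
}
```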
> We cannot allocate a contiguous interval, because we would then be using a lot of memory for just a few hash values - 0 and 1000000, for example. The values in between are not used at all, yet we would have allocated memory for them.

This ignores the modding of the hash value into the bucket count. (It also ignores the viability of sparse arrays, which can be practical because virtual address space can be massively larger than the physical RAM backing it, but that's not the point of hash tables.)

- (I suggest you do some background reading on hash tables - perhaps [this answer](https://stackoverflow.com/a/30567466/410767) would be a good start) – Tony Delroy Sep 22 '20 at 17:00