
I want to know how unordered_set and unordered_map are implemented in the C++ STL.

To be more precise:

  1. Are individual elements part of nodes that are linked together as a list, or are they stored in an array of consecutive elements?

  2. Are all buckets allocated side by side in memory? If not, what is the logic for iterating over multiple buckets?

  3. How are collisions handled in unordered_set and unordered_map? Or do they not handle them, instead letting the user define the load factor and, based on that, creating a completely new set of buckets?

Invictus
  • C++ as a language does not specify implementations. There are multiple open-source implementations that you can compare – Caleth Jun 20 '23 at 21:45
  • @Caleth When I write #include on a RHEL box and compile the code on a machine using gcc, what kind of unordered_set implementation is that? – Invictus Jun 20 '23 at 21:47
  • [What you see here](https://en.cppreference.com/w/cpp/container/unordered_set) is the public interface. Anything beyond that, look at the source code for unordered_set/unordered_map for the implementation you are using. – PaulMcKenzie Jun 20 '23 at 21:53
  • Voting to reopen. Looks focused enough to me, and the answer ended up reasonably short. – HolyBlackCat Jun 20 '23 at 22:53
  • What is normally important is the public interface, and a description of its behavior. E.g. time complexity. From cppreference: *std::unordered_map is an associative container that contains key-value pairs with unique keys. Search, insertion, and removal of elements have average constant-time complexity.* So why do you want to know about the internal implementation? Do you have an actual performance issue? – Pepijn Kramer Jun 21 '23 at 05:08
  • @PepijnKramer it is more out of curiosity; I saw Google's flat_hash_set somewhere and how it was better than the standard set in the STL, so I wanted to know more about the STL implementation. – Invictus Jun 21 '23 at 16:29
  • In the cases where I'm curious about things like this, examining the source code is usually the most informative. – Pepijn Kramer Jun 21 '23 at 16:39
  • The great thing about template classes is that the source is *always* available. – Mark Ransom Jun 22 '23 at 02:39

2 Answers


Based on its interface, the expectation for unordered_(set|map) was something at least reasonably close to a hash table with direct chaining.

That is, you'd start with an array of pointers. Each of those pointers is to a bucket, which is typically going to be a linked list of nodes.

We can deduce most of this from the complexity requirements. In particular, the fact that it degenerates from O(1) to O(N) as the table gets overfilled indicates that under normal circumstances (load factor < 1) the complexity is based on hashing the key to find the right bucket. To support that, indexing into the array of buckets has to be an O(1) operation.

As the table gets overfilled, complexity degenerates to O(N). This reflects the complexity of traversing the linked list of nodes that hashed to the same bucket.
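
To make that picture concrete, here's a minimal sketch of such a layout (illustrative names only, not any particular library's code): hashing and indexing find the bucket in O(1), then a walk down that bucket's chain does the actual comparisons.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical sketch of a chaining hash table: an array of bucket head
// pointers, each bucket a singly linked list of nodes.
template <typename Key>
struct ChainedSet {
    struct Node {
        Key   key;
        Node* next;   // next node whose key hashed to the same bucket
    };

    std::vector<Node*> buckets;  // one head pointer per bucket
    std::hash<Key>     hasher;

    explicit ChainedSet(std::size_t n = 8) : buckets(n, nullptr) {}

    Node* find(const Key& k) const {
        std::size_t b = hasher(k) % buckets.size();             // O(1): hash, then index
        for (Node* n = buckets[b]; n != nullptr; n = n->next)   // O(chain length) scan
            if (n->key == k) return n;
        return nullptr;
    }
};
```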

Pointer invalidation rules are revealing as well. For example, once we've inserted an item into the table, a pointer to that item remains valid (and continues to point to that node) even if we do things like inserting or deleting other nodes. As such, we can rearrange the table as a whole by having a different arrangement of pointers to the nodes, but the node itself has to "stay put". This means we can't (for example) implement a bucket as something like a std::vector<T>--if we did so, when that vector got reallocated, existing pointers would be invalidated.
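
You can see that guarantee directly with the standard containers, for example:

```cpp
#include <cassert>
#include <unordered_set>

int main() {
    std::unordered_set<int> s{1, 2, 3};
    const int* p = &*s.find(2);     // pointer to the stored element

    s.rehash(1024);                 // force a rehash into many more buckets
    for (int i = 10; i < 500; ++i)  // plenty of further insertions
        s.insert(i);

    assert(p == &*s.find(2));       // the node holding 2 never moved
}
```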

You might be able to deviate from this a little bit if you tried really hard, but doing so and continuing to meet all the requirements becomes quite difficult (as I originally wrote this, I speculated on a couple of minor variations maybe being possible, but after comments and more thought, I doubt either really meets the requirements).

Jerry Coffin
  • Using something like a deque for the buckets would seem to preclude rehashing and increasing the number of buckets when the table gets overfull. This would seem to conflict with the requirement that the load factor be kept under a bound as the table grows as long as the keys have distinct hash values. – Chris Dodd Jun 20 '23 at 22:37
  • I've edited. I was trying to make the point that although other implementations might just barely be possible, with even seemingly minor changes, it becomes difficult--but it started to sound more like I was saying kind of the opposite of the point I was trying to make--that based on the interface, we really do have a pretty solid idea of how it just about has to be implemented (which the comments reinforce). – Jerry Coffin Jun 20 '23 at 22:42
  • Thanks @JerryCoffin for explanation. – Invictus Jun 21 '23 at 16:40

Are individual elements part of nodes that are linked together as a list, or are they stored in an array of consecutive elements?

(see also https://stackoverflow.com/a/31113618/410767 for an explanation of how the C++ Standard effectively mandates separate chaining)

The individual elements (i.e. key-value pairs for unordered_map, values for unordered_set) are indeed packed into nodes. In GCC's implementation, for example, each node adds a next pointer/iterator and a record of the key's hash, though the latter is a recalculation-speed vs. memory-usage tradeoff and not something all implementations need to do. GCC uses a singly-linked list for all elements in the container, so there is no prev pointer/iterator, but other implementations might have one. Without prev, each bucket needs to point at the node before the first node holding its contents, so it can rewire the list if erasing the last element from its bucket.
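
As a rough sketch of that layout (the names here are invented for illustration; this is not libstdc++'s actual code), it might look something like:

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: one singly-linked list threads through every element;
// each bucket records the node *before* its first element so that erasing or
// inserting within the bucket can relink the single list.
template <typename Value>
struct NodeBasedTable {
    struct NodeBase {
        NodeBase* next = nullptr;         // link only, no payload
    };
    struct Node : NodeBase {
        Value       value;
        std::size_t cached_hash;          // optional cache of the key's hash
    };

    NodeBase               before_begin;  // sentinel preceding the first element;
                                          // iteration just follows next from here
    std::vector<NodeBase*> buckets;       // each entry: the node *before* that
                                          // bucket's first element, or nullptr
};
```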

Are all buckets allocated side by side in memory? If not, what is the logic for iterating over multiple buckets?

By definition, hash-table-based containers always have a contiguous, random-access array of buckets. But iteration need not happen "over multiple buckets" - in GCC's case, as mentioned above, there's a singly-linked list off to the side that can be iterated directly, independently of any buckets. A major advantage of this is that if your container has a large capacity but small size, you can still iterate in O(size()) rather than O(bucket_count()) - I'm not sure the latter would be Standard compliant (?).

How are collisions handled in unordered_set and unordered_map?

Using separate chaining (i.e. by inserting extra nodes into the linked list).
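
You can watch this with the standard containers by deliberately forcing every key into the same bucket (the constant hash below exists purely to force collisions):

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

// Degenerate hash: every key lands in the same bucket, guaranteeing collisions.
struct CollideAll {
    std::size_t operator()(int) const { return 0; }
};

int main() {
    std::unordered_set<int, CollideAll> s{1, 2, 3, 4};
    std::size_t b = s.bucket(1);
    // All four elements coexist in one bucket's chain; equality, not the hash,
    // decides uniqueness.
    std::cout << "bucket " << b << " holds " << s.bucket_size(b)
              << " of " << s.size() << " elements\n";
}
```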

Or do they not handle them, instead letting the user define the load factor and, based on that, creating a completely new set of buckets?

The load factor is only tangentially related to this. In the general case, where the number of distinct key values is much larger than the number of elements being stored and you have no particular insight into which keys you'll receive, even with a cryptographically strong hash function you need on the order of K^2 buckets to have even a 50/50 chance of placing K keys into the table without any of them colliding. It's just not practical to grow the hash table to avoid collisions. In C++, the load factor is capped by a user-settable max_load_factor (default 1.0), so the implementation grows the table when the load factor would otherwise be exceeded. But the load factor only has a probabilistic relationship with collisions.
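
For example, you can watch a standard container grow its bucket array whenever another insertion would push the load factor past max_load_factor:

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';  // 1.0 by default

    std::size_t buckets = m.bucket_count();
    for (int i = 0; i < 1000; ++i) {
        m[i] = i;
        if (m.bucket_count() != buckets) {   // a rehash happened: more buckets
            buckets = m.bucket_count();
            std::cout << m.size() << " elements -> " << buckets
                      << " buckets (load factor " << m.load_factor() << ")\n";
        }
    }
}
```

The exact growth pattern you see is implementation-defined; only the bound on the load factor is required.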

In case you're curious: if you hash K keys into B buckets, the probability of not having a collision is the product of the probabilities of being able to place each successive key into an at-that-time empty bucket, which depends on the proportion of buckets still empty: B/B * (B-1)/B * (B-2)/B * (B-3)/B * ... * (B-K+1)/B. As a few examples, the chance of not having collisions is >= 50% for 2 keys in 2 buckets, 3 keys in 6 buckets, 4 in 10, 5 in 17, 6 in 24, 7 in 33, 8 in 43, 9 in 55, 10 in 69, 11 in 83, 12 in 100, 13 in 117, 14 in 136, 15 in 157, 16 in 179, 17 in 202, 18 in 227, 19 in 253, 20 in 281, 21 in 310, 22 in 341, 23 in 373, 24 in 407, 25 in 442, 26 in 478, 27 in 516, 28 in 555, 29 in 596, 30 in 638, 31 in 682, 32 in 727, 33 in 773, 34 in 821, 35 in 870, 36 in 921, 37 in 974, 38 in 1027, 39 in 1082, 40 in 1139, 41 in 1197, 42 in 1257, 43 in 1317, 44 in 1380, 45 in 1444, 46 in 1509, 47 in 1576, 48 in 1644, 49 in 1713, 50 in 1784, 51 in 1857, 52 in 1931, 53 in 2006, 54 in 2083, 55 in 2161, 56 in 2241, 57 in 2322, 58 in 2405, 59 in 2489, 60 in 2574, 61 in 2661, 62 in 2749, 63 in 2839, 64 in 2930, 65 in 3023, 66 in 3117, 67 in 3213, 68 in 3310, 69 in 3408, 70 in 3508, 71 in 3609, 72 in 3712, 73 in 3816, 74 in 3922, 75 in 4029, 76 in 4137, 77 in 4247, 78 in 4359, 79 in 4472, 80 in 4586, 81 in 4702, 82 in 4819, 83 in 4938, 84 in 5058, 85 in 5179, 86 in 5302, 87 in 5427, 88 in 5552, 89 in 5680, 90 in 5808, 91 in 5939, 92 in 6070, 93 in 6203, 94 in 6338, 95 in 6474, 96 in 6611, 97 in 6750, 98 in 6890, 99 in 7032, 100 in 7175....
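
If you'd like to check those figures yourself, a short program along these lines (helper names made up here) reproduces them:

```cpp
#include <cstddef>
#include <iostream>

// Probability that K keys land in K distinct buckets out of B,
// assuming the hash spreads keys uniformly at random.
double p_no_collision(std::size_t k, std::size_t b) {
    double p = 1.0;
    for (std::size_t i = 0; i < k; ++i)
        p *= double(b - i) / b;   // i buckets already occupied when placing key i+1
    return p;
}

// Smallest bucket count giving at least a 50% chance of no collision.
std::size_t min_buckets(std::size_t k) {
    std::size_t b = k;
    while (p_no_collision(k, b) < 0.5) ++b;
    return b;
}

int main() {
    for (std::size_t k : {10, 20, 50, 100})
        std::cout << k << " keys: " << min_buckets(k) << " buckets\n";
}
```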

Put another way, for 100 keys we'd need to have 70.75 empty buckets for every 1 in-use bucket just to have a >=50% chance of not having collisions, which would waste a lot of memory and ensure the in-use buckets were spread out across CPU cache lines, making bucket access (and general CPU operation) much slower. Trying to avoid collisions, or rehashing every time they occur, is simply not a net-desirable thing for the general hash-table usage case. That said, perfect hashing (where the keys hash to distinct buckets) is occasionally both practical and desirable for small sets of keys known at compile time, or when there are a very small number of possible keys.

Tony Delroy