Are the individual elements part of a node, which together are connected as a list, or an array of consecutive elements?
(see also https://stackoverflow.com/a/31113618/410767 for an explanation of how the C++ Standard effectively mandates separate chaining)
The individual elements (i.e. key/value pairs for unordered_map, values for unordered_set) are indeed packed into nodes. In GCC's implementation, for example, each node adds a next pointer/iterator and a cached copy of the key's hash; caching the hash is a recalculation-speed vs memory-usage tradeoff, and not all implementations need do it. GCC uses a singly-linked list for all elements in the container, so there is no prev pointer/iterator, though other implementations might have one. Without prev, each bucket needs to point to the node before the first node holding that bucket's contents, so it can rewire the list when erasing the last element from its bucket.
Are all buckets allocated side-by-side in memory? If not, what is the logic towards iterating over multiple buckets?
By definition, hash-table-based containers always have a contiguous, random-access array of buckets. But iteration need not happen "over multiple buckets": in GCC's case, as mentioned above, there is a singly-linked list off to the side that can be iterated directly, independently of any buckets. A major advantage of this is that if your container has a large capacity but a small size, you can still iterate in O(size()) rather than O(bucket_count()) - I'm not sure the latter would even be Standard compliant (?).
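As a quick sketch of that point (the exact bucket counts printed are implementation-specific), you can give a container far more buckets than elements and confirm that iteration still only visits the stored elements:

```cpp
#include <iostream>
#include <unordered_set>

// Element iteration is independent of the bucket array: even with far more
// buckets than elements, begin()..end() visits only the size() stored
// elements (O(size()) in libstdc++, which chains them on one list).
int main() {
    std::unordered_set<int> s;
    s.rehash(1'000'000);              // request a large bucket array up front
    for (int i = 0; i < 5; ++i) s.insert(i * 7);

    std::cout << "bucket_count: " << s.bucket_count()
              << ", size: " << s.size() << '\n';

    for (int x : s)                   // visits 5 elements, not 1,000,000 buckets
        std::cout << x << ' ';
    std::cout << '\n';
}
```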
How are collisions handled in unordered_set and unordered_map?
Using separate chaining (i.e. by inserting extra nodes into the linked list).
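To make the chaining visible, here's a small illustration using a deliberately awful hash functor (AlwaysZeroHash is made up for the demo) that forces every element into the same bucket:

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_set>

// A deliberately terrible hash that maps every int to the same value,
// forcing all elements to collide and be chained in one bucket.
struct AlwaysZeroHash {
    std::size_t operator()(int) const { return 0; }
};

int main() {
    std::unordered_set<int, AlwaysZeroHash> s{1, 2, 3, 4, 5};

    // All five elements share bucket 0; lookups in that bucket
    // degenerate to a linear walk along the chain.
    std::cout << "elements in bucket 0: " << s.bucket_size(0) << '\n';

    // Despite the collisions, every element is still found correctly.
    std::cout << "contains 3? " << s.count(3) << '\n';
}
```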
Or, do they not and let the user define the load factor, and based on that they create a completely new set of buckets?
The load factor is only tangentially related to this. In the general case, where the number of distinct key values is much larger than the number of elements being stored and you have no particular insight into which keys you'll receive, even with a cryptographically strong hash function you need on the order of K^2 buckets to have even a 50/50 chance of placing K keys into the table without any of them colliding. It's just not practical to grow the hash table enough to avoid collisions. In C++, the load factor is capped by a user-settable max_load_factor (default 1.0), so the implementation grows the bucket array when the load factor would otherwise be exceeded. But the load factor has only a probabilistic relationship with collisions.
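If you want to watch the load-factor machinery in action, here's a small sketch (the exact growth sequence printed is whatever your standard library chooses; only the max_load_factor()/rehash behaviour is mandated):

```cpp
#include <cstddef>
#include <iostream>
#include <unordered_map>

// The table rehashes (grows its bucket array) whenever
// size() / bucket_count() would exceed max_load_factor().
int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';  // 1.0 by default

    std::size_t last_buckets = m.bucket_count();
    for (int i = 0; i < 1000; ++i) {
        m[i] = i;
        if (m.bucket_count() != last_buckets) {   // a rehash just happened
            std::cout << "rehash at size " << m.size()
                      << " -> " << m.bucket_count() << " buckets"
                      << " (load factor " << m.load_factor() << ")\n";
            last_buckets = m.bucket_count();
        }
    }
}
```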
In case you're curious: if you have K keys and B buckets, the probability of not having a collision is the product of the probabilities of being able to place each successive key into an at-that-time empty bucket, which depends on the proportion of buckets still empty: B/B * (B-1)/B * (B-2)/B * (B-3)/B * ... * (B-K+1)/B. As a few examples, the chance of not having collisions is >= 50% for 2 keys in 2 buckets, 3 keys in 6 buckets, 4 in 10, 5 in 17, 6 in 24, 7 in 33, 8 in 43, 9 in 55, 10 in 69, 11 in 83, 12 in 100, 13 in 117, 14 in 136, 15 in 157, 16 in 179, 17 in 202, 18 in 227, 19 in 253, 20 in 281, 21 in 310, 22 in 341, 23 in 373, 24 in 407, 25 in 442, 26 in 478, 27 in 516, 28 in 555, 29 in 596, 30 in 638, 31 in 682, 32 in 727, 33 in 773, 34 in 821, 35 in 870, 36 in 921, 37 in 974, 38 in 1027, 39 in 1082, 40 in 1139, 41 in 1197, 42 in 1257, 43 in 1317, 44 in 1380, 45 in 1444, 46 in 1509, 47 in 1576, 48 in 1644, 49 in 1713, 50 in 1784, 51 in 1857, 52 in 1931, 53 in 2006, 54 in 2083, 55 in 2161, 56 in 2241, 57 in 2322, 58 in 2405, 59 in 2489, 60 in 2574, 61 in 2661, 62 in 2749, 63 in 2839, 64 in 2930, 65 in 3023, 66 in 3117, 67 in 3213, 68 in 3310, 69 in 3408, 70 in 3508, 71 in 3609, 72 in 3712, 73 in 3816, 74 in 3922, 75 in 4029, 76 in 4137, 77 in 4247, 78 in 4359, 79 in 4472, 80 in 4586, 81 in 4702, 82 in 4819, 83 in 4938, 84 in 5058, 85 in 5179, 86 in 5302, 87 in 5427, 88 in 5552, 89 in 5680, 90 in 5808, 91 in 5939, 92 in 6070, 93 in 6203, 94 in 6338, 95 in 6474, 96 in 6611, 97 in 6750, 98 in 6890, 99 in 7032, 100 in 7175....
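If you'd like to reproduce those figures yourself, here's a small brute-force sketch of that calculation (it should print, e.g., 7175 buckets for 100 keys, matching the list above):

```cpp
#include <cstdint>
#include <iostream>

// For each K, find the smallest bucket count B such that K uniformly hashed
// keys all land in distinct buckets with probability >= 50%.
int main() {
    std::uint64_t b = 1;                      // minimum B is non-decreasing in K, so carry it forward
    for (int k = 2; k <= 100; ++k) {
        for (;; ++b) {
            double p = 1.0;
            for (int i = 0; i < k; ++i)
                p *= static_cast<double>(b - i) / b;   // (B-i)/B for i = 0..K-1
            if (p >= 0.5) break;
        }
        std::cout << k << " keys -> " << b << " buckets\n";
    }
}
```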
Put another way, for 100 keys we'd need to have 70.75 empty buckets for every 1 in-use bucket just to have a >=50% chance of not having collisions, which would waste a lot of memory and ensure the in-use buckets were spread out across CPU cache lines, making bucket access (and general CPU operation) much slower. Trying to avoid collisions, or rehashing every time they occur, is simply not a net-desirable thing for the general hash-table usage case. That said, perfect hashing (where the keys hash to distinct buckets) is occasionally both practical and desirable for small sets of keys known at compile time, or when there are a very small number of possible keys.