6

I've a little bit confusion how unordered_map works and what buckets are and how thet are managed.

From this blog post, unordered_map is vector of vectors.

My questions are:

  • is it correct to assume that the buckets are the "internal" vectors?
  • since each bucket (vector) can contains multiple elements, given by an hash collision on the hash table (the "external" vector), and since we have to scan this internal vector (in linear time), is it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?
  • what is the external vector (hash table) size by default?
  • what is the internal vector size by default?
  • what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?

Sorry for these questions, but I didn't find any detailed explanation how this structure works (on cppreference.com for example).

justHelloWorld
  • 6,478
  • 8
  • 58
  • 138
  • 2
    `std::unordered_map` is not necessarily a vector of vectors. It can also be implemented using chaining with linked lists (in which case there would be no "default size" for the internal vectors). Check out http://stackoverflow.com/questions/31112852/how-stdunordered-map-is-implemented Bad question, but helpful answers. – Aposhian Apr 17 '16 at 06:23
  • 2
    std::unordered_map is a fancy name for [Hash table](https://en.wikipedia.org/wiki/Hash_table). Since C++ standard doesn't specify implementation for standard containers, only interfaces and time-space complexities, it is impossible to give you an answer in a general case. The wikipedia article may shed some light on hash table implementations in general. –  Apr 17 '16 at 06:24
  • The answers to all your questions (except about the requirements on the key type)depend on the implementation you are using. But the answer to sone of tgem you can find out during runtime e.g. via `bucket_count`. – MikeMB Apr 17 '16 at 06:52

2 Answers2

6

std::unordered_map is the standard C++ hash table. It used to be called hash_map in STL, but missed the boat when many of STL's interfaces were merged into C++ in 1998, and by 2011, so many libraries had their own hash_map, C++ had to pick another name (I think "unordered" was a great choice; assuming order in a hash table is a common source of bugs).

is it correct to assume that the buckets are the "internal" vectors?

no, it is both incorrect (incompatible with iterator invalidation requirements) and dangerous (under that assumption you may end up subtracting pointers to elements in the same bucket).

In real life, the buckets are linked lists; e.g.

is it correct to assume that we have to define the equal method on the key type (in addiction to the hash operator) in order to find the key inside the bucket?

Yes, locating the key in a bucket is exactly what the 4th template parameter of std::unordered_map is for (does not need to call the "equal method on the key type" literally, of course)

what is the external vector (hash table) size by default?

There is no "external vector". The number of buckets for a default-constructed std::unordered_map is implementation-defined, you can query it with bucket_count.

what is the internal vector size by default?

There is no "internal vector". The size of any given bucket equals the number of elements currently placed in the bucket. You can query it with bucket_size

what happens if the number of elements in one bucket becomes too big?bor in other words, when the rehash happens?

Nothing happens if the number of elements in one bucket becomes too big. But if average number of elements per bucket (aka load_factor) exceeds max_load_factor, rehash happens (e.g. on insert)

Cubbi
  • 46,567
  • 13
  • 103
  • 169
  • Why are the buckets linked lists rather than (small) vectors? In general, vectors usually perform better than linked lists (because of the better cache locality) even when they appear to have worse big-O behaviour. – Martin Bonner supports Monica Apr 17 '16 at 16:27
  • 2
    Bother. It's because of iterator invalidation. Insertion only invalidates iterators to other elements if it causes a rehash ... whereas insertion would invalidate the iterators to the elements if the bucket if the bucket vector had to be reallocated. – Martin Bonner supports Monica Apr 17 '16 at 16:29
  • @martin: also, the buckets are typically very small. – rici Apr 17 '16 at 17:07
2

This may help you understand the buckets: http://www.cplusplus.com/reference/unordered_map/unordered_map/bucket_count/ http://www.cplusplus.com/reference/unordered_map/unordered_map/max_load_factor/

But generally, yes, the buckets are something like internal vectors. It needs an equality operator (or a predicate) to distinguish keys that had the same hash as you suggest.

The initial number of buckets is possibly 0. It can be set via rehash() or reserve() (they have slightly different semantics.)

http://www.cplusplus.com/reference/unordered_map/unordered_map/rehash/

In an ideal case, each bucket will only have the one item. You can check this by using bucket_size. When the load factor (total items vs. bucket count) gets high, it rehashes automatically.

By default, it'll aim for a 1:1 load factor. If the hash function is good, this may last until max_bucket_count items are inserted.

Keep in mind that the specific implementation of this may vary. Each implementation (e.g. from different platforms or standard libraries) really only needs to have the correct semantics.

If these answers are important to your program, you can query the values as I've described. If you're just trying to wrap your head around it, query them in some test scenarios and it may become more clear.

Todd Christensen
  • 1,297
  • 8
  • 11