3

I'm trying to understand hashing in the context of std::unordered_set. Cppreference gives this explanation:

Internally, the elements are not sorted in any particular order, but organized into buckets. Which bucket an element is placed into depends entirely on the hash of its value. This allows fast access to individual elements, since once a hash is computed, it refers to the exact bucket the element is placed into.

My problem is that I cannot understand what it looks like in memory. Perhaps this question should address hash tables in general, but this may have too much dependence on implementation as all the explanations of hash tables I managed to find were abstract.

As far as I can guess, buckets are dynamically allocated contiguous pieces of memory, something like arrays, but they are not necessarily contiguous relative to each other, whereas hashes are something like pointers to these arrays stored contiguously.

Am I correct on this?

The Quantum Physicist
  • 24,987
  • 19
  • 103
  • 189
Kaiyakha
  • 1,463
  • 1
  • 6
  • 19
  • 2
    Have you tried to implement a hash-table yourself? I highly recommend you find a tutorial and try it. That should give you a very good insight how it works and what "it looks like in memory". As one of the basic data-structures it should really be part of any decent computer science course. – Some programmer dude Sep 01 '22 at 11:29
  • 1
    It can be a vector. It can be a dynamically allocated array. It can even be a binary tree map. There are trade-offs for each kind. I don't know if the standard specifies something specific or the standard library devs are free to do what they please. – The Quantum Physicist Sep 01 '22 at 11:31
  • Check this diagram - https://upload.wikimedia.org/wikipedia/commons/7/7d/Hash_table_3_1_1_0_1_0_0_SP.svg And [read the following manual about hash table with buckets](https://en.wikipedia.org/wiki/Hash_table) – Victor Gubin Sep 01 '22 at 11:31
  • 1
    I tried to explain how hash tables work more visually in [this old answer](https://stackoverflow.com/a/30567466/410767). – Tony Delroy Sep 01 '22 at 12:51
  • 3
    "buckets are dynamically allocated contiguous pieces of memory, something like arrays, but they are not necessarily contiguous relative to each other, whereas hashes are something like pointers to these arrays stored contiguously. Am I correct" - no. Over simplifying a little, think of the separate chaining implementations of hash tables - like `std::unordered_set` and ...`map` - as `std::vector> buckets_;`. You have a contiguous array of "buckets". "Hashes" are not pointers - they're numbers that are %-ed into the bucket count to create bucket-array indices. – Tony Delroy Sep 01 '22 at 12:55
  • The exact layout in memory also depends on the allocator – MarkB Sep 02 '22 at 00:28

1 Answers1

0

The standard implementations of std::unordered_set / std::unordered_map use a separate chaining hash table.

This means that the hash collisions are solved chaining together the collided items through a linked list, that can be traversed to access the item with a unique search key. The internal organization of memory depends on several factors, including the implementation choices of the individual library, but it is essentially a dynamic-size array of pointers, that always point to the first element of its linked list, if one is present.

This is a common and simple method, but that often leads to a rapid performance degradation, especially with high overloads (load factors >0.8) and that requires more memory usage to store the information necessary to keep track of the nodes of the linked lists.

If you are interested in delving further into its data structure, although it is quite a bit of a read, you can read about the GCC implementation of the hash table used by the std::unordered_set / std::unordered_map classes as the underlying container, i.e. _Hashtable. The header file is hashtable.h.

LoS
  • 448
  • 1
  • 3
  • 15