
I'm trying to understand Unordered Maps and Hashing. As I understand it, an Unordered Map has a hash function inside it that takes a key of type T and returns an integer, which is then used as an index into an internal array. Each array slot holds a List of elements of type T, so that if there's already something in a slot, additions are inserted into that List.

Conceptually, would using a Set instead of a List improve efficiency?
(maybe somehow binary search, and a Set being ordered, helps over having a List.)

Or maybe a Vector instead of the List? (maybe random access helps over the List.)
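
For concreteness, here's a minimal sketch of the structure I'm describing (the name ToyHashMap and its layout are just my illustration, not how any real implementation is written): hash the key, take the result modulo the array size, and chain collisions in a std::list.

```cpp
#include <functional>
#include <iostream>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Toy separate-chaining hash map, for illustration only.
// Each slot of the internal array ("bucket") chains entries in a list.
template <typename K, typename V>
class ToyHashMap {
    std::vector<std::list<std::pair<K, V>>> buckets_;

    std::size_t index(const K& key) const {
        return std::hash<K>{}(key) % buckets_.size();  // hash -> array index
    }

public:
    ToyHashMap() : buckets_(16) {}  // fixed bucket count; a real map grows this

    void insert(const K& key, const V& value) {
        auto& bucket = buckets_[index(key)];
        for (auto& entry : bucket)              // linear scan of the chain
            if (entry.first == key) { entry.second = value; return; }
        bucket.push_back({key, value});         // collision: append to the list
    }

    const V* find(const K& key) const {
        for (const auto& entry : buckets_[index(key)])
            if (entry.first == key) return &entry.second;
        return nullptr;
    }
};

int main() {
    ToyHashMap<std::string, int> m;
    m.insert("apple", 1);
    m.insert("banana", 2);
    if (const int* v = m.find("banana")) std::cout << *v << '\n';  // prints 2
}
```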

  • The duplicate this was closed for doesn't address the question here, which is "would using a set [or vector] instead of a linked list [in each bucket] improve efficiency?". – Tony Delroy May 02 '21 at 17:20

2 Answers


The datatype should not matter much, because in most cases the container at the hashed index contains only zero or one element. If you regularly have many elements there, the hash map's performance degrades anyway. The remedy for that is to resize the internal array, which std::unordered_map<> does by itself. However, if you have a bad hash function that causes many hash collisions, switching the hash function is necessary for proper operation.
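
As a quick sketch of that automatic resizing (the exact bucket counts and growth points are implementation-defined, so your output will vary), you can watch bucket_count() jump as insertions push load_factor() up against max_load_factor():

```cpp
#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    std::cout << "max_load_factor: " << m.max_load_factor() << '\n';  // 1.0 by default

    std::size_t last = m.bucket_count();
    for (int i = 0; i < 10000; ++i) {
        m[i] = i;
        if (m.bucket_count() != last) {  // the map rehashed itself
            std::cout << "size " << m.size()
                      << " -> bucket_count " << m.bucket_count()
                      << " (load_factor " << m.load_factor() << ")\n";
            last = m.bucket_count();
        }
    }
}
```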

Ulrich Eckhardt

If there are often a lot of collisions at the same bucket, then using a set is more efficient than using a list, and indeed some Java hash table implementations have adopted this approach: since Java 8, HashMap converts long collision chains into red-black trees. Vectors can't be used for std::unordered_map or std::unordered_set implementations, as they reallocate to a different memory area when grown past their capacity, whilst the Standard requires that the elements in an unordered container are never moved by other operations on the container.
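
That stability guarantee is easy to observe. A minimal sketch (keys and element count chosen arbitrarily): hold a pointer to a mapped value, force several rehashes, and the pointer stays valid, which couldn't work if each bucket were a vector that reallocates.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<int, std::string> m;
    m[-1] = "hello";
    const std::string* p = &m[-1];       // pointer to an element's mapped value
    std::size_t before = m.bucket_count();

    for (int i = 0; i < 10000; ++i)      // forces several rehashes
        m[i] = "x";

    assert(m.bucket_count() != before);  // the bucket array grew and was replaced...
    std::cout << *p << '\n';             // ...but the element never moved: prints "hello"
}
```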

That said, the nature of hash tables is that, with a high-quality hash function, the statistical distribution of the number of elements colliding in any particular bucket depends only on the load factor. If you can't trust the hash function to keep collisions from getting out of control, perhaps you shouldn't be using that hash function.

Some details: Standard-library unordered containers have a default max_load_factor() of 1.0, where load_factor() is the ratio of size() to bucket_count(). With a strong, pseudo-randomizing hash function, bucket occupancy follows a Poisson distribution: 1/e ~= 36.8% of buckets are empty, as many hold one element, half that hold 2 elements (~18.4%), a third of that hold 3 (~6.13%), a quarter of that hold 4 (~1.53%), a fifth of that hold 5 (~0.31%), and a sixth of that hold 6 (~0.05%). As you can hopefully see, even in the worst-case scenario where the hash table is at its max load factor, it's incredibly rare to have to search through many elements, so a list-per-bucket approach is usually adequate.
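
Those percentages are just the Poisson distribution with mean equal to the load factor, P(k) = e^-lambda * lambda^k / k!. A small sketch (element count and seed arbitrary) that checks the measured bucket occupancy of a real unordered_set against that formula:

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <unordered_set>
#include <vector>

int main() {
    // Fill an unordered_set with 100,000 distinct random keys; the container
    // rehashes itself to keep load_factor() <= max_load_factor() (1.0).
    std::unordered_set<unsigned> s;
    std::mt19937 gen(12345);
    while (s.size() < 100000) s.insert(gen());

    // Count how many buckets hold 0, 1, 2, ... elements (8+ are vanishingly rare).
    std::vector<std::size_t> counts(8);
    for (std::size_t b = 0; b < s.bucket_count(); ++b) {
        std::size_t n = s.bucket_size(b);
        if (n < counts.size()) ++counts[n];
    }

    // Compare against Poisson(lambda = actual load factor): e^-l * l^k / k!.
    double lf = s.load_factor();
    for (std::size_t k = 0; k < counts.size(); ++k) {
        double poisson = std::exp(-lf) * std::pow(lf, k) / std::tgamma(k + 1.0);
        std::printf("%zu elements: measured %5.2f%%   Poisson %5.2f%%\n",
                    k, 100.0 * counts[k] / s.bucket_count(), 100.0 * poisson);
    }
}
```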

Tony Delroy