74

How does C++ STL unordered_map resolve collisions?

Looking at http://www.cplusplus.com/reference/unordered_map/unordered_map/, it says: "Unique keys: No two elements in the container can have equivalent keys."

That should mean that the container is indeed resolving collisions. However, that page does not tell me how it does it. I know some ways to resolve collisions, like using linked lists and/or probing. What I want to know is how the C++ STL unordered_map resolves them.

whiteSkar
  • 1,614
  • 2
  • 17
  • 30
  • 2
    It's implementation dependent. A language reference won't tell you how it's done, because it is not specified in the standard how it should be done. – Benjamin Lindley Feb 03 '14 at 02:04
  • 2
    libstdc++ uses linear chaining, but other implementations of the STL may use other techniques – Brian Bi Feb 03 '14 at 02:09
  • @BrianBi: So does the current libc++. I wonder if the standard even permits open addressing... – Kerrek SB Feb 03 '14 at 02:13
  • We are using different implementations of the STL? I thought the STL is "the" library every C++ program uses when using the STL. – whiteSkar Feb 03 '14 at 02:16
  • 3
    Technically, we're not using the STL at all. We're using the C++ Standard Library, which is an interface specification. And yes, it has multiple implementations. – Benjamin Lindley Feb 03 '14 at 02:18
  • for what it's worth, even the hashing algorithm or the implementation of `std::hash` is implementation-defined; if you are trying to get a specific answer to this, there is very little that can be said about it according to the standard. – user2485710 Feb 03 '14 at 02:18
  • Ah I get it. So if libstdc++ is an implementation of the STL and the STL says "No two elements in the container can have equivalent keys.", how come it uses linear chaining to resolve collisions? Doesn't linear chaining put the elements with the same hash key in the same hash key place? – whiteSkar Feb 03 '14 at 02:26
  • 2
    No, the STL was a library decades ago which had ideas stolen and incorporated into the C++ standard library. Each compiler of C++ must implement the standard library through whatever means it chooses. People tend to call the parts of the standard library inspired by the STL the STL, albeit incorrectly. – Yakk - Adam Nevraumont Feb 03 '14 at 03:45
  • 6
    "No two elements in the container can have equivalent keys" is a condition on `Pred` parameter of `unordered_map`. The notion of collisions applies to `Hash` parameter. Two keys may not be equivalent but may still hash to the same value - the very definition of hash collision. – Igor Tandetnik Feb 03 '14 at 04:02
  • @Igor Ah right, I got confused with the key and the hash value. Thanks for clearing it up! – whiteSkar Feb 03 '14 at 05:06
  • The question uses the word Keys, but it looks like you're really asking about Hash Codes, which is what the answer below describes. Please clarify the question text. – David Gardner Jan 15 '15 at 21:04
  • @whiteSkar Hash functions are recommended to make collisions unlikely, to avoid different elements ending up in the same bucket, because that implies a longer linear search later. A hash collision between different keys won't make an insertion fail with "an element already exists", for instance. Besides, two different keys with different hashes can still collide on the same bucket if both hashes are congruent modulo `bucket_count` (e.g. both are multiples of it). That's why some implementations (like *gcc*) use a prime number as `bucket_count`, to decrease the probability of the modular arithmetic causing a collision. – ABu Sep 06 '19 at 21:38
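
A minimal sketch of the point made in Igor Tandetnik's comment above (editorial addition, not from the original thread; `BadHash` is a made-up, deliberately terrible hash used only for illustration): distinct keys that collide are all kept, while an equivalent key only overwrites the mapped value.

#include <cassert>
#include <string>
#include <unordered_map>

// Deliberately poor hash: every key gets hash code 0, so all keys collide,
// but they remain distinct as far as the Pred parameter (operator==) is concerned.
struct BadHash {
    std::size_t operator()(const std::string&) const { return 0; }
};

int main() {
    std::unordered_map<std::string, int, BadHash> m;
    m["alpha"] = 1;
    m["beta"]  = 2;   // hash collision with "alpha", but a different key:
                      // both elements are kept, chained in the same bucket
    assert(m.size() == 2);
    assert(m.bucket("alpha") == m.bucket("beta"));

    m["alpha"] = 3;   // equivalent key: the mapped value is overwritten
    assert(m.size() == 2 && m["alpha"] == 3);
}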

2 Answers

92

The standard defines a little more about this than most people seem to realize.

Specifically, the standard requires (§23.2.5/9):

The elements of an unordered associative container are organized into buckets. Keys with the same hash code appear in the same bucket.

The interface includes a bucket_count that runs in constant time (Table 103). It also includes a bucket_size that has to run in time linear on the size of the bucket.

That's basically describing an implementation that uses collision chaining. When you do use collision chaining, meeting all the requirements is somewhere between easy and trivial. bucket_count() is the number of slots in your array. bucket_size() is the number of elements in the collision chain for a given bucket. Getting them in constant and linear time respectively is simple and straightforward.
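
As a rough standalone sketch (an editorial addition, not part of the original answer; the bucket count and distribution are implementation-defined), the bucket interface can be observed directly, and with a chaining implementation each of these observers is cheap to provide:

#include <iostream>
#include <string>
#include <unordered_map>

int main() {
    std::unordered_map<std::string, int> m{
        {"alpha", 1}, {"beta", 2}, {"gamma", 3}
    };

    // bucket_count(): the number of slots in the underlying array -- constant time.
    std::cout << "buckets: " << m.bucket_count() << '\n';

    for (std::size_t b = 0; b < m.bucket_count(); ++b) {
        // bucket_size(b): length of the chain in bucket b -- linear in that bucket's size.
        if (m.bucket_size(b) == 0)
            continue;
        std::cout << "bucket " << b << ':';

        // Local iterators walk a single bucket's chain.
        for (auto it = m.begin(b); it != m.end(b); ++it)
            std::cout << ' ' << it->first;
        std::cout << '\n';
    }
}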

By contrast, if you use something like linear probing or double hashing, those requirements become all but impossible to meet. Specifically, all the items that hashed to a specific value need to land in the same bucket, and you need to be able to count the items in such a bucket in time proportional to the bucket's size.

But, if you use something like linear probing or double hashing, finding all the items that hashed to the same value means you need to hash the value, then walk through the "chain" of non-empty items in your table to find how many of those hashed to the same value. That's not linear on the number of items that hashed to the same value though--it's linear on the number of items that hashed to the same or a colliding value.
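
As an aside, the "same value" versus "colliding value" distinction can be made concrete. The sketch below is an editorial addition that assumes (as is true of common implementations, though not required by the standard) that std::hash<int> is the identity and that the bucket index is the hash code reduced modulo bucket_count(): two keys whose hash codes differ can still land in the same bucket.

#include <iostream>
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    m.rehash(8);   // ask for at least 8 buckets; the exact count is implementation-defined
    const std::size_t n = m.bucket_count();

    const int a = 1;
    const int b = 1 + static_cast<int>(n);   // different key, different hash code
                                             // (assuming identity hashing of ints)...
    std::cout << std::hash<int>{}(a) << " vs " << std::hash<int>{}(b) << '\n';

    // ...yet both reduce to the same bucket index: a "colliding value".
    std::cout << m.bucket(a) << " == " << m.bucket(b) << '\n';
}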

With enough extra work and a fair amount of stretching the meaning of some of the requirements almost to the breaking point, it might be barely possible to create a hash table using something other than collision chaining, and still at least sort of meet the requirements--but I'm not really certain it's possible, and it would certainly involve quite a lot of extra work.

Summary: all practical implementations of std::unordered_set (or unordered_map) undoubtedly use collision chaining. While it might be (just barely) possible to meet the requirements using linear probing or double hashing, such an implementation seems to lose a great deal and gain nearly nothing in return.

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • "It's still possible to meet the requirement [using linear probing or double hashing]" / "Then you walk through the "chain" of occupied items to find a slot where you can insert this item." isn't compatible with the "Keys with the same hash code appear in the same bucket." requirement. Of course, it's debatable if collision chaining puts anything "in the same bucket" anyway, but you're *clearly* proposing putting keys with the same hash in different buckets. 23.2.5/9 seems an unfortunate restriction... probing is useful for more than "load factor low" - many benefits from avoiding heap.... – Tony Delroy Feb 03 '14 at 04:46
  • 1
    @TonyD: not really--in this case, a "bucket" is no longer a physical thing, but simply all the slots that hashed to the same value. You'd have to build enough intelligence into local_iterator to skip across any colliding items, but it still looks entirely possible to me. – Jerry Coffin Feb 03 '14 at 04:55
  • If it's that flexible, "Keys with the same hash code appear in the same bucket." has no meaning at all. Agree to disagree etc.. Cheers. – Tony Delroy Feb 03 '14 at 05:21
  • @TonyD: Not really. A bucket is the set of items with keys that hashed to a particular value, and which can be iterated with a local_iterator. Given that they're specifically avoiding specifying an implementation, it really *shouldn't* mean much about the implementation. – Jerry Coffin Feb 03 '14 at 05:24
  • Well, finding a mutually-agreed authoritative definition of "bucket" from e.g. Dijkstra isn't easy with Google, so let's just consider whether your postulated implementation can meet other requirements. Per your rationale for needing `count`, `b.find(k)` can't be worst-case O(`b.size()`) where `b` is a "logical" bucket `size()`. Further, `b.begin(n)`, `b.end(n)` et al are required to have constant complexity, so would need to be stored alongside your `count` at the originally hashed-to bucket. – Tony Delroy Feb 03 '14 at 06:30
  • @JerryCoffin great explanation thank you - I'm curious though what you mean by being "linear on the number of items that hashed to the same or a colliding value". Colliding values, by definition, are ones that hash to the same value, right? So what does "or a colliding value" mean? – rb612 Apr 17 '18 at 01:17
  • @rb612: Hashing is usually done in two pieces. You start by computing some large, fixed-sized value (e.g., 64 bits). Then you reduce that to the range of indices for your hash table. As I was using the terms, "hashing to the same value" would mean the original hashes were equal, and colliding referred to the reduced range one being equal. – Jerry Coffin Oct 07 '18 at 17:45
  • I perceive another heavy implication about implementation: Chaining seems inextricably related (by cause or consequence, I don't know; I haven't read any original rationale) with the Standard's guarantee that references stay valid forever, doesn't it? By my understanding, with open addressing, erasing would mean keeping around 'dummy' elements to avoid shifting others (rather than only optionally optimising through use of this like some open addressing maps), and insertion is precluded by not being able to put elements at the right place (at least without rebuilding or over-reserving buckets). – underscore_d Jan 02 '19 at 18:32
  • @underscore_d: Yeah, that would get...ugly as well. Haven't thought through carefully enough to be sure it's truly impossible, but even at best it would be ugly. – Jerry Coffin Jan 02 '19 at 21:14
  • Yeah, it might be possible to guarantee reference validity under open addressing, for highly specific contents and/or spans of time. However, it all seems highly contrary to ever being a general-purpose container, so it naturally follows that we must choose one or the other. I see someone else came to the same conclusion as me - and doesn't like it! - [here](https://stackoverflow.com/a/23924475/2757035), although I'm not so critical given that my current use for `std::unordered_map` uses references to avoid a 2nd round of lookups. Then again, if lookup were faster, I wouldn't need that... haha – underscore_d Jan 02 '19 at 22:09
  • Getting back to how/why the Standard does what it does, [this answer](https://stackoverflow.com/a/31113618/2757035) quotes the original 2003 proposal, which did consider and ultimately discount open addressing due to technical complexities and a lack of proven implementation experience. I guess things might be different today, at least judging by the number and benchmarks of open addressing implementations I've been reading about. – underscore_d Jan 02 '19 at 22:22
5

I found this answer while looking for how to detect when my types are colliding, so I will post this in case that is the intent of the question:

I believe there's some misconception about "Unique keys: No two elements in the container can have equivalent keys."

Look at the code below:

#include <unordered_map>

std::unordered_map<int, char> hashmap;
hashmap[5] = 'a';
hashmap[5] = 'b'; // replaces 'a' with 'b'; no collision is handled here, the key is simply reused

I think Jerry's answer is referring to the internal system the container uses to reduce keys to appropriate array indices.

If you want collisions to be handled for your types (with buckets), you need std::unordered_multimap, and you will have to iterate over the range of matching elements yourself (see the sketch below).
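
Before the project-specific code further down, here is a tiny self-contained version of that pattern (editorial addition; the key 42 and the strings are just placeholders):

#include <iostream>
#include <string>
#include <unordered_map>   // std::unordered_multimap lives in this header too

int main() {
    std::unordered_multimap<int, std::string> mm;
    mm.insert({42, "first"});
    mm.insert({42, "second"});   // the equal key is kept, not replaced

    // equal_range() yields all elements stored under this key;
    // walking it is linear in the number of matches.
    auto range = mm.equal_range(42);
    for (auto it = range.first; it != range.second; ++it)
        std::cout << it->second << '\n';
}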

Hopefully the following code can be read without the context I generated it with. It basically checks to see if any element in the bucket associated with the hash is the element I'm looking for.

// sp is an alias for std::shared_ptr
// memo is a member of type std::unordered_multimap<int, sp<AStarNode>>
// AStarNode, WorldGrid and sf::Vector2i (SFML) come from the surrounding project

// there are probably multiple issues with this code in terms of good design (like using int keys rather than unsigned)

bool AStar_Incremental::hasNodeBeenVisited(sp<AStarNode> node)
{
    using UMIter = std::unordered_multimap<int, sp<AStarNode>>::iterator;

    bool bAlreadyVisited = false;

    // get the range of all entries stored under this key, in O(1) on average
    int hash = WorldGrid::hashGrid(node->location);
    std::pair<UMIter, UMIter> start_end = memo.equal_range(hash); // range of equal keys
    UMIter start = start_end.first;
    UMIter end = start_end.second;

    // linear walk over the matching entries: O(m), where m is the number of
    // elements that share this key
    for (UMIter bucketIter = start; bucketIter != end; ++bucketIter)
    {
        sp<AStarNode> previousNode = bucketIter->second;
        sf::Vector2i& previousVisit = previousNode->location;
        if (previousVisit == node->location)
        {
            bAlreadyVisited = true;
            break;
        }
    }

    return bAlreadyVisited;
}
Enigma22134
  • 532
  • 5
  • 12