Why does case insensitive work with std::hash function for std::unordered_set?

Question

the question I have is not about using a case-insensitive std::unordered_set, but rather how does it work?

#include "stdafx.h"
#include <string>
#include <iostream>
#include <unordered_set>
#include "boost/algorithm/string.hpp"


struct case_insensitive_comparer
{
    bool operator () (const std::string& x, const std::string& y) const
    {
        return boost::iequals(x, y);
    }
};

using case_insensitive_set = std::unordered_set<std::string, std::hash<std::string>, case_insensitive_comparer>;

std::vector<std::string> permute_case(const std::string& s)
{
    std::vector<std::string> strs;

    // Iterate through all bitmasks, 1 for uppercase, 0 for lowercase
    int msb = 1 << (s.length() - 1);
    int upper = 1 << s.length();
    std::locale loc;
    for (int i = 0; i < upper; i++)
    {
        int bit = msb;
        std::string current = "";
        for (size_t j = 0; j < s.length(); j++, bit >>= 1)
            current += (bit & i) ? std::toupper(s[j], loc) : std::tolower(s[j], loc);

        strs.push_back(current);
    }

    return strs;
}

int main()
{
    std::vector<std::string> strs = permute_case("awesome");

    case_insensitive_set set(strs.begin(), strs.end());

    // Check the hash
    for (auto& s : strs)
        std::cout << s << " :" << std::hash<std::string>()(s) << "\n";

    // Check the element
    for (auto& s : set)
        std::cout << s << "\n";

    return 0;
}

So I use a string case-insensitive comparer for std::unordered_set and the std::hash<std::string> as the hash function. My basic understanding of hash set(I assume that unordered_set is like a hash set) is that it calculates the key's hash and puts it into the set if it doesn't exist yet. And the comparer Pred is for when the set is trying to insert a key and there is a hash collision, it has to decide if the keys are the same or different.

Based on the code, it works regardless, so some of my assumptions are not correct. It would be helpful if someone tell me which of my assumptions are wrong.

Thank you.

Edit: My expectation for this case insensitive unordered_set is that there should only be 1 key inserted and it was the case that I observed, i.e only AWESOME is showed. So for my case I thought it worked but by the answer from kennytm really, I was just lucky getting all keys to be in the same bucket. I indeed use MSVC to compile the code.

On my machine it prints `AWESOmE` and `AWESOME`, so it *does not* work. — kennytm, Apr 09 '17 at 06:49
What do you mean by `case insensitive work` ? Please explain what you expected and what you observed. — Support Ukraine, Apr 09 '17 at 07:07

score 5 · Accepted Answer · edited May 23 '17 at 11:54

Let's recall how a hash table works.

A hash table with capacity N is an array of buckets. The bucket is usually a linked list or binary search tree. Conceptually you may think of a hash table as

template <typename T>
class HashTable {
    std::vector<std::forward_list<T>> _buckets;

public:
    HashTable(size_t capacity = 16) : _buckets(capacity) {}
    size_t bucket_count() const { return _buckets.size(); }

Every key k ∈ T can be inserted into a bucket of the hash table. Which bucket is chosen, is determined by a function bucket_index which takes the key k and the capacity N as input, and produces an array index 0 ≤ i < N, which bucket should the key belong.

    void insert(T&& key) {
        // locate the bucket.
        size_t i = bucket_index(key, bucket_count());
        auto& bucket = _buckets[i];
        // ensure the key does not already exist in the bucket
        if (std::find(bucket.cbegin(), bucket.cend(), key) == bucket.cend()) {
            // now insert the key into the bucket.
            bucket.push_front(std::move(key));
        }
    }

The bucket_index function is typically implemented in terms of the hash function, and then take the modulus with the capacity:
```
private:
    static size_t bucket_index(const T& key, size_t cap) {
        return std::hash<T>()(key) % cap;
    }
};
```
Note that it does not use std::hash<T>()(key) directly: two keys will refer to the same bucket whenever hash % cap are equal.

And this is why OP's code appear to work on MSVC. In MSVC's implementation of unordered_set, the initial capacity is 8. And then, if you print the hash as hexadecimal, you'll notice the last digit is always c:

AWESOME :7552acc94fd16a5c
AWESOMe :75528cc94fd133fc
AWESOmE :75bf6cc9502dcf7c
AWESOme :75bf8cc9502e05dc
AWESoME :60234cc8b2d194fc
...
awesOme :976734d757ba79dc
awesoME :81caf4d6ba5e08fc
awesoMe :81cb14d6ba5e3f5c
awesomE :815e34d6ba01a3dc
awesome :815e14d6ba016d7c

Therefore, hash % 8 will always be 4, i.e. the same bucket out of eight will be chosen by all 128 keys. Remember what happens after we've chosen a bucket? We check if a key already exists in the linked list, which is always true, so only the first key "AWESOME" will be present.

AWESOME

This gives the illusion where just replacing the == works, while what really happens is just MSVC's hash function has very poor quality.

To further show that OP's code "does not work", let's switch to another standard library. When using clang with libc++, we get these results:

AWESOME :1a285ecfc4bab378
AWESOMe :acb9b7f4f69b16e2
AWESOmE :fd66d9186a434601
AWESOme :254b008bd66d1e29
AWESoME :27cac8154bb934d0
...
awesOme :a4e8c2140834341e
awesoME :cfd12a83da4a4b0f
awesoMe :b4c4eb4c60968581
awesomE :bdca27cd606f4f42
awesome :14ddc089ab5badb5

Unlike MSVC's hash, libc++'s hash is quite uniformly distributed. libc++'s unordered_set's initial capacity is 2, and both buckets are filled, so the set has two elements:

AWESOmE
AWESOME

and OP's code does not work in general.

Note: Here I assumed hash-collision is handled by separate chaining and there is no dynamic resizing, although both of these won't enter the picture since == always returns true.

Wonderful! Thank you for the awesome explanation. It really answers my question. — NNg, Apr 09 '17 at 21:50
You can use reserve() to increase the number of buckets in the unordered_set and then MSVC does show that the original code doesn't work. — Superfly Jon, Feb 06 '18 at 10:59

Paul Floyd · Answer 2 · 2017-04-09T10:05:03.203

The problem is that you have a case sensitive hasher and a case insensitive comparator. If you make the hasher case insensitive then you will get just a single entry.

For instance:

#include <boost/algorithm/string/case_conv.hpp>

struct case_insensitive_hasher
{
  size_t operator()(const std::string& key) const
    {
      std::string keyCopy(key);
      boost::to_lower(keyCopy);
        return std::hash<std::string>()(keyCopy);
    }
};

using case_insensitive_set = std::unordered_set<std::string, case_insensitive_hasher, case_insensitive_comparer>;

The output will just contain AWESOME, the first entry inserted.

I wish we could that without copy... – AgainPsychoX Jan 25 '23 at 21:35 — AgainPsychoX, Jan 25 '23 at 21:35

Why does case insensitive work with std::hash function for std::unordered_set?

2 Answers2