I understand `std::map<std::set<int>, int> set_lookup;` is unnecessarily slow as it uses trees. Is `std::unordered_map<std::unordered_set<int>, int, hash>` the right approach?
It depends. If your keys are created and then not changed, and you want to do a lot of lookups very fast, then a hash-table-based approach would indeed be good, but you'll need two things for that:
- to be able to hash keys
- to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost `hash_combine` (which is short enough that you can copy it into your code - see here for the implementation). If your integer values are already quite random across most of their bits, though, simply XORing them together would produce a great hash. If you're not sure, use `hash_combine` or a better hash (e.g. MURMUR32). The time taken to hash will depend on the time to traverse the container holding the key's elements, and traversing an `unordered_set` typically involves a linked-list traversal (which typically jumps around in memory pages and is CPU-cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a `std::vector<>`, or `std::array<>` if the size is known at compile time.
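By way of illustration, here's a minimal sketch of the boost-style `hash_combine` (specialised for `int` here, whereas boost's is templated) alongside the XOR alternative; the name `xor_hash` is mine, not from any library:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Classic boost-style hash_combine: mixes one more value into a running seed.
inline void hash_combine(std::size_t& seed, int value)
{
    seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// The XOR alternative: only reasonable when the values are already quite
// random across most of their bits.
inline std::size_t xor_hash(const std::vector<int>& values)
{
    std::size_t h = 0;
    for (int v : values)
        h ^= static_cast<std::size_t>(v);
    return h;
}
```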
The other thing you need to do is compare keys for equality: that also works fastest when the elements in the key are contiguous in memory and consistently ordered. Again, a sorted `std::vector<>` or `std::array<>` would be best.
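Putting those two requirements together, a minimal sketch of the lookup table, assuming sorted `std::vector<int>` keys (`SortedVectorHash` is my own name, not a standard one); `std::vector` already provides `operator==`, so only the hash functor needs writing:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Boost-style hash_combine, as sketched above.
inline void hash_combine(std::size_t& seed, int value)
{
    seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash functor for a sorted std::vector<int> key: contiguous memory and a
// consistent order, so equal sets always hash identically.
struct SortedVectorHash
{
    std::size_t operator()(const std::vector<int>& key) const
    {
        std::size_t seed = 0;
        for (int element : key)
            hash_combine(seed, element);
        return seed;
    }
};

// std::vector provides operator==, so no custom KeyEqual is needed.
std::unordered_map<std::vector<int>, int, SortedVectorHash> set_lookup;

int main()
{
    std::vector<int> key{3, 1, 2};
    std::sort(key.begin(), key.end());  // keys must be kept sorted
    set_lookup[key] = 42;
}
```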
That said, if the sets for your keys are large, and you can compromise on a statistical guarantee of key equality, you could use e.g. a 256-bit hash and code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but with a good, collision-resistant 256-bit hash, a CPU could run flat-chat for millennia hashing distinct keys and still be unlikely to produce the same hash even once, so it's an approach I've seen even financial firms use in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want `std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>`. To find the `int` associated with a set of integers, you'd hash them first, then do a lookup. It's easy to write a hash function that produces the same output for a `set` or sorted `vector<>` or `array<>`, as you can present the elements to something like `hash_combine` in the same sorted order during traversal (i.e. just `size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);`). Storing the `vector<int>` means you can traverse the `unordered_map` later if you want to find all the key "sets" - if you don't need to do that (e.g. you're only ever looking up the `int`s by keys known to the code at the time, and you're comfortable with the statistical improbability of a good hash colliding), you don't even need to store the keys/vectors: `std::unordered_map<HashValue256, int>`.
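To make that concrete, here's a sketch of the pre-hashed variant. `HashValue256`, `HashValue256Hasher` and `hash256` are illustrative names of my own, and this `hash256` is a toy stand-in (four `hash_combine`-style streams with distinct seeds) that is *not* collision resistant - in a real system you'd substitute a genuinely strong wide hash (e.g. SHA-256):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative 256-bit hash value: four 64-bit words.
struct HashValue256
{
    std::array<std::uint64_t, 4> words{};
    bool operator==(const HashValue256& rhs) const { return words == rhs.words; }
};

// std::unordered_map still needs a size_t hash of the key; any single word
// will do, as the 256-bit value is already well mixed.
struct HashValue256Hasher
{
    std::size_t operator()(const HashValue256& h) const
    {
        return static_cast<std::size_t>(h.words[0]);
    }
};

// Toy stand-in for a real 256-bit hash: four hash_combine-style streams with
// distinct seeds. NOT collision resistant - use a real wide hash in practice.
template <typename SortedContainer>
HashValue256 hash256(const SortedContainer& c)
{
    HashValue256 result;
    result.words = {0x9e3779b97f4a7c15ull, 0xbf58476d1ce4e5b9ull,
                    0x94d049bb133111ebull, 0xd6e8feb86659fd93ull};
    for (const auto& element : c)
        for (auto& word : result.words)
            word ^= std::hash<typename SortedContainer::value_type>{}(element)
                    + 0x9e3779b9 + (word << 6) + (word >> 2);
    return result;
}

// Keep the original elements alongside the mapped int if you need to
// enumerate the key "sets" later...
std::unordered_map<HashValue256, std::pair<int, std::vector<int>>,
                   HashValue256Hasher> with_keys;

// ...or drop them entirely if you only ever look up by known keys.
std::unordered_map<HashValue256, int, HashValue256Hasher> without_keys;

int main()
{
    std::set<int> key{3, 1, 2};        // std::set iterates in sorted order
    without_keys[hash256(key)] = 42;
    with_keys[hash256(key)] = {42, {1, 2, 3}};
}
```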