I understand `std::map<std::set<int>, int> set_lookup;` is unnecessarily slow as it uses trees. Is `std::unordered_map<std::unordered_set<int>, int, hash>` the right approach?
It depends. If your keys are created and then not changed, and you want to do a lot of lookups very fast, then a hash-table-based approach would indeed be good, but you'll need two things for that:
- to be able to hash keys
- to be able to compare keys
To hash keys, deciding on a good hash function is a bit of an art form. A rarely bad - but sometimes slower than necessary - approach is to use boost `hash_combine` (which is short enough that you can copy it into your code - see here for the implementation). If your integer values are already quite random across most of their bits, though, simply XORing them together would produce a great hash. If you're not sure, use `hash_combine` or a better hash (e.g. MURMUR32). The time taken to hash will depend on the time to traverse the container holding the key's elements, and traversing an `unordered_set` typically involves a linked-list traversal (which typically jumps around in memory pages and is CPU-cache unfriendly). The best way to store the values for fast traversal is in contiguous memory - i.e. a `std::vector<>`, or `std::array<>` if the size is known at compile time.
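By way of illustration, here's a minimal sketch of the boost-style `hash_combine` (specialised for `int` here, whereas boost's is templated) alongside the XOR alternative; the name `xor_hash` is mine, not from any library:

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Classic boost-style hash_combine: mixes one more value into a running seed.
inline void hash_combine(std::size_t& seed, int value)
{
    seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// The XOR alternative: only reasonable when the values are already quite
// random across most of their bits.
inline std::size_t xor_hash(const std::vector<int>& values)
{
    std::size_t h = 0;
    for (int v : values)
        h ^= static_cast<std::size_t>(v);
    return h;
}
```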
The other thing you need to do is compare keys for equality: that also works fastest when the elements in the key are contiguous in memory and consistently ordered. Again, a sorted `std::vector<>` or `std::array<>` would be best.
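Putting those two requirements together, a minimal sketch of the lookup table, assuming sorted `std::vector<int>` keys (`SortedVectorHash` is my own name, not a standard one); `std::vector` already provides `operator==`, so only the hash functor needs writing:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <unordered_map>
#include <vector>

// Boost-style hash_combine, as sketched above.
inline void hash_combine(std::size_t& seed, int value)
{
    seed ^= std::hash<int>{}(value) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hash functor for a sorted std::vector<int> key: contiguous memory and a
// consistent order, so equal sets always hash identically.
struct SortedVectorHash
{
    std::size_t operator()(const std::vector<int>& key) const
    {
        std::size_t seed = 0;
        for (int element : key)
            hash_combine(seed, element);
        return seed;
    }
};

// std::vector provides operator==, so no custom KeyEqual is needed.
std::unordered_map<std::vector<int>, int, SortedVectorHash> set_lookup;

int main()
{
    std::vector<int> key{3, 1, 2};
    std::sort(key.begin(), key.end());  // keys must be kept sorted
    set_lookup[key] = 42;
}
```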
That said, if the sets for your keys are large, and you can compromise on a statistical guarantee of key equality, you could use e.g. a 256-bit hash and code as if hash collisions always correspond to key equality. That's often not an acceptable risk, but with a good, collision-resistant 256-bit hash, a CPU could run flat-chat for millennia hashing distinct keys and still be unlikely to produce the same hash even once, so it's an approach I've seen even financial firms use in their core in-house database products, as it can save so much time.
If you're tempted by that compromise, you'd want `std::unordered_map<HashValue256, std::pair<int, std::vector<int>>>`. To find the `int` associated with a set of integers, you'd hash them first, then do a lookup. It's easy to write a hash function that produces the same output for a `set` or sorted `vector<>` or `array<>`, as you can present the elements to something like `hash_combine` in the same sorted order during traversal (i.e. just `size_t seed = 0; for (auto& element : any_sorted_container) hash_combine(seed, element);`). Storing the `vector<int>` means you can traverse the `unordered_map` later if you want to find all the key "sets" - if you don't need to do that (e.g. you're only ever looking up the `int`s by keys known to the code at the time, and you're comfortable with the statistical improbability of a good hash colliding), you don't even need to store the keys/vectors: `std::unordered_map<HashValue256, int>`.
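To make that concrete, here's a sketch of the pre-hashed variant. `HashValue256`, `HashValue256Hasher` and `hash256` are illustrative names of my own, and this `hash256` is a toy stand-in (four `hash_combine`-style streams with distinct seeds) that is *not* collision resistant - in a real system you'd substitute a genuinely strong wide hash (e.g. SHA-256):

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <set>
#include <unordered_map>
#include <utility>
#include <vector>

// Illustrative 256-bit hash value: four 64-bit words.
struct HashValue256
{
    std::array<std::uint64_t, 4> words{};
    bool operator==(const HashValue256& rhs) const { return words == rhs.words; }
};

// std::unordered_map still needs a size_t hash of the key; any single word
// will do, as the 256-bit value is already well mixed.
struct HashValue256Hasher
{
    std::size_t operator()(const HashValue256& h) const
    {
        return static_cast<std::size_t>(h.words[0]);
    }
};

// Toy stand-in for a real 256-bit hash: four hash_combine-style streams with
// distinct seeds. NOT collision resistant - use a real wide hash in practice.
template <typename SortedContainer>
HashValue256 hash256(const SortedContainer& c)
{
    HashValue256 result;
    result.words = {0x9e3779b97f4a7c15ull, 0xbf58476d1ce4e5b9ull,
                    0x94d049bb133111ebull, 0xd6e8feb86659fd93ull};
    for (const auto& element : c)
        for (auto& word : result.words)
            word ^= std::hash<typename SortedContainer::value_type>{}(element)
                    + 0x9e3779b9 + (word << 6) + (word >> 2);
    return result;
}

// Keep the original elements alongside the mapped int if you need to
// enumerate the key "sets" later...
std::unordered_map<HashValue256, std::pair<int, std::vector<int>>,
                   HashValue256Hasher> with_keys;

// ...or drop them entirely if you only ever look up by known keys.
std::unordered_map<HashValue256, int, HashValue256Hasher> without_keys;

int main()
{
    std::set<int> key{3, 1, 2};        // std::set iterates in sorted order
    without_keys[hash256(key)] = 42;
    with_keys[hash256(key)] = {42, {1, 2, 3}};
}
```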