
In C++, I often want to create a std::unordered_set or std::unordered_map keyed on a type like std::tuple<int,int,int> or std::pair<int,int>. But I haven't found a way to do so that is fast: my custom hash functions are all slow. By slow I mean that programs that use them run faster if I use the ordered version (std::set or std::map) instead.

Here's an example of a custom hash function I wrote, following examples I found on the internet. This pattern seems to be common: an exclusive-or of the hashes of the primitive fields in a larger type.

#include <cstddef>     // size_t
#include <functional>  // std::hash
#include <utility>     // std::pair

using namespace std;
typedef pair<int,int> pi;

struct pair_hash
{
    size_t operator()(const pi& p) const
    {
        auto h1 = hash<int>{}(p.first);
        auto h2 = hash<int>{}(p.second);
        return h1 ^ h2;
    }
};

Here's another one:

#include <cstddef>     // size_t
#include <functional>  // std::hash
#include <tuple>       // std::tuple

using namespace std;
typedef tuple<int,int,int,int> t4;

struct t4_hash
{
    size_t operator()(const t4& t) const  // by reference, to avoid copying the tuple
    {
        auto h1 = hash<int>{}(get<0>(t));
        auto h2 = hash<int>{}(get<1>(t));
        auto h3 = hash<int>{}(get<2>(t));
        auto h4 = hash<int>{}(get<3>(t));
        return h1 ^ h2 ^ h3 ^ h4;
    }
};
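For contrast with the plain XOR above, here is a sketch of an order-sensitive combiner in the style of boost::hash_combine (the constant 0x9e3779b9 and the shift amounts follow the well-known Boost recipe; this is an illustration, not a benchmarked fix):

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Boost-style combiner: folds a new hash into a running seed so that
// operand order matters, unlike a plain XOR of the field hashes.
inline void hash_combine(std::size_t& seed, std::size_t h)
{
    seed ^= h + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Hypothetical alternative to pair_hash above, using the combiner.
struct pair_hash2
{
    std::size_t operator()(const std::pair<int,int>& p) const
    {
        std::size_t seed = 0;
        hash_combine(seed, std::hash<int>{}(p.first));
        hash_combine(seed, std::hash<int>{}(p.second));
        return seed;
    }
};
```

With this combiner, (a, b) and (b, a) no longer hash to the same value, and (a, a) no longer hashes to 0, which are exactly the collision families a plain XOR produces.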
    How are you profiling these to compare the performance? On small amounts of data, ordered collections are going to perform better. The unordered containers only scale better at higher input sizes. – Silvio Mayolo May 03 '21 at 02:30
  • Your hash is fast. Hashtables are merely slower than ordered sets for small amounts of data. – Mooing Duck May 03 '21 at 02:32
  • My use for sets is in a competitive programming problem, so I don't have access to the test cases directly but they are described to be on the order of hundreds of thousands of data points. My program that uses ordered sets finishes in well under 1 second, while the program using an unordered set times out at greater than 2 seconds. – composerMike May 03 '21 at 02:32
  • Do you have control over how you're building the programs? If you're timing an unoptimized build, then the timings are meaningless. What are the compiler options being used? – PaulMcKenzie May 03 '21 at 02:33
  • It's possible that the kinds of inputs they're using hash to the same values using this hash. E.g. `(a, b)` and `(b, a)` would always hash to the same thing, `(a, a)` would always hash to 0, etc. – Justin May 03 '21 at 02:35
  • No, I don't have control over the builds. I guess I should figure I'm doing nothing wrong, but for competitive programming I should test both ordered and unordered and most likely go with ordered. – composerMike May 03 '21 at 02:35
  • *No I don't have control over the builds.* -- Then that's a big issue. Until you put together a complete program that simply stress tests the usage of these structures, you're not going to get a concrete answer. Too many times questions are posted similar to yours, and it takes someone to prove that the claims are wrong when a complete program is built with optimizations and tested. As a "consumer" of these "competitive programming" sites, you should demand to see how your application is being built. C++ can have vast differences in speed depending on how a program is built. – PaulMcKenzie May 03 '21 at 02:37
  • Could be you are getting a lot of hash collisions. `unordered_set` is basically a `vector`, so hash collisions are going to kill you on the list traversal. – NathanOliver May 03 '21 at 02:49
  • These are not good hash functions. Try something like `(h1 << 3) ^ h2` for the first one, and similarly for the second one, shifting all but the last operand by different amounts. – user207421 May 03 '21 at 03:35
  • Related: https://stackoverflow.com/questions/2196995/is-there-any-advantage-of-using-map-over-unordered-map-in-case-of-trivial-keys – Jerry Jeremiah May 03 '21 at 03:53
  • @PaulMcKenzie I recall seeing somewhere what compiler options they use. I can't find it now but I do believe on at least two sites they give the specifics. I think they use some optimization. I think the lesson, though, is: don't assume the unordered versions of set and map are necessarily faster. Test them. – composerMike May 04 '21 at 20:37
  • No. The lesson is not to assume that your hash function is optimal. It isn't. The unordered implementations cannot be any better than your hash function. – user207421 May 05 '21 at 00:42
  • @user207421 Can you explain why they are not optimal and why your suggestion is better? Does it have to do with the case where all the fields have the same value? I know that case was not possible in my code (unless there was a bug of course). – composerMike May 06 '21 at 02:55
  • It would certainly break down in the case where all fields have the same value. You would get 0 or -1. In general you don't see hash functions of the form you used. You always see shifts or multiplications by a prime number, to spread out the values and make collisions less likely. For example try `((h1 << 3) ^ h2) * 37`. – user207421 May 07 '21 at 01:55
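The shift-and-multiply idea from the last comment can be sketched as follows (the shift amount 3 and the prime 37 are taken directly from the comment; they are illustrative choices, not a definitive recipe):

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Shift before XOR so that symmetric pairs (a, b) and (b, a) map to
// different values, then multiply by a prime to spread out the bits.
struct pair_hash_shifted
{
    std::size_t operator()(const std::pair<int,int>& p) const
    {
        std::size_t h1 = std::hash<int>{}(p.first);
        std::size_t h2 = std::hash<int>{}(p.second);
        return ((h1 << 3) ^ h2) * 37;
    }
};
```

Unlike the original `h1 ^ h2`, this breaks the symmetry between the two fields, so inputs that differ only in field order land in different buckets.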

0 Answers