8

I need a fast, simple hash function that creates a unique identifier for a pair of uint32_t values - so the same hash value for (2,7) and (7,2).

Any idea?

plasmacel
  • 8,183
  • 7
  • 53
  • 101
  • 2
    Create a uint64 by bitshifting the smaller (or larger) of the two numbers and adding the other. Then you just need to hash the 64bit int. (Alternatively, have the hash function copy the pair, then swap the elements to guarantee order and apply the real hash on that pair) – David Rodríguez - dribeas Jul 20 '13 at 18:49
  • @DavidRodríguez-dribeas: Yeah, meanwhile I came up with the solution, thanks. I already had the bitshift stuff for it, but you remembered me that the comparison is the magic. – plasmacel Jul 20 '13 at 18:50

2 Answers2

6

To answer my own question, the solution is:

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    if (x < y) return (b << 32) | a;
    else return (a << 32) | b;
}

Which can be improved to the branchless version

uint64_t hash(uint32_t x, uint32_t y)
{
    const uint64_t a = static_cast<uint64_t>(x);
    const uint64_t b = static_cast<uint64_t>(y);

    const uint64_t h0 = (b << 32) | a;
    const uint64_t h1 = (a << 32) | b;

    return (x < y) ? h0 : h1; // conditional move (CMOV) instruction
}

These methods are perfect hash functions - they guarantee zero collisions. However, they have the disadvantage that you cannot hash values above 2^32 - 1.

plasmacel
  • 8,183
  • 7
  • 53
  • 101
  • 2
    I like the idea of shifting, it's natural and there is no need to proof uniqueness. If you want to deal with values above 2^32 you can return a string as a unique identifier, where you reserve a special symbol to separate two parts of a hash (changing representation base to larger than 10 is a good idea also) – pkacprzak Jul 20 '13 at 22:51
  • "changing representation base to larger than 10 is a good idea also" - how you mean that? – plasmacel Jul 20 '13 at 22:59
  • 1
    if you are representing number as a string, then you can use a larger alphabet than {0,1, ..., 9} e.g. hexadecimal system or even larger. The larger base the shorter representation. – pkacprzak Jul 20 '13 at 23:05
  • 1
    I have used `if (x>y)` when I had the exact same need as yours, but now that I think about it, `hash(x) ^ hash(y)` would have done as well (with the issue that all pairs of identical elements `(x, x)` have the same hash. That can be a problem for some applications) – Pascal Cuoq Jul 28 '13 at 13:52
  • 2
    @plasmacel signed int right shifting is implementation-defined, and your `max << 32` invokes undefined behaviour (because you are shifting by an amount >= datatype size). Only after the `|` the result is guaranteed to be cast to `uint64_t`. In practice your compiler is doing 64bit computation anyway, but this cannot be assumed for sure. – Astrinus Mar 06 '17 at 10:18
2
constexpr uint32_t hash_max = ...;    

constexpr uint32_t commutative_hash(uint32_t i, uint32_t j) {
   return (i*j + (i*i)*(j*j) + (i*i*i)*(j*j*j)) % hash_max;
};

Extra parentheses are for compiler - it will be easier to optimize this expression.

Do not use any conditional instruction (or std::max/std::min) which breaks CPU pipeline (and is slow) if you want to make a fast function.

Leonid Volnitsky
  • 8,854
  • 5
  • 38
  • 53
  • Thanks, but this is extremely slow due to the lot of multiplications. See my answer. – plasmacel Jul 20 '13 at 19:30
  • 1
    I can bet that my function is faster. Did you benchmark your function against mine? I've added explanation in the answer why it is faster. – Leonid Volnitsky Jul 20 '13 at 19:39
  • @LeonidVolnitsky +1 for the right explanation. The conditional statement might confuse branch prediction in some cases. – NumberFour Jul 20 '13 at 19:44
  • @LeonidVolnitsky: Indeed you are right, yours is faster. I'm surprised that conditionals so much slow down the things. Could you explain it in more details? – plasmacel Jul 20 '13 at 20:10
  • 1
    On modern CPUs with pipeline - yes. See also https://sites.google.com/site/dataanxiety/blog/importantnumbersincomputerscience – Leonid Volnitsky Jul 20 '13 at 20:22
  • I edited your function to correctly handle zero inputs. See my answer update. – plasmacel Jul 20 '13 at 20:59