3

If we have a set of pointers we know are aligned to sizeof(void *) whats that fastest way to hash them?

Notes:

  • Example use case is taking elements of a pointer array or memory allocations and storing in a hash-map. Noting this because this question isn't about the kind of cryptographic hashing needed for passwords, security etc.

  • By fixed size int, I mean we know the exact size of the int and it wont vary (perhaps this is important since some hashing libraries use intptr_t or size_t for their hashing return values which might give a different answer to this question).

  • By portable, this should work for 32, 64 bit, big & little endian.

  • (uint32_t)(((intptr_t)p) >> 2) gives good results for 32bit, big endian, however I imagine it looses significant bits for 64bit systems, and I'm not sure if this gives a usable distribution for little endian.

ideasman42
  • 42,413
  • 44
  • 197
  • 320

2 Answers2

3

When mod math is fast, a quick hash is to mod by a prime <= TARGET_TYPE_MAX. The mod will use all the bits of the p to form the hash.

By using the largest prime, only a few buckets are lost, but speed is the goal.

Example, if the target tpye is uint32_t, use 4294967291u.

With variant sized integer types like int, use macros to select the precomputed prime. Primes just less than a power of two.

#define LARGEST_PRIME8 251u
#define LARGEST_PRIME15 32749u
#define LARGEST_PRIME16 65521u
#define LARGEST_PRIME31 2147483647u
#define LARGEST_PRIME32 4294967291u
#define LARGEST_PRIME63 9223372036854775783u
#define LARGEST_PRIME64 18446744073709551557u

uint32_t hash = (uint32_t) ((uintptr_t)(void *)p) % LARGEST_PRIME32);
chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • 1
    Two things, firstly, with pointer aligned values, this is guaranteed to give poor distribution (bit shift can solve of course). Second modulo operations are known to be slow, many C hashing libraries use bit-shifting instead (GTK's `GHashTable` for eg). Is there something about these sizes that allows them to be computed faster than a regular modulo? – ideasman42 Nov 02 '18 at 01:12
  • 2
    @ideasman42 Any compiler worth its bits will do strength-reduction for division / modulo with known divisor. – Deduplicator Nov 02 '18 at 01:57
  • @ideasman42 Disagree with "this is guaranteed to give poor distribution" - modding by a prime flattens distribution of items like pointers which tend to have clumped bit values. As to "Second modulo operations are known to be slow" does not apply here as 1 this answer starts with the premise "When mod math is fast" and 2, good compilers are know to well optimize modulo based on fixed constant divisors.. – chux - Reinstate Monica Nov 02 '18 at 02:54
  • If the pointers are aligned to 8 bytes for eg, then modulo by a large number will still leave gaps between the values, if the memory locations are from one region of memory (a larger array allocation for eg). As for modulo w/ a fixed size, I was aware the fixed value can be optimized out, but would still want to compare it with bit shifting and xor for eg. – ideasman42 Nov 02 '18 at 05:22
  • @ideasman42 With a 64-bit pointer and mod 4294967291 to hash to a `uint32_t`, there are not large gaps. Perhaps a numeric example to express your point abut _large gaps_. – chux - Reinstate Monica Nov 02 '18 at 05:28
  • Not large gaps, but being aligned to pointer size causes even, small gaps the size of a pointer. With this Python expression, these values are still 8 apart: `[i % 4294967291 for i in range(9223372036854775783, 9223372036854775783 + 100, 8)]` (9223372036854775783 is just an example of a large number standing in place for a 64bit memory address) – ideasman42 Nov 02 '18 at 05:32
  • @ideasman42 True, the values remain 8 apart. For a [cryptographic hash](https://en.wikipedia.org/wiki/Cryptographic_hash_function) that is bad, but OP specified "this question isn't about the kind of cryptographic hashing ". Using the hash for a table look-up, having the hashes being usually different is sufficient. – chux - Reinstate Monica Nov 02 '18 at 05:42
  • 1
    Just a note: Primes one less than a power of two are known as [Mersenne primes](https://en.wikipedia.org/wiki/Mersenne_prime); and numbers that are one less than a power of two as Mersenne numbers. – Nominal Animal Nov 03 '18 at 00:05
3

If you are okay creating a 64bit in -> 64 bit out restriction, the mumur3 hash finalizer function has very good properties.

Here's the 64 bit one (from discussion here: http://zimbry.blogspot.com/2011/09/better-bit-mixing-improving-on.html)

UInt64 MurmurHash3Mixer( UInt64 key )
{
  key ^= (key >> 33);
  key *= 0xff51afd7ed558ccd;
  key ^= (key >> 33);
  key *= 0xc4ceb9fe1a85ec53;
  key ^= (key >> 33);
  return key;
}

Some additional discussion about discovering such functions, including 32 bit -> 32bit variants. https://nullprogram.com/blog/2018/07/31/

Googling around with terms like "full avalanche" or "mumur3 mixing vs..." should get you a seemingly infinite amount of things to read.

one more link: How to create a custom Murmur Avalanche Mixer?

dpzmick
  • 454
  • 1
  • 3
  • 9