Very fast universal hash function for 128 bit keys

Question

I need a very fast universal hash function for a 128-bit key. The returned value needs to be about 32 bit (well, 16 bit would be sufficient; in most cases I only need 1-4 bits actually).

Universal hash means, there are two parameters: key (128 bit) and index (64 bit). For two keys, the universal hash function needs to return different result eventually, if called with different indexes. So with a different index, the universal hash should behave like a different hash function. For x = universalHash(k, i) and y = universalHash(k, i + 1), it would be best if on average 50% of all bits are different between x and y (randomly). The same for the case if the method is called with different keys. In practise, 5% off is OK for me.

It needs to be very fast (one or two multiplications at most). It is called millions of times. Please don't say: no, you won't need it to be fast. It also needs to return different values eventually.

What I have so far (Java code, but C is (due to the lack of a 128 bit data type, the key is the composite of a and b, which are 64 bit each):

int universalHash(long a, long b, long index) {
    long x = a ^ Long.rotateLeft(b, (int) index) ^ index;
    int y = (int) ((x >>> 32) ^ x);
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = ((y >>> 16) ^ y) * 0x45d9f3b;
    y = (y >>> 16) ^ y;
    return y;
}

int universalHash2(long a, long b, long index) {
    long x = Long.rotateLeft(a, (int) index) ^ 
            Long.rotateRight(b, (int) index) ^ index;
    x = (x ^ (x >>> 32)) * 0xbf58476d1ce4e5b9L;
    return (int) ((x >>> 32) ^ x);
}

(The second method is actually broken for some values.)

I would like to have a hash function that is faster than those above, and is guaranteed to work in all cases (if possible provably correct, even thought that's not a strict requirement; it doesn't need to be cryptographically secure however).

I will call the universalHash method with incrementing index (first index 0, then index 1, and so on) for the same keys. It would be best if the next result could be calculated faster (e.g. without multiplication) from the previous result. But I also need to have a fast "direct access" if the index is some value (as in the example code).

Background

The problem I'm trying to solve is finding a MPHF (minimal perfect hash function) for a relatively small set of keys (up to 16 keys by directly mapping, and up to about 1024 keys by splitting into smaller subsets). For details on the algorithm, see my MinPerf project, specially the RecSplit algorithm. To support set of size 10^12 (like BBHash), I'm trying to internally use 128 bit signatures, which would simplify the algorithm.

You really need to expand more on the requirements. Your "universal hash" definition is very different from the mathematical one. — Eugene Sh., Oct 25 '17 at 14:45
An important aspect is, if the hash function has to be cryptographically secure or if it just needs to be equally distributed — Ctx, Oct 25 '17 at 15:04
Wanted to mention `Long.hashCode` and `Long.reverseBytes` for fiddling. — Joop Eggen, Oct 25 '17 at 15:19
This also seems like an [XY problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). *Why* do you need this hash? What *problem* are you trying to actually solve? — Andrew Henle, Oct 25 '17 at 15:58
How micro-optimized do you want this to be? Can you take advantage of SIMD 128b vectors to manipulate your data? (x86 SSE, ARM NEON, whatever MIPS calls theirs, ...) What about hardware CRC instructions on x86? It's not a great hash, but it is very fast with hardware support. (Maybe just concat the "index" and the key and do a 192 bit CRC32C. If you save the state after processing the key, you get the CRC of key + different indices even more cheaply) — Peter Cordes, Oct 26 '17 at 03:48
Do you really need 64-bit indices? Is something like an XOR of all four 4B chunks anywhere near useful? — Peter Cordes, Oct 26 '17 at 03:52
@PeterCordes the CRC idea sounds nice, but would require too much hardware support for me (I prefer a solution that works in C as well as Java... I know...). But popcount, clz, nlz and so on a fine. Carry-less multiplication probably not (not available in Java yet). reverseBytes should be fine, it seems to be [an intrinsic](https://gist.github.com/apangin/7a9b7062a4bd0cd41fcc). Indices should be about 40 bits (32 bit is not enough in some cases), but if there is a great solution with 32 bit, then I could adapt the algorithm maybe. — Thomas Mueller, Oct 26 '17 at 04:32
So you can assume `popcnt` hardware support but not `crc32`? Is it more of a software issue in taking advantage of it? They both appeared at the same time on Intel, in Nehalem (SSE4.2). But [AMD had `popcnt` starting with K10](https://en.wikipedia.org/wiki/SSE4#POPCNT_and_LZCNT), and not CRC32 until Bulldozer. Anyway, it's not *really* portable to assume hardware popcnt support, but you definitely want to use it if available, so you need some kind of runtime dispatching. Byte-reverse is cheap since 486 `bswap`, so you don't have to worry about that (but I don't think it helps). — Peter Cordes, Oct 26 '17 at 04:42
How portable do you need the solution to be? This is for your library which has C and Java versions and that anyone might use on any hardware? Including non-x86? Is it acceptable to use a totally different hash on different machines (run-time) or in different builds (compile time)? So if support is available, you could do something that would be slow in pure C, but that reduces collisions. See also https://stackoverflow.com/questions/17645167/implementing-sse-4-2s-crc32c-in-software for CRC32C with HW or SW. — Peter Cordes, Oct 26 '17 at 04:43
@PeterCordes I would prefer a solution that is available on ARM and Intel. Right now I'm experimenting with a solution inspired by [Daniel Lemire](https://lemire.me/blog/2017/09/18/visiting-all-values-in-an-array-exactly-once-in-random-order/), for 2^n, so that even multiplication is not needed if the same key is used with index, index+1, index+2. I don't think even hardware CRC can beat addition + shift + xor. — Thomas Mueller, Oct 26 '17 at 07:07
Yeah an LCG (linear congruential generator) should be good with a power-of-2 `n` (when the compiler *knows* it's a power of 2, or you manually code it to modulo using `& (n-1)`), but you could do that for the index part for any way of hashing the key down to 32-bit, couldn't you? (And BTW, one CRC32 instruction has 3c latency on Sandybridge, same as add+shift+xor but fewer instructions to decode. So it's quite good as long as you can use the instruction efficiently. The raw instruction might need some post-processing to make a real CRC32C, but you don't need that, just its mixing) — Peter Cordes, Oct 26 '17 at 07:16

score 0 · Answer 1 · answered Oct 25 '17 at 15:53

You need a hash function that outputs 32 bits for 128 bits of inputs.

A simple way would be to just return "some" 32 bits out of the original 128 bits. There are many ways of choosing 32 bits and every choice will have collisions. But the index can decide which 32 bits to choose.

128/32 = 4, so 4 indices are enough to find at least one different bit.

For key 0 you choose the lower most 32 bits
For key 1 you choose the next 32 bits
and so on ..

The C implementation would be

uint32_t universal_hash(uint64_t key_higher, uint64_t key_lower, int index) {
    // For a lack of portable 128 bit datatype we take the key in parts.
    return 0xFFFFFFFF & ( index >=2 ? key_higher >> ((index - 2)*32) : key_lower >> (index*32));
}

The problem is that this approach is not sufficiently random (see [avalanche effect](https://en.wikipedia.org/wiki/Avalanche_effect)) — Thomas Mueller, Oct 25 '17 at 19:00
@ThomasMueller yes, that is the case. But now it depends on your application. If you just want the hash to create a hashmap, random is not needed. But I will leave you as the judge of your application. — Ajay Brahmakshatriya, Oct 26 '17 at 04:53

Very fast universal hash function for 128 bit keys

1 Answers1