129

What integer hash functions are good that accept an integer hash key?

Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
Lear
  • 1,893
  • 2
  • 13
  • 9

11 Answers

209

I found the following algorithm provides a very good statistical distribution. Each input bit affects each output bit with about 50% probability. There are no collisions (each input results in a different output). The algorithm is fast except if the CPU doesn't have a built-in integer multiplication unit. C code, assuming int is 32 bit (for Java, replace >> with >>> and remove unsigned):

unsigned int hash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    return x;
}

The magic number was found using a special multi-threaded test program that ran for many hours. It calculates the avalanche effect (the number of output bits that change if a single input bit is changed; this should be nearly 16 on average), the independence of output bit changes (output bits should not depend on each other), and the probability of a change in each output bit if any input bit is changed. The calculated values are better than those of the 32-bit finalizer used by MurmurHash, and nearly as good (though not quite) as when using AES. A slight advantage is that the same constant is used twice (it did make it slightly faster the last time I tested; I'm not sure if that's still the case).

You can reverse the process (get the input value from the hash) if you replace the 0x45d9f3b with 0x119de1f3 (the multiplicative inverse):

unsigned int unhash(unsigned int x) {
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = ((x >> 16) ^ x) * 0x119de1f3;
    x = (x >> 16) ^ x;
    return x;
}

For 64-bit numbers, I suggest using the following, even though it might not be the fastest. This one is based on splitmix64, which seems to be based on the blog article Better Bit Mixing (mix 13).

uint64_t hash(uint64_t x) {
    x = (x ^ (x >> 30)) * UINT64_C(0xbf58476d1ce4e5b9);
    x = (x ^ (x >> 27)) * UINT64_C(0x94d049bb133111eb);
    x = x ^ (x >> 31);
    return x;
}

In this case, reversing is more complicated:

uint64_t unhash(uint64_t x) {
    x = (x ^ (x >> 31) ^ (x >> 62)) * UINT64_C(0x319642b2d24d8ec3);
    x = (x ^ (x >> 27) ^ (x >> 54)) * UINT64_C(0x96de1b173f119089);
    x = x ^ (x >> 30) ^ (x >> 60);
    return x;
}

All the above is for C. For Java, use long, add L to the constants, replace >> with >>> and remove unsigned.

Update: You may also want to look at the Hash Function Prospector project, where other (possibly better) constants are listed.

Thomas Mueller
  • 48,905
  • 14
  • 116
  • 132
  • 2
    the first two lines are exactly the same! is there a typo here? – Kshitij Banerjee Nov 23 '12 at 09:39
  • 7
    No this is not a typo, the second line further mixes the bits. Using just one multiplication isn't as good. – Thomas Mueller Nov 23 '12 at 09:54
  • 1
    yeah but the 'h' before was making it redundant .I figured it out though.. thanks for the edit. It is decently fast completing in 19 machine cycles on my machine. BTW, why'd you change the magic number? does better? – Kshitij Banerjee Nov 23 '12 at 10:09
  • 3
    I changed the magic number because according to a [test case I wrote](http://code.google.com/p/h2database/source/browse/trunk/h2/src/test/org/h2/test/store/CalculateHashConstant.java) the value 0x45d9f3b provides better [confusion and diffusion](http://en.wikipedia.org/wiki/Confusion_and_diffusion), specially that if one output bit changes, each other output bit changes with about the same probability (in addition to all output bits change with the same probability if an input bit changes). How did you measure 0x3335b369 works better for you? Is an int 32 bit for you? – Thomas Mueller Nov 23 '12 at 13:38
  • I just did it on some live test cases. I replaced the hash function and ran it in Quantify. Might be case specific though. – Kshitij Banerjee Nov 24 '12 at 20:12
  • I'm sorry what is Quantify? – Thomas Mueller Nov 25 '12 at 09:57
  • 1
    Its a performance testing tool. In this case , it showed me that the lookups to the map were on an average faster with the old magic number. But needs more testing to be sure. – Kshitij Banerjee Nov 26 '12 at 07:41
  • Using just one multiplication might actually be faster for your case (even though you have more conflicts on average). It really depends on the use case. I use it as the supplemental secondary hash function for a concurrent LIRS hash map / cache, and I use 4 or 5 bits of the hash code to select a hash segment (each segment is synchronized). So in my case it is very important to get the best possible distribution to improve concurrency. – Thomas Mueller Nov 26 '12 at 08:39
  • 3
    I am searching for a nice hash function from 64-bit unsigned int to 32-bit unsigned int. For that case, will the above magic number be the same? I shifted 32 bits instead of 16. – alessandro Nov 30 '12 at 09:23
  • 1
    What I have done is following: uint32_t hash(uint64_t x) { x = ((x >> 32) ^ x) * 0x45d9f3b; x = ((x >> 32) ^ x) * 0x45d9f3b; x = ((x >> 32) ^ x); return (uint32_t) x; } [code is in c++] – alessandro Nov 30 '12 at 09:33
  • 3
    I believe in that case a larger factor would be better, but you would need to run some tests. Or (this is what I do) first use `x = ((x >> 32) ^ x)` and then use the 32 bit multiplications above. I'm not sure what's better. You may also want to look at [the 64-bit finalizer for Murmur3](http://code.google.com/p/smhasher/wiki/MurmurHash3) – Thomas Mueller Nov 30 '12 at 10:12
  • OK, so there are no collisions pre range-reduction, but what happens when I have to find a location in a 16k table. Where's the modulo to guarantee the hash value hashes to a location in the table's range? What happens to collisions once the range reduction modulo occurs? Without knowing what the range is, how can you know if you have a good hash algorithm? –  Oct 26 '14 at 17:56
  • 2
    @RocketRoy if you use modulo afterwards, of course there will be collisions at some point. The point is not to avoid collisions, but to make them as unlikely as possible. The best you can possibly get is the same probability as a true random number (not pseudo-random, but truly random), but that's unfeasible. The second best is the same probability as a cryptographically secure random number generator (AES or SHA-3 for example), but that's slow. The algorithm above is fast but close to AES with regards to probability. – Thomas Mueller Oct 26 '14 at 18:49
  • Thanks Thomas. Helpful. I've been working on a Hash vs Binary search challenge here on SO, and there the OP spec-ed "16k" keys. A hash function I got somewhere (can't remember where anymore) multiplied the key by 31 % table size. That hash runs almost 4x as fast and produces fewer collisions - 3k vs 4k - for a prime number sized table around 4/3rds the size of the 16k keys. From what Olof Forshell is telling me, hash functions need to be tailored to the key count and data distribution. My bsearch2() function will outperform the hash function above for the given parameters. –  Oct 26 '14 at 19:21
  • 1
    @RocketRoy OK I see. Something to remember is that modulo operations (and divisions) are slow on most CPUs. I would avoid using those operations, and instead use a hash table size that is a power of 2. That way, you can just use bitwise and. But binary search sometimes has better locality. By the way, there is also interpolation sort, maybe that is even faster (log log n instead of log n). – Thomas Mueller Oct 27 '14 at 08:09
  • 1
    Agree a mod of a power of 2 is fastest, as division is then a bit shifts right, but not always practical. Bsearch() has one big problem as implemented in C/C++, and a smaller problem in general. In C/C++ the comparison function is called via a pointer. That turns out to be an order of magnitude albatross around it's neck. The other is the "stupid" comparisons at the end of the search where it takes 5-6 probes to resolve the last 32-64 values. Locality is excellent, and CPU load very light, as running many .exes proved. Bsearch2() was the result. http://stackoverflow.com/a/26415664/1899861 –  Oct 27 '14 at 22:49
  • You're the only other person I've met that knows about interpolation searches, but I've never been able to get them to out-perform bsearch(). However, using the Bsearch2() approach of reverting to a linear search at the end might change that. Thanks for the clarifications and feedback. Cheers! –  Oct 27 '14 at 22:53
  • 1
    Is there a way to reverse the process? i.e. generate the key from the hash? @RocketRoy – Santosh Ghimire Jan 05 '15 at 11:36
  • 2
    @SantoshGhimire Yes, in this case reversing the operation is possible, as there are no collisions. You just need to replace `0x45d9f3b` with `0x119de1f3` in the formula (the reciprocal value). – Thomas Mueller Jan 05 '15 at 14:04
  • 2
    @SantoshGhimire, the key caveat is there can be no collisions, otherwise two (or more) different values are represented by the same key, one of which must be unreversible. Unless memory is at a premium, and especially if CPU is, you wouldn't want to do this. As the hash is CPU intensive, it's preferable to look it up. Hash functions perform range reduction, and that is best thought of as lossy compression. Use reversal with great caution. –  Jan 05 '15 at 22:13
  • @ThomasMueller Replacing `0x45d9f3b` with `0x119de1f3` didn't work. Any other changes to be made in the function? – Santosh Ghimire Jan 06 '15 at 04:58
  • How do you know there are no collisions? In any case this hash seems to be good. I'm trying right now to find a collision using SMT solver Z3 and it is computing for hours already. Normally, such simple problems are solved very quickly. – usr May 15 '15 at 15:21
  • 1
    @usr To ensure there are no collisions, I ran a brute-force test, using two bit sets (one for positive, one for negative values). The test takes a few minutes. The source code of the test is in the linked multi-threaded test program, CalculateHashConstant.java, method getCollisionCount(). I'm sure there are faster ways to test it, and I'm quite sure there is a mathematical proof, but it's enough for what I needed. – Thomas Mueller May 15 '15 at 16:18
  • @anytoe what is an uniform length in the context of integers? – Thomas Mueller Sep 06 '16 at 11:14
  • @ThomasMueller Good question - it's integer based of course. I asked because I need the same number of digits each time, or at least less than a specific number (and then fill up with 0). Does it make sense now? – Anytoe Sep 06 '16 at 12:25
  • 1
    @anytoe sounds like a different question to me (truncating / formatting an integer to a fixed number of digits, base 10). – Thomas Mueller Sep 06 '16 at 12:50
  • Can you talk more about how to design for certain bits. For instance, I need a hash function to hash integers < 2^24, to an integer < 2^24. i.e. I want the key and value to be both in the range of [0, 2^24). – cxs1031 Apr 27 '17 at 02:13
  • 1
    @cxs1031 as you can see the basic design is similar for 32 and 64 bits. I guess for different sizes the same design can be used. As for which constants to use, you would need to test that yourself. You could use my [test program](https://github.com/h2database/h2database/blob/master/h2/src/test/org/h2/test/store/CalculateHashConstant.java) as a starting point. – Thomas Mueller Apr 27 '17 at 18:08
  • @ThomasMueller Thanks! Can I ask how to choose the candidate constants? I think the constants you used in your original program for testing 32 bits are too large for 24 bits. I suppose they should be all within the range of [0, 2^24) in my case. But the smallest constant you used(0x10b5385) has a set bit larger than 2^24. – cxs1031 Apr 28 '17 at 20:51
  • 1
    @cxs1031 I'm sorry I will not be able to help you. You will need to read the code, and try to adopt it for your use case. – Thomas Mueller Apr 29 '17 at 11:57
  • Does the 64-bits version also have one-to-one mapping (producing unique results) for the full range [0, UINT64_MAX]? – plasmacel Aug 11 '17 at 14:20
  • @plasmacel I'm afraid I don't know, but I think no, as it's not using shift by 32. Also, I'm not sure if there are no conflicts that way. – Thomas Mueller Aug 15 '17 at 10:46
  • @ThomasMueller Your 32-bits version is bijective (as it would be expected from a good bit mixing function), so I assume the 64bit version should be also, however its hard to test empirically. I made a loop running from `0` to `UINT64_MAX`, parallelized by OpenMP on a quadcore Core i7, but after running 10 minutes I terminated the process. I will repeat the process when I don't use the computer and I can keep it running as long as needed. – plasmacel Aug 15 '17 at 11:08
  • 1
    @plasmacel I think the 64-bit version is reversible as well, but it's more complicated: e.g. `x ^ (x >> 31)` is reversed using `x ^ (x >> 31) ^ (x >> 62)`. I will try to write and test a special reverse function. That would kind of prove it is bijective, right? – Thomas Mueller Nov 23 '17 at 17:04
  • 1
    @plasmacel I added the unhash methods for both 32 bit and 64 bit now. – Thomas Mueller Nov 24 '17 at 07:41
  • Thanks, this is excellent. If I could give you one more upvote I would do so. – plasmacel Nov 24 '17 at 07:50
  • @ThomasMueller I tried using ">>>" in Java and it yielded a negative integer. ">>" resulted in positive integers only. Did you get the logical and arithmetical shift confused here? Or did I make some mistake? – FooBar Feb 05 '18 at 20:26
  • 1
    @MarkusK from what I read, ">>>" shouldn't yield [negative integers](https://stackoverflow.com/questions/2811319/difference-between-and)... Could you re-check? – Thomas Mueller Feb 06 '18 at 05:22
  • @ThomasMueller “>>>“ itself does indeed produce positive numbers. The hash function, however, may return negative numbers if this operator is used (due to overflows?). I thought this behaviour was not desired. It could be “fixed“ with “>>“. Please correct me if I'm wrong (especially about the desired behaviour) – FooBar Feb 12 '18 at 11:43
  • 1
    @MarkusK ">>>" yields positive or 0. Yes, the hash function may return negative numbers (in Java), that's desired. Java only has signed integers, and the sign is really bit #31 (highest bit; the lowest bit being bit #0). It is desired to return 32 bits. For C / C++, you could use unsigned integers. – Thomas Mueller Feb 12 '18 at 16:07
  • Is there some reference how you came up with this hash? – Albert Mar 27 '18 at 14:43
  • @Albert The 32-bit hash is modelled after the 64-bit hash, just simpler (for speed), and with a different magic number, which was found as described in the answer. The 64-bit hash is based on splitmix64, see link in the answer. – Thomas Mueller Mar 28 '18 at 07:31
  • 1
    This should be the accepted answer. In fact, this is in my all-time hall-of-fame best stackoverflow answers ever. Simply amazing. – Adam Brown Sep 04 '20 at 10:49
  • Can we build some sort of shrine for you? @ThomasMueller Ah nevermind, I saw you're the maintainer of the H2 database, I guess there's a shrine already. Excellent routine, love it. – Stefan Reich Mar 15 '22 at 12:22
  • BTW I benched this to take less than 10 CPU cycles per invocation on my Core i5 in Java. – Stefan Reich Mar 15 '22 at 12:31
  • Oops, correction: It's more like 15 cycles – Stefan Reich Mar 15 '22 at 13:00
54

Knuth's multiplicative method:

hash(i)=i*2654435761 mod 2^32

In general, you should pick a multiplier that is on the order of your hash size (2^32 in the example) and has no common factors with it. This way the hash function covers all your hash space uniformly.

Edit: The biggest disadvantage of this hash function is that it preserves divisibility, so if your integers are all divisible by 2 or by 4 (which is not uncommon), their hashes will be too. This is a problem in hash tables - you can end up with only 1/2 or 1/4 of the buckets being used.

Svante
  • 50,694
  • 11
  • 78
  • 122
Rafał Dowgird
  • 43,216
  • 11
  • 77
  • 90
  • 45
    It's a really bad hash function, albeit attached to a famous name. – Seun Osewa Aug 16 '10 at 16:00
  • 6
    It's not a bad hash function at all if used with prime table sizes. Also, it is meant for _closed_ hashing. If hash values are not uniformly distributed, multiplicative hashing ensures that collisions from one value are unlikely to "disturb" items with other hash values. – Paolo Bonzini Jun 03 '11 at 07:28
  • 14
    For the curious, this constant is chosen to be the hash size (2^32) divided by Phi – awdz9nld May 31 '12 at 11:50
  • 8
    Paolo: Knuth's method is "bad" in the sense that it does not avalanche on the upper bits – awdz9nld May 31 '12 at 11:51
  • 1
    I don't understand where 2654435761 comes from. 2^32 / phi = 2654435769.5, which is close but clearly not the same. – karadoc Oct 22 '13 at 08:18
  • 1
    @karadoc: I'm guessing he simply didn't use a very precise calculator when calculating that -- that, or the constant itself was only approximated to N places. Or both. – Tim Čas Dec 05 '13 at 01:01
  • @TimČas That sounds likely, but it makes me wonder if theres even anything special about that number. Like, would the hash be any worse if I used, say, 2577310911 instead? I'm not completely sure why it's best to have 'no common factors' (presumably wrt the number being hashed), but if that's true then maybe we should be using a prime number.(?) – karadoc Dec 05 '13 at 12:05
  • 13
    On closer inspection, it turns out 2654435761 is actually a prime number. So that's probably why it was chosen rather than 2654435769. – karadoc Dec 05 '13 at 12:13
  • 1
    Using prime table sizes basically means you're applying a poor man's hash again with the prime modulo, it's only effective with poor hash functions, and detrimental with good ones – Eric Grange Oct 07 '14 at 10:45
  • 4
    A note of interest is that this function has very poor avalanche properties (it spreads values in a non-uniform way), and as such may cause clustering when used in for example hash tables – awdz9nld Jan 14 '15 at 22:02
  • 5
    The answer fails to mention that the *upper bits* of the hash value must be used for good performance (right shift for table sizes that are powers of two). Using `hash(i) % table_size` results in a huge number of collisions. – nwellnhof Mar 06 '16 at 14:59
  • 5
    Please stop naming this hash function "Knuth's multiplicative method". This is not what Knuth proposed. – InsideLoop Jan 09 '17 at 10:51
  • can you explain what do you mean by `in the order of your hash size`, and why 2654435761 is `in the order of` 2^32? Do you mean that 2654435761 is the largest prime number in the range of [0, 2^32) ? Thanks! – cxs1031 Sep 27 '17 at 19:09
  • Makes perfect sense for cases like dividing up a table into shards based upon a sequential primary key – Andy Dec 21 '18 at 20:09
  • @cxs1031 he means the same order of magnitude. In computer science this often means the same number of bits. – Wolfgang Brehm Sep 18 '19 at 11:06
  • @EricGrange — I don't think that's correct that a prime modulus is detrimental with good hashes. With a good hash, the modulus should be irrelevant. And if it's speed you're worried about, you can compute the modulus efficiently using the multiplicative inverse rather than hardware division. – Todd Lehman Apr 10 '20 at 00:05
  • This function is *extremely* slow compared to a proper hash function. Test it first before using it and do a proper performance evaluation. – Alexandre M Jun 01 '21 at 22:42
  • @AlexandreM Source? Are you talking about 1980s microcontrollers? Are you talking about interpreted languages with inefficient multiplication? Are you talking about extended instruction sets for accelerating these types of computations? Some of the best hash and fastest functions use 64-bit multiplication. In fact, a single multiplication operation is very fast on modern processors. I think you may be mistaken. – bryc Jun 07 '21 at 06:12
  • @InsideLoop Can you provide a source or some context? I see nothing to confirm nor deny its origin. So if you think you can make such a claim, please back it up with reproducible facts. – bryc Jun 07 '21 at 06:18
  • @bryc I've already pointed the source in my own comment: *a real performance test*. This "hash" performs a multiplication + a modulo operation, both are much more expensive than proper hash functions. DJB hash, for instance, is simpler and orders of magnitude faster than this. – Alexandre M Jun 08 '21 at 02:15
  • @AlexandreM Your comment is an anecdote, not a source. DJB also uses multiplication (using 33) just like this hash, and as it does not pass _any_ tests in SMHasher, I hardly consider it a "proper" hash. I'd love to test your benchmark test code that lead to this conclusion if you could provide it. `Mod 2^32` is not actually modular arithmetic, because it is the same result as `AND 0xFFFFFFFF`, or mere unsigned 32-bit integer overflow. Modular arithmetic is only used in mathematical definitions. – bryc Jun 08 '21 at 03:56
  • @AlexandreM If one uses type unsigned int, and skip the initial value of hash, knuth's hash (`hash = (hash + key[i]) * 2654435761`) is essentially the same algorithm as DJB (`hash = (hash + key[i]) * 33`). The only difference is that the multiplier is smaller. As I said before, modern processors have no trouble doing 64-bit multiplication, which is why it is used in modern hashing algorithms such as xxHash3, and wyhash. Though admittedly, the unmodified "Knuth algorithm, `hash = key[i] * 2654435761` is extremely weak as a general hash as it cannot handle keys containing null bytes. – bryc Jun 08 '21 at 04:51
  • @AlexandreM I don't appreciate the assumption that I "don't know what I'm talking about", there is nothing wrong with what I said and you've not provided actual benchmarks to test your claim. DJB uses a a multiplier of 33; that was the original given constant. Similarly, the prime multiplier 31 had been used in the hash function in Java and Emacs. Yes, `n * 33` can be implemented with `(n << 5) + n`. `n * 31` can be implemented with `(n << 5) - n`. In terms of performance on modern CPUs, a single `imul` instruction is faster than `shl + add/sub`, it is only slower on old CPUs like 8086. – bryc Jun 09 '21 at 18:40
  • @bryc I don't mind, but for the sake of coherence I suggest you start by posting your own benchmark result that proves that: "on modern CPUs, a single imul instruction is faster than shl + add/sub, it is only slower on old CPUs like 8086". – Alexandre M Jun 09 '21 at 23:41
  • 1
    I deleted some rude comments here, but these Q&A are relevant to the claim about performance of a single multiplication vs. shift+arithmetic: https://stackoverflow.com/q/37925143/x86-64-is-imul-faster-than-2x-shl-2x-add, and https://stackoverflow.com/q/40564004/multiplication-with-constant-imul-or-shl-add-combination, and https://stackoverflow.com/q/57110383/x86-multiplication-with-3-imul-vs-shl-add. If you want to start a discussion about the advantages and disadvantages, please do so as a new question, rather than extending this extremely long comment thread any further. – Cody Gray - on strike Jun 10 '21 at 06:47
  • Note. ■ Knuth did suggest this (see the source in https://stackoverflow.com/a/41537995/5267751). ■ there are a few notes needed to use it effectively. (e.g. don't compute hash value mod 2^n for small n) ■ The answer linked above claims that Knuth did not suggest using an adjacent prime at all, in fact he only said to use inverse of golden ratio. ■ (Anecdote: Knuth used some hash function in his famous TeX program that does not perform well for keys consist of integer values https://tex.stackexchange.com/q/147966/250119 although he most likely didn't intend it to be used that way) c – user202729 Nov 21 '22 at 02:25
32

Depends on how your data is distributed. For a simple counter, the simplest function

f(i) = i

will be good (I suspect optimal, but I can't prove it).

erikkallen
  • 33,800
  • 13
  • 85
  • 120
  • 1
    Proof by contradiction. Assume that bucket b has more than one key: k1 and k2 where k1 != k2. Then f(k1) = k1 = b and f(k2) = k2 = b, by transitivity k1 = k2. QED. Although this function means that the size of your hash table is the size of your input set which completely defeats the point of a hash – Tyler McHenry Mar 19 '09 at 21:05
  • @Tyler: The hash container will use f(i) % size to decide which bucket to put the value in. However, only f(i) is the hash function, the modulo operation belongs to the hash table. The object (integer) can't know which hash table it will be put in. – erikkallen Mar 20 '09 at 16:35
  • 3
    The problem with this is that it's common to have large sets of integers that are divisible by a common factor (word-aligned memory adresses etc.). Now if your hash table happens to be divisible by the same factor, you end up with only half (or 1/4, 1/8, etc.) buckets used. – Rafał Dowgird Mar 20 '09 at 16:56
  • 10
    @Rafal: That's why the response says "for a simple counter" and "Depends on how your data is distributed" – erikkallen Mar 21 '09 at 12:17
  • 6
    That's actually the implementation by Sun of the method hashCode() in java.lang.Integer http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/lang/Integer.java#Integer.hashCode%28%29 – Juande Carrion Oct 04 '12 at 16:56
  • 5
    @JuandeCarrion That is misleading because that is not the hash that is being used. After moving over to using power of two table sizes, Java rehashes every hash returned from `.hashCode()`, see [here](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/6-b14/java/util/HashMap.java#HashMap.hash%28int%29). – Esailija Jun 01 '13 at 00:49
  • 8
    The identity function is fairly useless as a hash in many practical applications due to its distributive properties (or lack thereof), unless, of course, locality is a desired attribute – awdz9nld Jan 20 '15 at 21:34
25

Fast and good hash functions can be composed from fast permutations with lesser qualities, like

  • multiplication with an uneven integer
  • binary rotations
  • xorshift

to yield a hash function with superior qualities, as demonstrated with PCG for random number generation.

This is in fact also the recipe that rrxmrrxmsx_0 and MurmurHash use, knowingly or unknowingly.

I personally found

uint64_t xorshift(const uint64_t& n, int i) {
  return n ^ (n >> i);
}
uint64_t hash(const uint64_t& n) {
  uint64_t p = 0x5555555555555555ull;   // pattern of alternating 0 and 1
  uint64_t c = 17316035218449499591ull; // random uneven integer constant
  return c * xorshift(p * xorshift(n, 32), 32);
}

to be good enough.

A good hash function should

  1. be bijective, if possible, so as not to lose information and to have the fewest collisions
  2. cascade as much and as evenly as possible, i.e. each input bit should flip every output bit with probability 0.5.

Let's first look at the identity function. It satisfies 1. but not 2. :

identity function

Input bit n determines output bit n with a correlation of 100% (red) and no others, which are therefore blue, giving a single perfect red line across the diagonal.

A xorshift(n,32) is not much better, yielding only one and a half lines. It still satisfies 1., because it is invertible with a second application.

xorshift

A multiplication with an uneven integer ("Knuth's multiplicative method") is much better, cascading more strongly and flipping more output bits with a probability of 0.5 (shown in green), which is what you want. It satisfies 1. because for each uneven integer there is a multiplicative inverse.

knuth

Combining the two gives the following output, still satisfying 1. as the composition of two bijective functions yields another bijective function.

knuth•xorshift

A second application of multiplication and xorshift will yield the following:

proposed hash

Or you can use Galois field multiplications like GHash, they have become reasonably fast on modern CPUs and have superior qualities in one step.

   #include <immintrin.h> // for __m128i and _mm_clmulepi64_si128 (requires PCLMUL)

   uint64_t const inline gfmul(const uint64_t& i, const uint64_t& j) {
     // note: indexing an __m128i with [] is a GCC/Clang vector extension
     __m128i I{}; I[0] ^= i;                     // put i into the low 64 bits
     __m128i J{}; J[0] ^= j;                     // put j into the low 64 bits
     __m128i M{}; M[0] ^= 0xb000000000000000ull; // reduction constant
     __m128i X = _mm_clmulepi64_si128(I, J, 0);  // carry-less 64x64 multiply
     __m128i A = _mm_clmulepi64_si128(X, M, 0);  // first reduction step
     __m128i B = _mm_clmulepi64_si128(A, M, 0);  // second reduction step
     return A[0] ^ A[1] ^ B[1] ^ X[0] ^ X[1];    // fold back to 64 bits
   }
user202729
  • 3,358
  • 3
  • 25
  • 36
Wolfgang Brehm
  • 1,491
  • 18
  • 21
  • 1
    gfmul: The code appears to be pseudo-code, since afaik you can't use brackets with __m128i. Still very interesting. The first line appears to say "take an unitialized __m128i (I) and xor it with (parameter) i. Should I read this as initialize I with 0 and xor with i? If so, would it be the same as load I with i and perform a not (operation) on I? – Jan May 16 '20 at 11:18
  • @Jan what I'd want is to do is `__m128i I = i; //set the lower 64 bits`, but that I can't, so I'm using `^=`. `0^1 = 1` therefore no not involed. Regarding the initialisation with `{}` my compiler never complained, it might not be the best solution, but what I want with that is initialise all of it to 0 so I can do `^=` or `|=` . I think I based that code on [this blogpost](https://blog.quarkslab.com/reversing-a-finite-field-multiplication-optimization.html) which also gives the inversion, very usefull :D – Wolfgang Brehm May 16 '20 at 11:43
  • 1
    Why should a hash function be reversible? – Violet Giraffe Jan 20 '23 at 00:19
  • 2
    @VioletGiraffe If a hash is not bijective, it means that the output distribution cannot be uniform. If a hash is bijective, it means that it uses as much of the output space as possible. That is why it is recommended that a hash is bijective, making it reversible, but only disregarding computational feasibility. – Wolfgang Brehm Jan 20 '23 at 11:06
  • 1
    @VioletGiraffe If a hash is bijective it is invertible in principle. That does not mean that it is computationally feasible to invert it. Only very specific applications require that a hash be invertible easily. Crypographic applications require hashes to be very hard to invert. You can have a Bijective hash that is very hard to invert (if you want one let me know how to contact you, this comment is too short). And you can have a non-bijective hash that is easily reversible, up to the ambiguities. – Wolfgang Brehm Jan 20 '23 at 11:11
  • 1
    I see, thank you, I think now I understand why it's desirable for the hash function to be bijective (regardless of whether or not it is invertible in practice). – Violet Giraffe Jan 20 '23 at 11:48
7

This page lists some simple hash functions that tend to perform decently in general, but any simple hash has pathological cases where it doesn't work well.

Tyler McHenry
  • 74,820
  • 18
  • 121
  • 166
7
  • 32-bit multiplicative method (very fast), see @rafal

    #define hash32(x) ((x)*2654435761)
    #define H_BITS 24 // Hashtable size
    #define H_SHIFT (32-H_BITS)
    unsigned hashtab[1<<H_BITS];
    ....
    unsigned slot = hash32(x) >> H_SHIFT;
    
  • 32-bit and 64-bit (good distribution): MurmurHash

  • Integer Hash Function
leventov
  • 14,760
  • 11
  • 69
  • 98
bill
  • 1,321
  • 11
  • 9
5

I have been using splitmix64 (mentioned in Thomas Mueller's answer) ever since I found this thread. However, I recently stumbled upon Pelle Evensen's rrxmrrxmsx_0, which yields a tremendously better statistical distribution than the original MurmurHash3 finalizer and its successors (splitmix64 and other mixes). Here is the code snippet in C:

#include <stdint.h>

static inline uint64_t ror64(uint64_t v, int r) {
    return (v >> r) | (v << (64 - r));
}

uint64_t rrxmrrxmsx_0(uint64_t v) {
    v ^= ror64(v, 25) ^ ror64(v, 50);
    v *= 0xA24BAED4963EE407UL;
    v ^= ror64(v, 24) ^ ror64(v, 49);
    v *= 0x9FB21C651E98DF25UL;
    return v ^ v >> 28;
}

Pelle also provides an in-depth analysis of the 64-bit mixer used in the final step of MurmurHash3 and the more recent variants.

Frederico Schardong
  • 1,946
  • 6
  • 38
  • 62
  • 3
    This function is not bijective. For all v where v = ror(v,25), namely all-0 and all-1, it will produce the same output in two places. For all values v = ror64(v, 24) ^ ror64(v, 49), which are at least two more, and the same with v = ror(v,28), yielding another 2^4, totaling around 22 unnecessary collisions. Two applications of splitmix are probably just as good and just as fast, but still invertible and collision-free. – Wolfgang Brehm Sep 13 '19 at 15:02
3

Some engineers say the golden ratio prime number (2654435761) is a bad choice for random hash values. With my testing results, I found that this is not true; instead, 2654435761 distributes the hash values pretty well.

#define MCR_HashTableSize (1 << 10)  /* 2^10; note: 2^10 in C is XOR, not a power */

unsigned int
Hash_UInt_GRPrimeNumber(unsigned int key)
{
  key = key * 2654435761u & (MCR_HashTableSize - 1);
  return key;
}

The hash table size must be a power of two.

I have written a test program to evaluate many hash functions for integers; the results show that GRPrimeNumber is a pretty good choice.

I have tried:

  1. total_data_entry_number / total_bucket_number = 2, 3, 4; where total_bucket_number = hash table size;
  2. map the hash-value domain into the bucket-index domain; that is, convert a hash value into a bucket index by a logical AND with (hash_table_size - 1), as shown in Hash_UInt_GRPrimeNumber();
  3. calculate the collision count of each bucket;
  4. record the buckets that no key maps to, i.e. the empty buckets;
  5. find the maximum collision count over all buckets, i.e. the longest chain length;

With my testing results, I found that the Golden Ratio Prime Number consistently has the fewest empty buckets (often none) and the shortest collision chains.

Some hash functions for integers are claimed to be good, but the testing results show that when total_data_entry / total_bucket_number = 3, the longest chain length is bigger than 10 (max collision number > 10) and many buckets are left unmapped (empty). That is very bad compared with the result of zero empty buckets and a longest chain length of 3 from Golden Ratio Prime Number hashing.

BTW, with my testing results, I found one version of the shifting-xor hash functions to be pretty good (it's shared by mikera):

unsigned int Hash_UInt_M3(unsigned int key)
{
  key ^= (key << 13);
  key ^= (key >> 17);    
  key ^= (key << 5); 
  return key;
}
  • 2
    But then why not shift the product right, so you keep the most-mixed bits? That was the way it was supposed to work – harold Feb 15 '19 at 12:28
  • 1
    @harold, the golden ratio prime number is carefully chosen; I don't think it will make any difference, but I will test whether "the most-mixed bits" are much better. My point is that "It's not a good choice." is not true: as the testing results show, just grabbing the lower part of the bits is good enough, and even better than many hash functions. – Chen-ChungChia Feb 15 '19 at 13:18
  • (2654435761, 4295203489) is a golden ratio of primes. – Chen-ChungChia Feb 15 '19 at 13:47
  • (1640565991, 2654435761) is also a golden ratio of primes. – Chen-ChungChia Feb 15 '19 at 14:12
    @harold, Shifting the product right makes it worse; even shifting right by just 1 position (dividing by 2) makes it worse (still zero empty buckets, but the longest chain gets longer), and shifting right by more positions makes the result worse still. Why? I think the reason is that shifting the product right makes more hash values share common factors, but that is just my guess; the real reason involves number theory. – Chen-ChungChia Feb 16 '19 at 06:29
  • @Chen-ChungChia I think this is because the low bits preserve the bijectivity and will produce no collisions at all if all values are less than MCR_HashTableSize. – Wolfgang Brehm Sep 13 '19 at 14:03
3

There's a nice overview of some hash algorithms at Eternally Confuzzled. I'd recommend Bob Jenkins' one-at-a-time hash, which quickly reaches avalanche and can therefore be used for efficient hash table lookup.

Christoph
  • 164,997
  • 36
  • 182
  • 240
  • 4
    That is a good article, but it is focused on hashing string keys, not integers. – Adrian Mouat Jun 17 '10 at 10:52
  • 1
    Just to be clear, although the methods in the article would work for integers (or could be adapted to), I assume there are more efficient algorithms for integers. – Adrian Mouat Jun 17 '10 at 11:06
1

I don't think we can say that a hash function is "good" without knowing your data in advance, and without knowing what you're going to do with it.

There are better data structures than hash tables for unknown data sizes (I'm assuming you're doing the hashing for a hash table here). I would personally use a hash table when I know I have a "finite" number of elements that need to be stored in a limited amount of memory. I would try to do a quick statistical analysis of my data and see how it is distributed before I start thinking about my hash function.

Ouanixi
  • 116
  • 1
  • 10
1

The answer depends on a lot of things like:

  • Where do you intend to employ it?
  • What are you trying to do with the hash?
  • Do you need a cryptographically secure hash function?

I suggest that you take a look at the Merkle-Damgård family of hash functions, like SHA-1, etc.

jscs
  • 63,694
  • 13
  • 151
  • 195
dirkgently
  • 108,024
  • 16
  • 131
  • 187