What are good methods for hashing bits in an Int32 or UInt32?

Question

I have an implementation of a pseudo random number generator, specifically of George Marsaglia's XOR-Shift RNG. My implementation is here:

FastRandom.cs

It turns out that the first random sample is very closely correlated with the seed, which is fairly obvious if you take a look at the Reinitialise(int seed) method. This is bad. My proposed solution is to mix up the bits of the seed as follows:

_x = (uint)(  (seed * 2147483647) 
           ^ ((seed << 16 | seed >> 48) * 28111) 
           ^ ((seed << 32 | seed >> 32) * 69001)
           ^ ((seed << 48 | seed >> 16) * 45083));

So I have significantly weakened any correlation by multiplying the seed's bits with four primes and XORing back to form _x. I also rotate the seed's bits before multiplication to ensure that bits of varying magnitudes get mixed up across the full range of values for a 32 bit value.

The four-way rotation just seemed liked a nice balance between doing nothing and every possible rotation (32). The primes are 'finger in the air' - enough magnitude and bit structure to jumble up the bits and 'spread' them over the full 32 bits regardless of the starting seed.

Should I use bigger primes? Is there a standard approach to this problem, perhaps with a more formal basis? I am trying to do this with minimal CPU overhead.

Thanks

=== UPDATE ===

I decided to use some primes with set bits better distributed across all 32 bits. The result is that I can omit the shifts as the multiplications achieve the same effect (hashing bits across the full range of 32 bits), so I then just add the four products to give the final seed...

_x = (uint)(  (seed * 1431655781) 
            + (seed * 1183186591) 
            + (seed * 622729787)
            + (seed * 338294347));

I could possibly get away with fewer primes/multiplciations. Two seemed too few (I could still see patterns in the first samples), three looked OK, so for a safety margin I made it four.

=== UPDATE 2 ===

FYI the above reduces to the functionally equivalent:

_x = seed * 3575866506U;

I didn't spot this initially and when I did I was wondering if overflowing at different stages in the calculation would cause a different result. I believe the answer is no - the two calculations always give the same answer.

`It turns out that the first random sample is very closely correlated with the seed` -- That's why it's called a *pseudo* random number generator. No matter which algorithm you use to "randomize" the seed, the same seed will still produce the same opening sequence. If you really want to mix it up, obtain some of the lower-order bits of the CPU clock, and use that to "randomize" your seed. — Robert Harvey, Oct 03 '12 at 21:52
@RobertHarvey Specifically there is a correlation between which bits are set, not just the predictability of the sequence (which we obviously can't overcome). So e.g. seeds 1, 2, 3 might generate a first value of 1001, 1002, 1003 (as broad example of the problem). — redcalx, Oct 03 '12 at 21:56
Also using the clock is bad if you init multiple rngs within a clock cycle (which is an issue I have encountered at times). — redcalx, Oct 03 '12 at 21:57
Can you just throw away the first value, or maybe the first n values randomly determined by the clock? — Robert Harvey, Oct 03 '12 at 21:57
Good idea. I tried that but it seems that this particular type of RNG has bitwise correlation between the Nth values following reinit. I tried discarding the first 4 samples but the problem wasn't going away, so decided to try this approach instead. — redcalx, Oct 03 '12 at 22:00
See Marsaglia, George. (2003). Xorshift RNGs. http://www.jstatsoft.org/v08/i14/paper — redcalx, Oct 03 '12 at 22:03
Of course you always get the same output for a given seed. Just as you will always get the same output sequence for a given seed. It wouldn't matter if you throw away the first 100 results and then started tracking the output. The algorithm has no new inputs. Without new random inputs into it as you generate output sequence, it will always generate the same output given the same seed. So, to change that, add in some new imput. You could get it from a local device, like CPU temperature, fan speed, etc. Add in multiple sources if you want to make it alter the generated sequence/output. — StarPilot, Oct 03 '12 at 22:14
@StarPilot Indeed, but that is not the point. The First number for seed 1 has most of the same bits set as for seed 2, etc. If however we hash the seed bits then that correlation is broken. This is what the example code does above, but that is just something I cooked up. It then occurred to me that this is essentially an In32 hash function and that there maybe some standard Int32 hash functions that I can use that are well tested and understood, thus avoiding any further pitfalls (patterns in the RNG). — redcalx, Oct 04 '12 at 08:22
If you look for 32-bit hash function you may want to look at this post: [32-bit checksum algorithm better quality than CRC32](http://stackoverflow.com/questions/8397438/32-bit-checksum-algorithm-better-quality-than-crc32) — Artemix, Dec 05 '12 at 16:02
Your (lagged fibonacci hybrid) xorshift generator uses the triplet 8,11,19, which does not have maximal period (you probably meant to use 9,11,19?). Also, the shift-xor operations seem a bit weird. `t` (a.k.a. `x`) is xored with two of shifted values of itself, and that is xored with `w` and a shifted value of `w`. You're normally supposed to xor a value three times with shifted values of the respective temporary. I'm not sure whether what you do is equivalent. — Damon, Dec 05 '12 at 17:19
You can give a look at ScriptSharper NameHasher (nested in NameTable) class (https://github.com/nikhilk/scriptsharp/blob/cc/src/Core/Compiler/Parser/NameTable.cs#L99). — gsscoder, Jan 30 '13 at 18:49
See also https://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key — Thomas Mueller, Sep 06 '20 at 07:11

score 3 · Answer 1 · edited May 02 '21 at 07:24

3

According to some researchers, CrapWow, Crap8 and Murmur3 are the best non-cryptographic hash algorithms available today that are both fast, simple and statistically good.

More information is available at Non-Cryptographic Hash Function Zoo.

Edit: As of May, 2021 the floodberry.com links to the Non-Cryptographic Hash Function Zoo are not valid. The content can still be found on archive.org.

edited May 02 '21 at 07:24

aghast

14,785
3
24
56

answered Feb 23 '13 at 21:02

ogggre

2,204
1
23
19

6

The links are dead and Google doesn't show any obvious mirror. – Evgeni Sergeev May 31 '16 at 15:23

What are good methods for hashing bits in an Int32 or UInt32?

1 Answers1

Linked