19

What are some simple ways to hash a 32-bit integer (e.g. an IP address or a Unix time_t) down to a 16-bit integer?

E.g. hash_32b_to_16b(0x12345678) might return 0xABCD.

Let's start with this as a horrible but functional example solution:

function hash_32b_to_16b(val32b) {
    return val32b % 0xffff;
}

Question is specifically about JavaScript, but feel free to add any language-neutral solutions, preferably without using library functions.

The context for this question is generating unique IDs (e.g. a 64-bit ID might be composed of several 16-bit hashes of various 32-bit values). Avoiding collisions is important.

Simple = good. Wacky+obfuscated = amusing.

dkamins
  • 1
    XOR the high 2 bytes with the low 2 bytes? 0x1234 XOR 0x5678. But you can't tag the question with 'cryptography' and ask for something like this... – Remus Rusanu Jun 17 '10 at 00:35
  • @Remus: Why can't I tag it 'cryptography'? Isn't this a distilled & extremely simple crypto-related question? P.S. Why not post your comment as an answer? – dkamins Jun 17 '10 at 00:39
  • To Remus's point, I agree that this isn't about cryptography. If I'm thinking about this right, your 16-bit hash will map to one of two 32-bit integers. I'm curious about the particular problem you're trying to solve, and I hope it has nothing to do with security. – John Bledsoe Jun 17 '10 at 00:43
  • 1
    In the same fashion as the previous comment, because there's no way to represent the same amount of uniqueness in a 32bit number with a 16bit number, you may as well just take the one half of the digits. e.g. 0x1234 or 0x5678. In this way, at least the loss of uniqueness is hopefully really obvious to future maintainers of the code. – lzcd Jun 17 '10 at 00:45
  • @Remus, @jmbledsoe: I removed the "cryptography" tag. I think it's relevant, but I don't want that issue to distract from the question. – dkamins Jun 17 '10 at 00:47
  • FYI the context for this question is generating unique IDs. – dkamins Jun 17 '10 at 00:48
  • @lzcd: I'm asking for a hash function. BY DEFINITION the result will not have the same amount of uniqueness as the source data. – dkamins Jun 17 '10 at 00:50
  • What are the requirements? No solution is going to be universally better than another, but one might stand out above the rest if we knew what your input values were and what the output value is used for. – hobbs Jun 17 '10 at 00:50
  • It's important to factor in how widely distributed your input values will be. Just because your inputs are within a 32-bit space doesn't mean they're evenly distributed. You likely want a hash that gives fairly evenly distributed results over your likely input range, not over your entire possible input range. – fencepost Jun 17 '10 at 00:59
  • 1
    Cryptographic is one possible kind of "good" for hashes. It implies a certain amount of divorce between the plaintext and the hash. The first comment here doesn't have that quality (cryptographic), but is still a good hash for many uses. – Slartibartfast Jun 17 '10 at 01:18
  • 17
    The following page has several implementations of general purpose hash functions that are efficient and exhibit minimal collisions: http://partow.net/programming/hashfunctions/index.html –  Oct 31 '10 at 23:12

6 Answers

9

The key to maximizing the preservation of entropy of some original 32-bit 'signal' is to ensure that each of the 32 input bits has an independent and equal ability to alter the value of the 16-bit output word.

Since the OP is requesting a bit-size which is exactly half of the original, the simplest way to satisfy this criterion is to xor the upper and lower halves, as others have mentioned. Using xor is optimal because, as is obvious from the definition of xor, independently flipping any one of the 32 input bits is guaranteed to change the value of the 16-bit output.

The problem becomes more interesting when you need further reduction beyond just half-the-size, say from a 32-bit input to, let's say, a 2-bit output. Remember, the goal is to preserve as much entropy from the source as possible, so solutions which involve naively masking off the two lowest bits with (i & 3) are generally heading in the wrong direction; doing that guarantees that there's no way for any bits except the unmasked bits to affect the result, and that generally means there's an arbitrary, possibly valuable part of the runtime signal which is being summarily discarded without principle.

Following from the earlier paragraph, you could of course iterate with xor three additional times to produce a 2-bit output with the desired property of being equally-influenced by each/any of the input bits. That solution is still optimally correct of course, but involves looping or multiple unrolled operations which, as it turns out, aren't necessary!
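The repeated xor folding described in the previous paragraph could be sketched as follows. The function name `xorFoldTo2Bits` is hypothetical and only serves as illustration; each step xors the upper half of the remaining bits into the lower half.

```javascript
// Hypothetical helper: fold a 32-bit value down to 2 bits with repeated xor.
function xorFoldTo2Bits(val32b) {
    let v = val32b >>> 0;            // normalize to an unsigned 32-bit value
    v = (v >>> 16) ^ (v & 0xffff);   // 32 -> 16 bits (the fold from the question)
    v = (v >>> 8) ^ (v & 0xff);      // 16 -> 8 bits
    v = (v >>> 4) ^ (v & 0xf);       // 8 -> 4 bits
    v = (v >>> 2) ^ (v & 0x3);       // 4 -> 2 bits
    return v;
}
```

Because every step is a plain xor, flipping any single input bit flips exactly one of the two output bits.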

Fortunately, there is a nice technique of only two operations which gives the same optimal result for this situation. As with xor, it not only ensures that, for any given 32-bit value, twiddling any input bit will result in a change to the 2-bit output, but also that, given a uniform distribution of input values, the distribution of 2-bit output values will also be perfectly uniform. In the current example, the method divides the 4,294,967,296 possible input values into exactly 1,073,741,824 each of the four possible 2-bit hash results { 0, 1, 2, 3 }.

The method I mention here uses specific magic values that I discovered via exhaustive search, and which don't seem to be discussed very much elsewhere on the internet, at least for the particular use under discussion here (i.e., ensuring a uniform hash distribution that's maximally entropy-preserving). Curiously, according to this same exhaustive search, the magic values are in fact unique, meaning that for each of target bit-widths { 16, 8, 4, 2 }, the magic value I show below is the only value that, when used as I show here, satisfies the perfect hashing criteria outlined above.

Without further ado, the unique and mathematically optimal procedure for hashing 32-bits to n = { 16, 8, 4, 2 } is to multiply by the magic value corresponding to n (unsigned, discarding overflow), and then take the n highest bits of the result. To isolate those result bits as a hash value in the range [0 ... (2ⁿ - 1)], simply right-shift (unsigned!) the multiplication result by 32 - n bits.

The "magic" values, and C-like expression syntax are as follows:


Method

Maximum-entropy-preserving hash for reducing 32 bits to. . .

Target Bits    Multiplier    Right Shift       Expression [1, 2]
-----------   ------------   -----------   -----------------------
    16         0x80008001        16        (i * 0x80008001) >> 16
     8         0x80808081        24        (i * 0x80808081) >> 24
     4         0x88888889        28        (i * 0x88888889) >> 28
     2         0xAAAAAAAB        30        (i * 0xAAAAAAAB) >> 30

Maximum-entropy-preserving hash for reducing 64 bits to. . .

Target Bits   Multiplier           Right Shift            Expression [1, 2]
-----------   ------------------   -----------   -------------------------------
    32        0x8000000080000001       32        (i * 0x8000000080000001) >> 32
    16        0x8000800080008001       48        (i * 0x8000800080008001) >> 48
     8        0x8080808080808081       56        (i * 0x8080808080808081) >> 56
     4        0x8888888888888889       60        (i * 0x8888888888888889) >> 60
     2        0xAAAAAAAAAAAAAAAB       62        (i * 0xAAAAAAAAAAAAAAAB) >> 62

Notes:

  1. Use unsigned multiply and discard any overflow (64-bit multiply is not needed).
  2. If isolating the result using right-shift (as shown), be sure to use an unsigned shift operation.
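Since the question is about JavaScript, where numbers are doubles and a plain `*` would lose the low bits of a large product, here is a minimal sketch of how the 32-bit table might be applied there. It assumes `Math.imul` for the overflow-discarding multiply and `>>>` for the unsigned shift; the names `MAGIC_32` and `hash32ToN` are hypothetical.

```javascript
// Multipliers copied from the 32-bit table above, keyed by output width n.
const MAGIC_32 = { 16: 0x80008001, 8: 0x80808081, 4: 0x88888889, 2: 0xAAAAAAAB };

// Hypothetical helper: hash a 32-bit value down to n bits (n in {16, 8, 4, 2}).
function hash32ToN(val32b, n) {
    // Math.imul keeps only the low 32 bits of the product (overflow discarded);
    // >>> is the unsigned right shift, so the top n bits come out non-negative.
    return Math.imul(val32b, MAGIC_32[n]) >>> (32 - n);
}

console.log(hash32ToN(1, 16).toString(16)); // 8000
```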

Further discussion

I find all this quite cool. In practical terms, the key information-theoretical requirement is the guarantee that, for any m-bit input value and its corresponding n-bit hash value result, flipping any one of the m source bits always causes some change in the n-bit result value. Now although there are 2ⁿ possible result values in total, one of them is already "in-use" (by the result itself) since "switching" to that one from any other result would be no change at all. This leaves 2ⁿ - 1 result values that are eligible to be used by the entire set of m input values flipped by a single bit.

Let's consider an example; in fact, to show how this technique might seem to border on spooky or downright magical, we'll consider the more extreme case where m = 64 and n = 2. With 2 output bits there are four possible result values, { 0, 1, 2, 3 }. Assuming an arbitrary 64-bit input value 0x7521d9318fbdf523, we obtain its 2-bit hash value of 1:

 (0x7521d9318fbdf523 * 0xAAAAAAAAAAAAAAAB) >> 62   // result -->  '1'

So the result is 1 and the claim is that no value in the set of 64 values where a single-bit of 0x7521d9318fbdf523 is toggled may have that same result value. That is, none of those 64 other results can use value 1 and all must instead use either 0, 2, or 3. So in this example it seems like every one of the 2⁶⁴ input values—to the exclusion of 64 other input values—will selfishly hog one-quarter of the output space for itself. When you consider the sheer magnitude of these interacting constraints, can a simultaneously satisfying solution overall even exist?

Well sure enough, to show that (exactly?) one does, here are the hash result values, listed in order, for the inputs obtained by flipping a single bit of 0x7521d9318fbdf523 (one at a time), from MSB (position 63) down to LSB (0):

3 2 0 3 3 3 3 3 3 0 0 0 3 0 3 3 0 3 3 3 0 0 3 3 3 0 0 3 3 0 3 3  // continued…
0 0 3 0 0 3 0 3 0 0 0 3 0 3 3 3 0 3 0 3 3 3 3 3 3 0 0 0 3 0 0 3  // notice: no '1' values

As you can see, there are no 1 values, which entails that every bit in the source "as-is" must be contributing to the result (or, if you prefer, the de facto state of each and every bit in 0x7521d9318fbdf523 is essential to keeping the overall result at 1): no matter what single-bit change you make to the 64-bit input, the 2-bit result value will no longer be 1.

Keep in mind that the "missing-value" table shown above was dumped from the analysis of just the one randomly-chosen example value 0x7521d9318fbdf523; every other possible input value has a similar table of its own, each one eerily missing its owner's actual result value while yet somehow being globally consistent across its set-membership. This property essentially corresponds to maximally preserving the available entropy during the (inherently lossy) bit-width reduction task.
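The single-bit-flip check for the example value can be reproduced in JavaScript with BigInt. This is only a sketch, assuming an environment with BigInt support; `hash64To2` is a hypothetical name, and `BigInt.asUintN(64, ...)` stands in for "unsigned multiply, discarding overflow".

```javascript
// Multiplier for the 64-bit to 2-bit row of the table above.
const MAGIC_64_TO_2 = 0xAAAAAAAAAAAAAAABn;

// Hypothetical helper: hash a 64-bit BigInt down to 2 bits.
function hash64To2(val64b) {
    // Truncate the product to 64 bits, then keep the top 2 bits.
    const product = BigInt.asUintN(64, BigInt.asUintN(64, val64b) * MAGIC_64_TO_2);
    return product >> 62n;
}

const x = 0x7521d9318fbdf523n;

// Hash every single-bit-flip neighbor of x, from bit 63 (MSB) down to bit 0 (LSB).
const flipped = [];
for (let bit = 63n; bit >= 0n; bit--) {
    flipped.push(hash64To2(x ^ (1n << bit)));
}

console.log(hash64To2(x));         // the hash of x itself
console.log(flipped.join(" "));    // the 64 neighbor hashes, MSB first
console.log(flipped.includes(hash64To2(x))); // false if no neighbor shares x's hash
```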

So we see that every one of the 2⁶⁴ possible source values independently imposes, on exactly 64 other source values, the constraint of excluding one of the possible result values. What defies my intuition about this is that there are untold quadrillions of these 64-member sets, each of whose members also belongs to 63 other, seemingly unrelated bit-twiddling sets. Yet somehow despite this most confounding puzzle of interwoven constraints, it is nevertheless trivial to exploit the one (I surmise) resolution which simultaneously satisfies them all exactly.

All this seems related to something you may have noticed in the tables above: namely, I don't see any obvious way to extend the technique to the case of compressing down to a 1-bit result. In this case, there are only two possible result values { 0, 1 }, so if any/every given (e.g.) 64-bit input value still summarily excludes its own result from being the result for all 64 of its single-bit-flip neighbors, then that now essentially imposes the other, only remaining value on those 64. The math breakdown we see in the table seems to be signalling that a simultaneous result under such conditions is a bridge too far.

In other words, the special 'information-preserving' characteristic of xor (that is, its luxuriously reliable guarantee that, as opposed to and, or, etc., it c̲a̲n̲ and w̲i̲l̲l̲ always change a bit) not surprisingly exacts a certain cost, namely, a fiercely non-negotiable demand for a certain amount of elbow room—at least 2 bits—to work with.

Glenn Slayden
6

I think this is the best you're going to get. You could compress the code to a single line but the var's are there for now as documentation:

function hash_32b_to_16b(val32b) {
    var rightBits = val32b & 0xffff; // Right-most (low) 16 bits
    var leftBits = val32b & 0xffff0000; // Left-most (high) 16 bits, still in their original position

    leftBits = leftBits >>> 16; // Unsigned shift of the left-most 16 bits down to a 16-bit value

    return rightBits ^ leftBits; // XOR the left-most and right-most halves
}

Given the parameters of the problem, the best solution would have each 16-bit hash correspond to exactly 2^16 32-bit numbers. It would also IMO hash sequential 32-bit numbers differently. Unless I'm missing something, I believe this solution does those two things.

I would argue that security cannot be a consideration in this problem, as the hashed value is just too few bits. I believe that the solution I gave provides an even distribution of 32-bit numbers to 16-bit hashes.

John Bledsoe
  • Why do you think this is the best? I think it can get an awful lot of collisions for useful and frequent numbers. – Rotsor Jun 17 '10 at 00:56
  • 3
    This isn't the best idea. The reason is that IP addresses are often assigned as contiguous subnets. This means that if the IP address A.B.C.D exists on a network then A.(B^1).C.D and A.B.C.(D^1) are slightly more likely to exist too and will get the same hash. Obviously any hash will have lots of collisions. But your scheme will have more collisions than you'd expect from hashing 32-bit integers picked uniformly. You'll get better results by churning up the bits a little more. – sigfpe Jun 17 '10 at 01:02
  • 1
    the criteria you used to assess the quality of the hash-function, hold even for the simpler one: hash = val&0xffff. However, these functions have different probability of collisions on real-life data. – Rotsor Jun 17 '10 at 01:33
  • @Rostor Ha, you are correct sir. The million-dollar question in all of this is the distribution of data that is in view. – John Bledsoe Jun 17 '10 at 03:09
3

This depends on the nature of the integers. If they can contain some bit-masks, or can differ by powers of two, then simple XORs will have high probability of collisions. You can try something like (i>>16) ^ ((i&0xffff) * p) with p being a prime number.

Security hashes like MD5 are all good, but they are obviously overkill here. Anything more complex than CRC16 is overkill.
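A minimal JavaScript sketch of the expression above might look like this. The prime 1009 is only an assumed example value (the answer does not fix `p`), and the final mask is added so the result stays within 16 bits.

```javascript
// Assumption: 1009 is just an example prime; any suitable prime could be used.
const P = 1009;

// Hypothetical helper implementing (i>>16) ^ ((i&0xffff) * p), masked to 16 bits.
function hashWithPrime(i) {
    const high = i >>> 16;      // unsigned shift, so the upper half stays in 0..0xffff
    const low = i & 0xffff;     // lower half
    // low * P can exceed 16 bits, so mask the combined value at the end.
    return (high ^ (low * P)) & 0xffff;
}

console.log(hashWithPrime(0x10001)); // 1008
```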

Rotsor
  • This is an interesting point and apparently relevant to hashing IP addresses, yes? – dkamins Jun 17 '10 at 01:43
  • Yes. For time values i&0xffff should usually be enough. (hoping that there is no sleep(65536); anywhere :)) – Rotsor Jun 17 '10 at 02:07
  • 2
    There is no way to tell what will "suffice" unless you know exactly what input data you will have. The worst-case number of collisions will still be the same. Multiplication by a prime number just makes it harder to find a real-life situation which will produce collisions systematically. (how often is your delta-time a multiple of 1009?) Why primes are better at this [is a long discussion](http://stackoverflow.com/questions/1488977/why-multiply-by-a-prime-before-xoring-in-many-gethashcode-implementations) – Rotsor Jun 17 '10 at 09:22
  • Keep in mind that the number returned can be more than 16 bits. You can do it like this `((i>>16) ^ ((i&0xffff) * p) & 0xffff)` (but I'm no expert) – clankill3r Nov 11 '18 at 21:30
2

I would say just apply a standard hash like sha1 or md5 and then grab the last 16 bits of that.
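As a rough sketch of that idea, assuming a Node.js environment (where the built-in `crypto` module is available; browsers would need a different API), the last 16 bits of a SHA-1 digest could be taken like this. The function name `hashViaSha1` and the choice of big-endian byte order for the input are assumptions for illustration.

```javascript
const crypto = require("node:crypto");

// Hypothetical helper: SHA-1 the 4 input bytes and keep the last 16 bits of the digest.
function hashViaSha1(val32b) {
    const input = Buffer.alloc(4);
    input.writeUInt32BE(val32b >>> 0, 0);          // assumed byte order: big-endian
    const digest = crypto.createHash("sha1").update(input).digest();
    return digest.readUInt16BE(digest.length - 2); // last two bytes as a 16-bit integer
}
```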

dreeves
  • Might there be issues with short input streams (like 4 bytes) for sha1 or md5? – dkamins Jun 17 '10 at 08:45
  • sha1 and md5 are typically not available in JavaScript environments. Are there slightly less secure but greatly simplified versions expressible in a few lines of JS? – dkamins Jun 17 '10 at 08:56
2

Assuming that you expect the least significant bits to 'vary' the most, I think you're probably going to get a good enough distribution by just using the lower 16-bits of the value as a hash.

If the numbers you're going to hash won't have that kind of distribution, then the additional step of xor-ing in the upper 16 bits might be helpful.

Of course this suggestion is if you're intending to use the hash merely for some sort of lookup/storage scheme and aren't looking for the crypto-related properties of non-guessability and non-reversability (which the xor-ing suggestions don't really buy you either).
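The two options mentioned here can be compared with a small sketch. The names `low16` and `xorFold16` are hypothetical; the example inputs differ only in their upper 16 bits, which is the case where taking the low bits alone collides.

```javascript
// Option 1: keep only the lower 16 bits.
function low16(val32b) {
    return val32b & 0xffff;
}

// Option 2: additionally xor in the upper 16 bits.
function xorFold16(val32b) {
    return (val32b >>> 16) ^ (val32b & 0xffff);
}

// Two inputs that differ only in the upper half.
const a = 0x00010000;
const b = 0x00020000;

console.log(low16(a) === low16(b));         // true: the low bits alone collide
console.log(xorFold16(a) === xorFold16(b)); // false: the upper bits now matter
```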

Michael Burr
0

Something simple like this....

function hash_32b_to_16b(val32b) {    
    var h = hmac(secretKey, sha512);
    var v = val32b;
    for(var i = 0; i < 4096; ++i)
        v = h(v);
    return v % 0xffff;
}
yfeldblum
  • 2
    To slow it down. This is a common technique for hashing passwords, to make it orders of magnitude more difficult to create a rainbow table or brute force passwords. – yfeldblum Jun 17 '10 at 01:56