190

Say you have two hashes H(A) and H(B) and you want to combine them. I've read that a good way to combine two hashes is to XOR them, e.g. XOR( H(A), H(B) ).

The best explanation I've found is touched on briefly in these hash function guidelines:

XORing two numbers with roughly random distribution results in another number still with roughly random distribution*, but which now depends on the two values.
...
* At each bit of the two numbers to combine, a 0 is output if the two bits are equal, else a 1. In other words, in 50% of the combinations, a 1 will be output. So if the two input bits each have a roughly 50-50 chance of being 0 or 1, then so too will the output bit.

Can you explain the intuition and/or mathematics behind why XOR should be the default operation for combining hash functions (rather than OR or AND etc.)?

Nate Murray
  • 25
    I think you just did ;) – Massa May 04 '11 at 20:13
  • 25
    note that XOR may or may not be a "good" way to "combine" hashes, depending on what you want in a "combination". XOR is commutative: XOR(H(A),H(B)) is equal to XOR(H(B),H(A)). This means that XOR is not a proper way to create a kind of hash of an ordered sequence of values, since it does not capture the order. – Thomas Pornin May 05 '11 at 13:46
  • 6
    Besides the issue with order (comment above), there is a problem with equal values. XOR(H(1), H(1))=0 (for any function H), XOR(H(2),H(2))=0 and so on. For any N: XOR(H(N),H(N))=0. Equal values occur quite often in real apps, which means the result of XOR will be 0 too often for it to be considered a good hash. – Andrei Galatyn Apr 06 '16 at 06:10
  • What do you use for ordered sequence of values ? Let's say I'd like to create a hash of timestamp or index. (MSB less important than LSB). Sorry if this thread is 1year old. – Alexis Apr 08 '17 at 09:07
  • Related: [What is the best algorithm for an overridden System.Object.GetHashCode?](http://stackoverflow.com/q/263400/11683) – GSerg May 17 '17 at 20:41
  • A word of warning: don't use XOR to combine CRC values because CRC is a linear function in the sense that CRC(a) ^ CRC(b) = CRC(a ^ b). Additionally, two equal elements will cancel out. I think summing CRC values (with addition) is okay if you want a hash of an unordered list, but I'm not 100% on that. – Dan Stahlke Mar 12 '19 at 17:54

9 Answers

231

xor is a dangerous default function to use when hashing. It is better than and and or, but that doesn't say much.

xor is symmetric, so the order of the elements is lost. So "bad" will hash combine the same as "dab".

xor maps pairwise identical values to zero, and you should avoid mapping "common" values to zero:

So (a,a) gets mapped to 0, and (b,b) also gets mapped to 0. As such pairs are almost always more common than randomness might imply, you end up with far too many collisions at zero.

With these two problems, xor ends up being a hash combiner that looks half decent on the surface, but not after further inspection.
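A minimal sketch of both problems, assuming 32-bit hash values (the helper name `xor_combine` and the sample values are made up for illustration). The assertions compile only if the stated property holds:

```cpp
#include <cstdint>

// Hypothetical helper for illustration: combine two 32-bit hashes with plain XOR.
constexpr std::uint32_t xor_combine(std::uint32_t a, std::uint32_t b) {
    return a ^ b;
}

// Order is lost: (a, b) and (b, a) always combine to the same value.
static_assert(xor_combine(0x1234u, 0xBEEFu) == xor_combine(0xBEEFu, 0x1234u),
              "xor is symmetric");

// Equal inputs cancel: every pair (x, x) collapses to zero.
static_assert(xor_combine(0xDEADBEEFu, 0xDEADBEEFu) == 0u,
              "identical pairs map to zero");
```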

On modern hardware, adding is usually about as fast as xor (it probably uses more power to pull this off, admittedly). Adding's truth table is similar to xor on the bit in question, but it also sends a bit to the next bit over when both values are 1. This means it erases less information.

So hash(a) + hash(b) is better than hash(a) xor hash(b) in that if a==b, the result is hash(a)<<1 instead of 0.

This remains symmetric; so the "bad" and "dab" getting the same result remains a problem. We can break this symmetry for a modest cost:

(hash(a)<<1) + hash(a) + hash(b)

aka hash(a)*3 + hash(b). (Calculating hash(a) once and storing it is advised if you use the shift solution.) Multiplying by any odd constant instead of 3 will bijectively map a "k-bit" unsigned integer to itself, as math on unsigned integers is done modulo 2^k for some k, and any odd constant is relatively prime to 2^k.
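A small sketch of the sum and the weighted sum, assuming 32-bit hashes (the names `add_combine`, `mul3_combine` and `kInverseOf3` are hypothetical, and `h_a`/`h_b` stand for hash(a) and hash(b)). It also checks the bijection claim for the constant 3 by exhibiting its multiplicative inverse modulo 2^32:

```cpp
#include <cstdint>

// Plain sum: unsigned arithmetic wraps modulo 2^32.
constexpr std::uint32_t add_combine(std::uint32_t h_a, std::uint32_t h_b) {
    return h_a + h_b;
}

// Weighted sum: hash(a) * 3 + hash(b).
constexpr std::uint32_t mul3_combine(std::uint32_t h_a, std::uint32_t h_b) {
    return h_a * 3u + h_b;
}

// Equal inputs no longer collapse to zero: the sum is hash(a) << 1.
static_assert(add_combine(0x1234u, 0x1234u) == (0x1234u << 1), "sum of equal hashes");

// The plain sum is still symmetric, the weighted sum is not (5 versus 7 here).
static_assert(add_combine(1u, 2u) == add_combine(2u, 1u), "sum is symmetric");
static_assert(mul3_combine(1u, 2u) != mul3_combine(2u, 1u), "weighted sum is not");

// Multiplying by 3 can be undone modulo 2^32, so no bits of hash(a) are lost.
constexpr std::uint32_t kInverseOf3 = 0xAAAAAAABu;
static_assert(static_cast<std::uint32_t>(3u * kInverseOf3) == 1u, "3 is invertible");
```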

For an even fancier version, we can examine boost::hash_combine, which is effectively:

#include <cstddef>

// Mix rhs into lhs; lhs acts as the running seed.
std::size_t hash_combine( std::size_t lhs, std::size_t rhs ) {
  lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  return lhs;
}

Here we add rhs, a constant, and two shifted versions of lhs together, then xor the sum into lhs. The constant is basically random 0s and 1s; in particular it is the inverse of the golden ratio as a 32 bit fixed point fraction. This breaks symmetry, and introduces some "noise" if the incoming hashed values are poor (i.e., imagine every component hashes to 0: the above handles it well, generating a smear of 1s and 0s after each combine. My naive 3*hash(a)+hash(b) simply outputs a 0 in that case).

Extending this to 64 bits (using the inverse of pi as a 64 bit fixed point fraction as our constant, as it is odd at 64 bits):

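#include <cstddef>

// Pick the constant by the width of size_t (if constexpr requires C++17).
std::size_t hash_combine( std::size_t lhs, std::size_t rhs ) {
  if constexpr (sizeof(std::size_t) >= 8) {
    lhs ^= rhs + 0x517cc1b727220a95 + (lhs << 6) + (lhs >> 2);
  } else {
    lhs ^= rhs + 0x9e3779b9 + (lhs << 6) + (lhs >> 2);
  }
  return lhs;
}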

(For those not familiar with C/C++, a size_t is an unsigned integer value which is big enough to describe the size of any object in memory. On a 64 bit system, it is usually a 64 bit unsigned integer. On a 32 bit system, a 32 bit unsigned integer.)
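To see how such a combiner behaves over more than two values, here is a sketch that folds it over a list of already-computed hashes. It assumes a fixed 32-bit width so the results do not depend on the platform, needs C++14 or later, and the names `hash_combine32` and `hash_sequence` are hypothetical:

```cpp
#include <cstdint>
#include <initializer_list>

// Fixed-width restatement of the 32-bit combiner (name is hypothetical).
constexpr std::uint32_t hash_combine32(std::uint32_t lhs, std::uint32_t rhs) {
    return lhs ^ (rhs + 0x9e3779b9u + (lhs << 6) + (lhs >> 2));
}

// Fold the combiner over a sequence of element hashes, starting from a seed of 0.
constexpr std::uint32_t hash_sequence(std::initializer_list<std::uint32_t> hashes) {
    std::uint32_t seed = 0;
    for (std::uint32_t h : hashes) {
        seed = hash_combine32(seed, h);
    }
    return seed;
}

// Even if every component hashes to 0, the constant injects some noise.
static_assert(hash_combine32(0u, 0u) == 0x9e3779b9u, "zero inputs give the constant");

// Order matters, and equal elements do not cancel each other out.
static_assert(hash_sequence({1u, 2u}) != hash_sequence({2u, 1u}), "not symmetric");
static_assert(hash_sequence({7u, 7u}) != 0u, "equal elements do not cancel");
```

Starting from a seed of 0 is a choice made for this sketch; any fixed starting value would do.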

Yakk - Adam Nevraumont
  • Nice answer Yakk. Does this algorithm work equally well on both 32bit and 64bit systems? Thanks. – Dave Oct 21 '15 at 00:39
  • 1
    @dave add more bits to `0x9e3779b9`. – Yakk - Adam Nevraumont Oct 21 '15 at 01:48
  • @Yakk Thanks. For anyone else listening, I doubled the binary bits of the 32bit case ( 0x9e3779b9 ) for a 64bit value of ( 0x9e3779b99e377800 ) and switch which to use by testing cpp macros __i386__ (32 bit intel) and __x86_64__ (64 bit intel) – Dave Nov 04 '15 at 00:49
  • @dave use a base 2 fractional irrational value for max entropy. – Yakk - Adam Nevraumont Nov 04 '15 at 00:52
  • @Yakk Oh! Thank you, I'd forgotten the number wasn't just any constant, which you explained so well above. :) Using your inverse of the golden ratio as a 64 bit fixed point number, I come up with this, which I'll use instead for the 64bit case: 0x9e3779b97f492000. Does it matter that this constant is even? Would it be better to add a one to the end of it? – Dave Nov 04 '15 at 03:02
  • @Dave Not sure; but it ending with 000 is suspicious; that value probably has `double` bits of precision, not 64. – Yakk - Adam Nevraumont Nov 04 '15 at 03:32
  • @Yakk I used a couple online converters to come up with the numbers (probably written in javascript), so hmm... you're right, I doubt anyone is trying to be more precise than double. I'll re-examine. Also, oops stack overflow formatting prints the macros wrong in my earlier comment. They are `__i386__` and `__x86_64__` (with leading and trailing double-underlines) – Dave Nov 04 '15 at 03:37
  • 14
    OK, to be complete... here is the full precision 64bit constant (calculated with long doubles, and unsigned long longs): 0x9e3779b97f4a7c16. Interestingly it is still even. Re-doing the same calculation using PI instead of the Golden Ratio produces: 0x517cc1b727220a95 which is odd, instead of even, thus probably "more prime" than the other constant. I used: std::cout << std::hex << (unsigned long long) ((1.0L/3.14159265358979323846264338327950288419716939937510L)*(powl(2.0L,64.0L))) << std::endl; with cout.precision( numeric_limits::max_digits10 ); Thanks again Yakk. – Dave Nov 04 '15 at 04:22
  • 3
    @Dave the inverse golden ratio rule for these cases is the first _odd_ number equal to or larger than the calculation you are doing. So just add 1. It is an important number because the sequence of N * the ratio, mod the max size (2^64 here) places the next value in the sequence exactly at that ratio in the middle of the largest 'gap' in numbers. Search the web for "Fibonacci hashing" for more info. – Scott Carey Jan 05 '17 at 23:18
  • 1
    @Dave the right number would be 0.9E3779B97F4A7C15F39... See [link](https://en.wikipedia.org/wiki/Golden_ratio). You could be suffering from the round-to-even rule (which is good for accountants), or simply, if you start with a literal sqrt(5) constant, when you subtract 1, you remove the high order bit, a bit must have been lost. – migle Jan 08 '18 at 16:52
  • Also good, but a lot more expensive, would be hash(hash(a)) + hash(b). – migle Jan 08 '18 at 16:55
  • But, wait, is it 0x0.9e377... or 0x9e377 ? Sorry getting confused since the 32bit version in the main answer uses 0x9e377... – Dave Jan 16 '18 at 20:44
  • @Dave The hash constant is a fixed-point hex decimal. The decimal isn't part of the encoding, as it is implicitly before the most significant byte of the value. This is a bit confusing as C++ has (recently?) added hex floating point literals, but prior to that a decimal point and hex values wasn't legal C++. In short, omit the decimal point. – Yakk - Adam Nevraumont Jan 16 '18 at 21:04
  • @Dave, just reading your comments after some years... I think, instead of testing the macros, it would be better to just have two overloads for `uint32_t` and `uint64_t`. – gigabytes Mar 20 '19 at 14:36
  • In the last paragraph `seed` should probably be changed with `lhs`. Great answer! – manlio Oct 28 '20 at 08:21
  • "this means it erases less information" - no. There is the same amount of information when you add two random numbers and truncate or when you xor them. Both results have maximum entropy. The rest is still true though. – Wolfgang Brehm Aug 28 '21 at 08:59
  • Except, sometimes we *want* our hash to be order agnostic, e.g., when trying to hash an unordered collection. – Peter Gerdes May 23 '22 at 10:49
  • In 2022, it might make sense to present the 64 bit variant by default, as this is the canonical answer for how to combine hashes in C++ on SO. – Baum mit Augen Nov 30 '22 at 22:18
  • we need some way to retrieve the "magic" constants so we won't need the if constexpr. Also, the choice of inverse pi on 64-bit systems seems arbitrary to me. Why not just stick to the inverse golden ratio or, alternatively, switch to the inverse pi everywhere? – user1095108 Feb 01 '23 at 13:19
141

Assuming uniformly random (1-bit) inputs, the AND function output probability distribution is 75% 0 and 25% 1. Conversely, OR is 25% 0 and 75% 1.

The XOR function is 50% 0 and 50% 1, therefore it is good for combining uniform probability distributions.

This can be seen by writing out truth tables:

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

 a | b | a OR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    1

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0
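The same percentages can be checked by brute force. This is only a sketch (the `Op` enum and `count_ones` are hypothetical names, and it needs C++14 or later): it applies each operation to every pair of 4-bit values and counts how many of the 1024 output bits are 1.

```cpp
#include <cstdint>

enum class Op { And, Or, Xor };

// Count the 1 bits produced when `op` is applied to every pair of 4-bit values.
// 16 * 16 pairs with 4 output bits each gives 1024 output bits in total.
constexpr int count_ones(Op op) {
    int ones = 0;
    for (std::uint32_t a = 0; a < 16; ++a) {
        for (std::uint32_t b = 0; b < 16; ++b) {
            const std::uint32_t r = op == Op::And ? (a & b)
                                  : op == Op::Or  ? (a | b)
                                                  : (a ^ b);
            for (int bit = 0; bit < 4; ++bit) {
                ones += static_cast<int>((r >> bit) & 1u);
            }
        }
    }
    return ones;
}

static_assert(count_ones(Op::And) == 256, "AND: 25% of the 1024 bits are 1");
static_assert(count_ones(Op::Or)  == 768, "OR: 75% of the 1024 bits are 1");
static_assert(count_ones(Op::Xor) == 512, "XOR: 50% of the 1024 bits are 1");
```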

Exercise: How many logical functions of two 1-bit inputs a and b have this uniform output distribution? Why is XOR the most suitable for the purpose stated in your question?

Greg Hewgill
  • 27
    answering to the exercise: from the 16 possible different a XXX b operations `(0, a & b, a > b, a, a < b, b, a % b, a | b, !a & !b, a == b, !b, a >= b, !a, a <= b, !a | !b, 1)`, the following have 50%-50% distributions of 0s and 1s, assuming a and b have 50%-50% distributions of 0s and 1s: `a, b, !a, !b, a % b, a == b`, i. e., the opposite of XOR (EQUIV) could have been used as well... – Massa May 04 '11 at 20:25
  • 9
    Greg, this is an awesome answer. The light bulb went on for me after I saw your original answer and wrote out my own truth tables. I considered @Massa's answer about how there are 6 suitable operations for maintaining the distribution. And while `a, b, !a, !b` will have the same distribution as their respective inputs, you lose the entropy of the other input. That is, XOR is most suitable for the purpose of combining hashes because we want to capture entropy from both a and b. – Nate Murray May 04 '11 at 21:34
  • 1
    [Here is a paper](http://crypto.stanford.edu/~dabo/abstracts/hashing.html) that explains that combining hashes securely where each function is called only once is not possible without outputting fewer bits than the sum of the number of bits in each hash value. This suggests that this answer is not correct. – Tamás Szelei Jul 23 '12 at 10:28
  • @fish: That paper describes building secure hashes from a secure/possibly-insecure pair. I saw nothing about combining two secure hashes. In any event, I think this discussion has more to do with the use of hashes in randomised algorithms (where there are numerous good tricks that will do the job) than in cryptography, where a huge amount of care must be taken to thwart cryptanalysis. – Marcelo Cantos Apr 24 '13 at 22:05
  • 3
    @Massa I've never seen % used for XOR or not equal. – Buge Aug 15 '14 at 15:33
  • @GregHewgill, I know this thread is old; trying my luck. Will XOR(`A`,`B`) generate a unique bit sequence if `A` and `B` are unique and have the same length? – mrtpk Sep 07 '16 at 07:16
  • 1
    @tpk: No, the result is not unique. There are many different ways to generate a given result R from R = A XOR B. For example, consider 0010 XOR 1100, and 1111 XOR 0001. Both give the result 1110. – Greg Hewgill Sep 07 '16 at 07:39
  • 8
    As [Yakk points out](http://stackoverflow.com/a/27952689/24874), XOR can be dangerous as it produces zero for identical values. This means `(a,a)` and `(b,b)` both produce zero, which in many (most?) cases greatly increases the likelihood of collisions in hash-based data structures. – Drew Noakes Nov 15 '16 at 12:38
  • 2
    @2943 consider XORing two bytes has 256*256 possible input values, and only 256 output values. It's not possible to come up with a unique output given two inputs, assuming all three values have the same options. – Drew Noakes Nov 15 '16 at 12:40
  • This is not really a very good answer. It addresses the matter probabilistically, without considering cross probabilities. The question was "Why is XOR the default way to combine hashes", and XOR shouldn't be the default, because there will probably be a relation between the two values (two small integers, two letters, etc). And it gets a lot worse if more than two hashes are being combined. – migle Jan 08 '18 at 16:19
  • 1
    Another way to think about this: XOR is reversible: it doesn't destroy information. You can XOR the same thing again to flip the bits back to what they were. AND and OR aren't reversible. – Peter Cordes Dec 12 '18 at 10:54
33

In spite of its handy bit-mixing properties, XOR is not a good way to combine hashes due to its commutativity. Consider what would happen if you stored the permutations of {1, 2, …, 10} in a hash table of 10-tuples.

A much better choice is m * H(A) + H(B), where m is a large odd number.

Credit: The above combiner was a tip from Bob Jenkins.
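A sketch of the permutation problem, assuming 32-bit hashes and C++14 or later. The multiplier `kMultiplier` is just one arbitrary large odd constant picked for illustration, and the fold functions are hypothetical helpers that apply the combiner element by element:

```cpp
#include <cstdint>
#include <initializer_list>

// Assumption for this sketch: any large odd constant would serve as m.
constexpr std::uint32_t kMultiplier = 0x9e3779b1u;

// XOR every element hash together.
constexpr std::uint32_t xor_fold(std::initializer_list<std::uint32_t> hashes) {
    std::uint32_t result = 0;
    for (std::uint32_t h : hashes) {
        result ^= h;
    }
    return result;
}

// Apply result = m * result + H(next) for each element in order.
constexpr std::uint32_t mul_add_fold(std::initializer_list<std::uint32_t> hashes) {
    std::uint32_t result = 0;
    for (std::uint32_t h : hashes) {
        result = kMultiplier * result + h;
    }
    return result;
}

// Every permutation of the same elements collides under XOR.
static_assert(xor_fold({1u, 2u, 3u}) == xor_fold({3u, 2u, 1u}), "xor ignores order");

// The multiply-and-add fold tells these two orderings apart.
static_assert(mul_add_fold({1u, 2u, 3u}) != mul_add_fold({3u, 2u, 1u}), "order is kept");
```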

Marcelo Cantos
  • 2
    Sometimes commutativity is a good thing, but xor is a lousy choice *even then* because all pairs of matching items will get hashed to zero. An arithmetic sum is better; the hash of a pair of matching items will retain only 31 bits of useful data rather than 32, but that's a lot better than retaining zero. Another option may be to compute the arithmetic sum as a `long` and then munge the upper portion back in with the lower portion. – supercat Oct 02 '13 at 15:09
  • 1
    `m = 3` is actually a good choice and very fast on many systems. Note that for any odd `m` integer multiplication is modulo `2^32` or `2^64` and is therefore invertible so you're not losing any bits. – StefanKarpinski Apr 21 '14 at 20:34
  • What happens when you go beyond MaxInt? – disruptive Jun 26 '14 at 11:16
  • 2
    instead of any odd number one should choose a prime – TermoTux Sep 15 '14 at 03:50
  • XOR is fine, if you are combining two _different_ quality hash functions. For example SHA1(A) XOR SipHash(B) (mashed together at equal length, of course) – Scott Carey Jan 05 '17 at 23:22
  • 2
    @Infinum that's not necessary when combining hashes. – Marcelo Cantos Jan 05 '17 at 23:28
  • Why not do `H(H(A) || H(B))` where `||` is concatenate? – Casey Rodarmor Mar 27 '19 at 09:27
  • @CaseyRodarmor you could, but stringifying and concatenating two hashes and then computing a third hash is far more expensive than a multiplication and an addition for no improvement in the quality of the hash. – Marcelo Cantos Mar 27 '19 at 19:33
18

Xor may be the "default" way to combine hashes but Greg Hewgill's answer also shows why it has its pitfalls: The xor of two identical hash values is zero. In real life, identical hashes are more common than one might expect. You might then find that in these (not so infrequent) corner cases, the resulting combined hashes are always the same (zero). Hash collisions would be much, much more frequent than you expect.

In a contrived example, you might be combining hashed passwords of users from different websites you manage. Unfortunately, a large number of users reuse their passwords, and a surprising proportion of the resulting hashes are zero!

Leo Goodstadt
8

There's something I want to explicitly point out for others who find this page. AND and OR restrict output like BlueRaja - Danny Pflughoeft is trying to point out, but this can be defined more precisely:

First I want to define two simple functions I'll use to explain this: Min() and Max().

Min(A, B) will return the value that is smaller between A and B, for example: Min(1, 5) returns 1.

Max(A, B) will return the value that is larger between A and B, for example: Max(1, 5) returns 5.

If you are given: C = A AND B

Then you can find that C <= Min(A, B). We know this because there is nothing you can AND with the 0 bits of A or B to make them 1s. So every zero bit stays a zero bit and every one bit has a chance to become a zero bit (and thus a smaller value).

With: C = A OR B

The opposite is true: C >= Max(A, B). With this, we see the corollary to the AND function. Any bit that is already a one cannot be ORed into being a zero, so it stays a one, but every zero bit has a chance to become a one, and thus a larger number.

This implies that the state of the input applies restrictions on the output. If you AND anything with 90, you know the output will be equal to or less than 90 regardless what the other value is.

For XOR, there is no implied restriction based on the inputs. There are special cases where you can find that if you XOR a byte with 255 then you get the inverse, but any possible byte can be output from that. Every bit has a chance to change state depending on the same bit in the other operand.
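These bounds can be checked exhaustively for small values. A sketch, assuming C++14 or later (the function name `bounds_hold` and the range limit are chosen only for illustration):

```cpp
#include <cstdint>

// Check C <= Min(A, B) for AND and C >= Max(A, B) for OR
// over every pair of values below `limit`.
constexpr bool bounds_hold(std::uint32_t limit) {
    for (std::uint32_t a = 0; a < limit; ++a) {
        for (std::uint32_t b = 0; b < limit; ++b) {
            const std::uint32_t lo = a < b ? a : b;
            const std::uint32_t hi = a < b ? b : a;
            if ((a & b) > lo) return false;
            if ((a | b) < hi) return false;
        }
    }
    return true;
}

static_assert(bounds_hold(64), "AND and OR are bounded by their inputs");

// XOR has no such bound: the result can land below, between, or above the inputs.
static_assert((5u ^ 4u) == 1u, "below both inputs");
static_assert((5u ^ 1u) == 4u, "between the inputs");
static_assert((5u ^ 2u) == 7u, "above both inputs");
```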

Corey Ogburn
4

If you XOR a random input with a biased input, the output is random. The same is not true for AND or OR. Example:

00101001 XOR 00000000 = 00101001
00101001 AND 00000000 = 00000000
00101001 OR  11111111 = 11111111

As @Greg Hewgill mentions, even if both inputs are random, using AND or OR will result in biased output.

The reason we use XOR over something more complex is that, well, there's no need: XOR works perfectly, and it's blazingly stupid-fast.

BlueRaja - Danny Pflughoeft
3

Cover the left 2 columns and try to work out what the inputs are using just the output.

 a | b | a AND b
---+---+--------
 0 | 0 |    0
 0 | 1 |    0
 1 | 0 |    0
 1 | 1 |    1

When you saw a 1-bit you should have worked out that both inputs were 1.

Now do the same for XOR

 a | b | a XOR b
---+---+--------
 0 | 0 |    0
 0 | 1 |    1
 1 | 0 |    1
 1 | 1 |    0

XOR gives away nothing about its inputs.

Robert
1

XOR does not ignore some of the inputs sometimes like OR and AND.

If you take AND(X, Y) for example, and feed input X with false, then the input Y does not matter...and one probably would want the input to matter when combining hashes.

If you take XOR(X, Y) then BOTH inputs ALWAYS matter. There would be no value of X where Y does not matter. If either X or Y is changed then the output will reflect that.
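One way to make this concrete is to flip some bits of Y and see whether the output notices. A sketch with hypothetical helper names, where `mask` selects the bits of `y` to flip and is assumed to be non-zero:

```cpp
#include <cstdint>

// Each helper reports whether flipping the `mask` bits of y changes the output.
constexpr bool and_changes(std::uint32_t x, std::uint32_t y, std::uint32_t mask) {
    return (x & y) != (x & (y ^ mask));
}

constexpr bool or_changes(std::uint32_t x, std::uint32_t y, std::uint32_t mask) {
    return (x | y) != (x | (y ^ mask));
}

constexpr bool xor_changes(std::uint32_t x, std::uint32_t y, std::uint32_t mask) {
    return (x ^ y) != (x ^ (y ^ mask));
}

// With X all zeros, AND ignores Y; with X all ones, OR ignores Y.
static_assert(!and_changes(0u, 0x1234u, 0x00FFu), "AND masks the change");
static_assert(!or_changes(0xFFFFFFFFu, 0x1234u, 0x00FFu), "OR masks the change");

// XOR reflects the change for either of those X values.
static_assert(xor_changes(0u, 0x1234u, 0x00FFu), "XOR sees the change");
static_assert(xor_changes(0xFFFFFFFFu, 0x1234u, 0x00FFu), "XOR sees the change");

// In fact the output changes in exactly the bits that were flipped.
static_assert(((0xABCDu ^ 0x1234u) ^ (0xABCDu ^ (0x1234u ^ 0x00FFu))) == 0x00FFu,
              "the difference equals the mask");
```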

SunsetQuest
0

The source code for various versions of hashCode() in java.util.Arrays is a great reference for solid, general use hashing algorithms. They are easily understood and translated into other programming languages.

Roughly speaking, most multi-attribute hashCode() implementations follow this pattern:

public static int hashCode(Object a[]) {
    if (a == null)
        return 0;

    int result = 1;

    for (Object element : a)
        result = 31 * result + (element == null ? 0 : element.hashCode());

    return result;
}

You can search other StackOverflow Q&As for more information about the magic behind 31, and why Java code uses it so frequently. It is imperfect, but has very good general performance characteristics.
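For readers translating the pattern, here is a rough C++ sketch of the same accumulation (not Java's actual implementation). It assumes the element hashes are already available as 32-bit values, relies on unsigned arithmetic wrapping modulo 2^32, needs C++14 or later, and uses the hypothetical name `hash31`:

```cpp
#include <cstdint>
#include <initializer_list>

// result = 31 * result + element hash, starting from 1 as in the Java code above.
constexpr std::uint32_t hash31(std::initializer_list<std::uint32_t> element_hashes) {
    std::uint32_t result = 1;
    for (std::uint32_t h : element_hashes) {
        result = 31u * result + h;
    }
    return result;
}

// The accumulation is order-sensitive: 994 versus 1024 for these two orderings.
static_assert(hash31({1u, 2u}) != hash31({2u, 1u}), "order matters");

// It is also imperfect: 65, 97 and 66, 66 (the code points of "Aa" and "BB")
// produce the same value, because 31 * 65 + 97 == 31 * 66 + 66.
static_assert(hash31({65u, 97u}) == hash31({66u, 66u}), "a known small collision");
```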

kevinarpe
  • 2
    Java's default "multiply by 31 and add / accumulate" hash is loaded with collisions (e.g. any `string` collides with `string + "AA"` IIRC) and they long ago wished they had not baked that algorithm into the spec. That said, using a larger odd number with more bits set, and adding shifts or rotations, fixes that problem. MurmurHash3's 'mix' does this. – Scott Carey Jan 05 '17 at 23:27