4

The C code of MurmurHash3 has this part:

  uint64_t k1 = 0;
  uint64_t k2 = 0;

  switch(len & 15)
  {
  case 15: k2 ^= ((uint64_t)tail[14]) << 48;
  case 14: k2 ^= ((uint64_t)tail[13]) << 40;
  case 13: k2 ^= ((uint64_t)tail[12]) << 32;
  case 12: k2 ^= ((uint64_t)tail[11]) << 24;
  case 11: k2 ^= ((uint64_t)tail[10]) << 16;
  case 10: k2 ^= ((uint64_t)tail[ 9]) << 8;
  case  9: k2 ^= ((uint64_t)tail[ 8]) << 0;

(The type of tail is uint8_t *)

As far as I can see it's no different than an OR operation. What difference does it make to use XOR here? Is it optimization? If it is, what kind of it is? Or am I missing something about behavior differences of those two operators?

I already know about the differences between XOR and OR. But in this case since the value is zeroed out at the beginning and xored values do not overlap, the behavior shouldn't be any different than OR at all. So I'm asking why author chose this over OR (which conveys its intent better than XOR imho).

Sedat Kapanoglu
  • 46,641
  • 25
  • 114
  • 148
  • Why would you use OR over XOR? –  Dec 05 '18 at 04:29
  • @Ivan to convey the intent better? XOR automatically implies that there is a specific behavior, like overlapping bit flips, needed here which isn't the case, probably. I'm still not sure though. – Sedat Kapanoglu Dec 05 '18 at 04:31
  • I don't think there's any performance difference between AND, XOR and OR, according to some references they all use 2 CPU operation cycles. I suppose the programmer chose to use XOR simply because `^` can be typed with a single hand and is easier to read. – Havenard Dec 05 '18 at 04:31
  • 1
    You are missing the fallthrough from one case to the next. This looks like a form of [Duff's Device](https://en.wikipedia.org/wiki/Duff%27s_device). – John Bollinger Dec 05 '18 at 04:31
  • 1
    @JohnBollinger I'm not missing it. But if you look at the indices and the shift levels , bytes don't overlap so XOR doesn't do its magic there and just works as an OR. – Sedat Kapanoglu Dec 05 '18 at 04:32
  • @TessellatingHeckler no it's just a single pass. code is not edited. you can see the full code here: https://github.com/aappleby/smhasher/blob/master/src/MurmurHash3.cpp – Sedat Kapanoglu Dec 05 '18 at 04:33
  • I highly doubt there is no difference between using `xor` and `or` just because you started with zero. Can you show the contents of the `tail` array? – smac89 Dec 05 '18 at 04:33
  • @SedatKapanoglu Depends on how you look at it. If author "thinks" in terms on XOR function, then XOR "conveys intent better". –  Dec 05 '18 at 04:34
  • @smac89 It's an array of `uint8_t`, that's all we need to know. – Havenard Dec 05 '18 at 04:34
  • @smac89 the contents don't matter because whatever the contents are, the individual bytes are laid out over k2 without any overlaps. I provided a link to the full code two comments above if you're curious. – Sedat Kapanoglu Dec 05 '18 at 04:35
  • XOR is supposedly faster because of circuit implementation of gates, but it's quite minor and unimportant. – SwiftMango Dec 05 '18 at 04:36
  • @Ivan I understand your point but what we're looking at is simply building an unsigned integer out of variable number of bytes. There is no secret algorithm or some mysterious thing. There is no reason for author to think about XOR here, unless I'm missing something of course. – Sedat Kapanoglu Dec 05 '18 at 04:37
  • 3
    @texasbruce The CPU is paced by a clock though, so the speed of the gate is of no consequence to the performance. – Havenard Dec 05 '18 at 04:37
  • Following on @texasbruce comment, there is a post on SO that goes more into this here: https://stackoverflow.com/questions/2308293/efficiency-of-bitwise-xor-in-c-in-comparison-to-more-readable-methods – cirrusio Dec 05 '18 at 04:37
  • 4
    Maybe it was designed to overlap before then the author changed it somehow. Anyway, `xor` feels more "cryptologic" – ZisIsNotZis Dec 05 '18 at 04:38
  • @ZisIsNotZis this is the most plausible answer I think. author probably used the notion of xors everywhere and didn't bother to change his approach here as the result isn't any different. – Sedat Kapanoglu Dec 05 '18 at 04:39
  • I'm talking about hardware circuit with transistors, not software. XOR gate uses less transistors and supposedly respond faster to a voltage change – SwiftMango Dec 05 '18 at 04:41
  • 2
    Well if you are saying that the author should have used `OR`, we might as well say that the author should have just used `ADD` then – smac89 Dec 05 '18 at 04:42
  • @texasbruce The CPU is still clocked at the hardware level. OR or XOR, the result will be available at the clock edge, not earlier. – DYZ Dec 05 '18 at 04:43
  • 1
    @smac89 ADD implies an arithmetic operation rather than bitwise. The intent here is clearly bitwise (overlaying bytes at correct bit positions to get an integer). OR is the best candidate for what's done here. – Sedat Kapanoglu Dec 05 '18 at 04:46
  • Different binary operation speed is different, which means the number of operations per CPU clock cycle is different. Also the latency is different – SwiftMango Dec 05 '18 at 04:52
  • Is this the true code? I'd expect `break;` on each line. – chux - Reinstate Monica Dec 05 '18 at 05:06
  • @chux yes, it's legit. when you expect rest of the statements to run after the specific case, you can use this kind of construct. – Sedat Kapanoglu Dec 05 '18 at 05:10
  • @Of course it is legit, yet still unusual. And you are looking for for reasons why code is the way it is, so unusual things deserve vetting. Like why `uint64_t k1 = 0;`? – chux - Reinstate Monica Dec 05 '18 at 05:40
  • @chux he's practically building an integer out of variable number of bytes. the only alternative here is to use a loop. he might have thought this is as a performance optimization but i'm not sure if a loop would suffer terribly. i did my own implementation with a loop: https://github.com/ssg/HashDepot/blob/master/src/MurmurHash3.cs#L121 – Sedat Kapanoglu Dec 05 '18 at 05:42

2 Answers2

1

Yes, in this case they are completely equivalent. Furthermore, since they ARE equivalent, the compiler may use this for optimization on its own. When you compile, you will have no guarantee that it will actually be or xor xor. Actually, on a more general level, you have no guarantee that it will be any of them, as long as the compiler produces code whose observable behavior is identical.

A reasonable reason for using xor is that it was the first thing that came to mind for the programmer in question, or that the code originally was written in a way where it mattered but later was changed into a version where it did not matter. But since they are equivalent in this case it is very hard to know.

klutt
  • 30,332
  • 17
  • 55
  • 95
  • 1
    The code just converts variable number of bytes into an integer. This has been done with OR for centuries now. I see no reason for author to think XOR first, unless he's been dealing with a lot of XORs lately. – Sedat Kapanoglu Dec 05 '18 at 04:40
  • 1
    @SedatKapanoglu Which could be the case, and because it is orthodox it gets attention. – klutt Dec 05 '18 at 04:48
  • I certainly don't disregard that possibility but just wanted to know if I'm missing something in terms of performance or behavior. Apparently, there isn't any. – Sedat Kapanoglu Dec 05 '18 at 04:50
  • 1
    @SedatKapanoglu Reworded it – klutt Dec 05 '18 at 04:51
  • I still think it's just because `^` is both easier to type and easier to read. `|` has a bunch of ambiguous-looking symbols and you typically need two hands to type it (it's actually a pain to find it in some keyboards). Also `^` makes people come to SO wondering if they are crazy, which is always a bonus. – Havenard Dec 05 '18 at 04:57
  • @Havenard interesting, two more people said the same. but i type ^ with two hands (right shift + 6) because it's quite far from left shift and i have to strecth my hand. on the other hand, i type pipe with a single hand (right shift + \\). – Sedat Kapanoglu Dec 05 '18 at 05:03
  • @SedatKapanoglu Oh, yeah I forgot that detail. I use an ABNT2 keyboard which has a [slightly different layout](https://www.smartkeyboardsolutions.com/images/Brazil-10063-zoom.gif). Though I'm pretty sure US keyboards suffer from lack of `|` too, specially on laptops. In fact I saw a keyboard once that simply didn't have this symbol anywhere. – Havenard Dec 05 '18 at 05:07
-1

Why use XOR over OR?

When one can use | or ^ in this restrictive code to get and archive the same functionality, then the preferred one should reflect larger issues.

^ retains entropy @Nominal Animal.

When code is seeking to form a hash (as here with MurmurHash3), ^ is better then |. ^ flips bits resulting, in general, to a fair distribution of 1s and 0s. | biases to making 1s.

Many hash algorithm will "add" a and b as with binary addition with no carries, which is to say a ^ b and not a | b. So in this hash algorithm context, ^ conveys the better algorithm intent.


Form time to time I have come across hashing code that does use | and unfortunately results in a biased results, whereas ^ would have worked fine. IMO, a | in hash code is a red flag that biasing is possible.

chux - Reinstate Monica
  • 143,097
  • 13
  • 135
  • 256
  • 1
    In hash algorithms, the core difference between 'or' and 'xor' is that 'xor' operation retains the entropy ("randomness") in the arguments, while 'or' does not. 'xor' of different uniform random sources is still uniform random, but 'or' is not, as the binary result will have more ones than zeroes ("biased", as chux described). – Nominal Animal Dec 05 '18 at 08:31
  • In this case, there is no loss of entropy from use of OR because bit placements don't overlap. That part of code isn't cryptography related either; it's just a code that converts variable number of bytes to an integer. – Sedat Kapanoglu Dec 05 '18 at 20:39
  • @SedatKapanoglu Yes we all agree no entropy with `^` or `|` _in this code_. But the question is "Why use XOR over OR". `^` is more often used the `|` with hash functions that would otherwise have an entropy loss. Using `^` is therefore the more conventional notation with hashing and `|` would serve to bring about concerns. Even if you find `|` "conveys its intent better" than `^`, it is the opposite with hash algorithms: `^` conveys its intent better then `|`. – chux - Reinstate Monica Dec 05 '18 at 21:14
  • @chux I understand where you're coming from but my question is specifically about when OR and XOR are interchangeable, like in the example. The entropy case you mentioned means they are not interchangeable. – Sedat Kapanoglu Dec 05 '18 at 22:04