
It is my understanding that XOR messes with branch prediction.

Is it preferable to remove bits via subtraction or via xor for operations that will run a great many times?

// For an operation that will run several million times ...
int encoding = 180391281; // could be anything but we'll pick a number ...
#define REMOVE_BITS (128+64)

int bits_to_remove = encoding & REMOVE_BITS;
int encoding_with_bits_removed_xor_method      = encoding ^ /*XOR*/ bits_to_remove; // BEST?
int encoding_with_bits_removed_subtract_method = encoding - /*SUB*/ bits_to_remove; // BEST?
Peter Cordes
B. Nadolson
  • It's unlikely there is any difference in performance. Both operations map to a simple operation on CPU registers – tstanisl May 03 '21 at 06:46
  • Try both and benchmark – klutt May 03 '21 at 06:46
  • 1
    Build with optimizations enabled, and compare the generated assembly code. – Some programmer dude May 03 '21 at 06:47
  • 6
    But xor and subtraction is NOT a proper way to unset bits. Apply ANDing with negated mask. – tstanisl May 03 '21 at 06:47
  • @tstanisl - thanks. @others - I'm trying to gain an understanding of the appropriate/best method for this; benchmarking would work, but I would gain no understanding of why. – B. Nadolson May 03 '21 at 06:52
  • @Some programmer dude - looking at the asm doesn't tell me whether XOR interferes with processor-level branch prediction, only compiler branch prediction. – B. Nadolson May 03 '21 at 06:55
  • 2
    As a couple of general tips about hand-optimizing code: Micro-optimizations like these are almost never worth it. And *always* benchmark and measure to find the top *two* bottlenecks in your program, and concentrate your efforts on those parts only. Lastly, hand-optimizing code will usually make the code very complicated, hard to read, understand and maintain, so always include plenty of documentation and comments about the code, and why it does what it does. – Some programmer dude May 03 '21 at 06:57
  • 2
    @B.Nadolson "*It is my understanding that XOR messes with branch prediction.*" Quotation? – dxiv May 03 '21 at 07:41
  • @dxiv: Yeah, that seems *incredibly* unlikely. I wonder if the querent is thinking of x86 [macro-fusion of `sub/jcc` into a single sub-and-branch uop](https://stackoverflow.com/a/31778403/224132) being possible on Intel Sandybridge-family, but `xor/jcc` not being fusable. Nothing to do with *prediction*, instead with front-end bandwidth. And totally irrelevant if you aren't branching on the result. – Peter Cordes May 03 '21 at 08:29
  • @B.Nadolson: What ISA (and microarchitectures) are you tuning for? What happens to the result? Are you branching on it, storing it to memory, ANDing it with something else...? Are you sure these bits will always be set in the integers you want to remove them from, so it's safe to actually toggle them to 0, rather than unconditionally force them to 0 with `&`? – Peter Cordes May 03 '21 at 08:36
  • @dxiv I don't have the quotation. I'm quite sure I read it on a technical site, but I don't remember the context and can't find it again; I searched for it before posting the question. So I am an "unreliable narrator" here, unfortunately. Peter - I'm not tuning for any particular architecture. I have a rather huge source file that works with encodings, and I'm trying to get it close to ideal the first time so I don't need to revisit it. – B. Nadolson May 03 '21 at 12:13

1 Answer


If you dig into digital electronic circuits: for bitwise boolean operations (AND, OR, XOR, XNOR) each bit can be computed in parallel, while for addition/subtraction they can't be (the carry/borrow out of the lowest pair of bits influences the result for the next higher pair of bits, which ...).

However, CPUs are complex, and to keep all the pieces synchronized everything typically (excluding extremely rare asynchronous/clockless designs like GreenArrays' chips) gets rounded up to the nearest clock cycle; so an XOR that could be done in 0.5 cycles will cost 1 whole cycle, and a SUB that could be done in 0.9 cycles will cost the same 1 whole cycle.

In other words, in general it's extremely unlikely that there will be any difference in performance between XOR (or AND) and SUB (but I'd still prefer XOR or AND because it's "simpler for the CPU in theory" and might consume slightly less power).

A more important consideration (especially for high-level languages, where the compiler is supposed to do micro-optimizations for you) is source code readability. For bit removal, AND is common practice and SUB is unusual (more likely to confuse the reader briefly). XOR is reasonable if it helps you avoid a NOT (e.g. `a = b ^ c` isn't harder to understand than `a = b & (~c)`); but you may be able to invert the terminology (e.g. replace "bits to remove" with "bit mask") and end up with `a = b & c` without the NOT, and if you can do that it's likely to improve readability.

Brendan
  • 1
    For removing bits you already know are set in some value, there are advantage to `-`: the compiler can use `lea reg, [reg - (128+64)]` to copy-and-subtract a constant, if it wants to keep the original around to simply end up with the result in a different register. Also, if you want to branch on the result being 0 or something, `sub/jnz` can [macro-fuse](https://stackoverflow.com/a/31778403/224132) into a single sub-and-branch uop on Intel. (But so can AND, unlike XOR). Fusion is my only guess at what the querent might be mixing up with "branch prediction". – Peter Cordes May 03 '21 at 08:34