
Recently I came across some code that computes the larger of two numbers using XOR. While this looks nifty, the same thing can be achieved with a simple ternary operator or an if/else. Not pertaining to just this example: do bitwise operations have any advantage over normal code? If so, is the advantage in speed of computation or in memory usage? I am assuming that for bitwise operations the assembly code will look much simpler than for normal code. On a related note, when programming embedded systems, which is more efficient?

*Normal code refers to how you'd normally do it. For example, `a * 2` is normal code, and I can achieve the same thing with `a << 1`.*

too honest for this site
Mathews_M_J

4 Answers


Bitwise operators generally have the advantage of being constant time, regardless of input values. Conditional moves and branches may be the target of timing attacks in certain applications, such as crypto libraries, while bitwise operations are not subject to such attacks. (Disregarding cache timing attacks, etc.)

Generally, on a processor capable of pipelining, it can be more efficient to use bitwise operations than conditional moves or branches, since they bypass the branch prediction problem entirely. This may or may not speed up your resulting code.

You do have to be careful, though, as some operations constitute undefined behavior in C, such as left-shifting a negative signed integer or shifting by the width of the type or more. For this reason, it may be to your advantage to do things the "normal" way.

owacoder

do bitwise operations have any advantage over normal code?

Bitwise operations are normal code. Most compilers these days have optimizers that generate the same instruction for `a << 1` as for `a * 2`. On some hardware, especially on low-powered microprocessors, shift operations take fewer CPU cycles than multiplication, but there is hardware on which this makes no difference.

In your specific case there is an advantage, though: the code with XOR avoids branching, which can greatly speed up the code. When there is no branching, the CPU can keep its pipeline full and perform the same operations much faster.

when programming embedded systems which is more efficient?

Embedded systems often have less powerful CPUs, so bitwise operations do have an advantage there. For example, on the 68HC11 CPU, multiplication takes 10 cycles while shifting left takes only 3.

Note, however, that this does not mean you should use bitwise operations explicitly. Most compilers, including embedded ones, will convert multiplication by a constant into a sequence of shifts and additions when that saves CPU cycles.

Sergey Kalinichenko
  • Almost everyone said this 'avoids branching' and so allows 'pipelining'. This might sound like a stupid question since my knowledge of pipelining is borderline non-existent, but what does branching have to do with it? Does the processor have to wait until the branch is resolved before it can get to the next instruction? Is that why? – Mathews_M_J Sep 24 '15 at 03:33
  • @Mathews_M_J A single CPU has several "stages" in a pipeline: loading the instruction from memory, interpreting the instruction, loading operands, performing the operation, saving the result, etc. The pipeline lets the CPU execute several stages at the same time, speeding up computation by the number of "loaded" stages: instruction A's loading stage can proceed in parallel with the interpretation stage for instruction B, and also with the processing of instruction C. – Sergey Kalinichenko Sep 24 '15 at 03:43
  • @Mathews_M_J Branching on a condition (e.g. what happens when you write a ternary operator) breaks this, because the CPU does not know which instruction to load/interpret next. It must wait for the branch condition to finish processing before deciding what to read next. [Branch prediction](http://stackoverflow.com/q/11227809/335858) helps a lot, but it's better not to branch at all. – Sergey Kalinichenko Sep 24 '15 at 03:48
  • @Mathews_M_J And note that the execution stage is usually not the longest in the pipeline. For modern RISC CPUs most instructions are single cycle, a few are 2-4 cycles, and very few exceed 10 cycles. However, instruction fetching and decoding easily exceeds 5 cycles. Those are the cycles spent waiting before the branched-to code can execute, a.k.a. the branch penalty. The deeper the pipeline, the worse it is. – user3528438 Sep 24 '15 at 03:50
  • This actually leads to an interesting observation: removing branching only benefits half-way-smart CPUs. CPUs as simple as the MSP430 have such a shallow pipeline that branching carries no penalty. Ones as smart as modern x86 have prediction good enough to eliminate the need, too. And among the half-way-smart ones like ARM, conditional execution removes the branches for you automatically. Taking today's ultra-smart compilers into account, one should really think twice before using those old-time tricks. – user3528438 Sep 24 '15 at 03:58
  • @user3528438: On the MSP430, a taken branch still has a penalty. For an `if ... else`, you have two jumps in the `if` branch, not to mention the branch instructions themselves, which each occupy 2 bytes of code space. These problems are the reason some CPUs have conditional instructions. – too honest for this site Sep 24 '15 at 10:47
  • The 68HC11 can only shift one bit at a time, so once you shift more than 3 bits, the multiplication is faster. Worse, the multiplication uses the B register for the other half of the result, so any value there has to be saved/restored. – too honest for this site Sep 24 '15 at 10:49

On some platforms branches are expensive, so finding a way to get min(x,y) without branching has some merit. I think this is particularly useful in CUDA, where the hardware pipelines are long.

Of course, on other platforms (like ARM) with conditional execution and compilers that emit those opcodes, it boils down to a compare and a conditional move (two instructions) with no pipeline bubble. That is almost certainly better than a compare plus a few logical operations.

Russ Schultz

Since the question carries the embedded tag, I will focus primarily on that in my answer.

In short: usually you shouldn't try to be "creative" with your coding, since it becomes harder to understand later! (As the old saying goes, "premature optimization is the root of all evil.")

So only use such tricks when you know precisely what you are doing; in any other case, write the most understandable C code.

That was the general part; now let's look at what such tricks can do and how they affect execution time.

  • First, in embedded work it is good to check the disassembly listing. If you use a variant of GCC with -O2 optimization, you can usually assume it is quite clever at understanding what the code is meant to do, and it will likely produce a good result. It can even apply such tricks by itself if it "sees" that they will be faster on the target CPU, so you don't need to ruin the readability of your code with them. With other compilers results may vary; when in doubt, inspect the assembly listing to see whether execution time could be improved with such bit hacks.

  • On the usual embedded platform, especially at 8 bits, you don't need to care much about the pipeline (or, relatedly, branch mispredictions), since it is short or nonexistent. So you usually gain nothing by eliminating a conditional at the cost of an arithmetic operation, and an elaborate hack can actually hurt performance.

  • On faster 32-bit CPUs there is usually a longer pipeline and a branch predictor to avoid flushes (which cost many cycles), so eliminating conditionals may pay off. But only if they are of such a nature that the branch predictor cannot guess them (such as comparisons on "random" data); otherwise the conditional may still be better, taking minimal time (a single cycle, or even "less" if the CPU can process more than one operation per cycle) when predicted correctly.

Jubatian