46

Does using bitwise operations in normal flow or conditional statements like `for`, `if`, and so on increase overall performance, and would it be better to use them where possible? For example:

if(i++ & 1) {

}

vs.

if(i % 2) {

}
LJ Germain
  • 467
  • 1
  • 6
  • 14
Maven
  • 14,587
  • 42
  • 113
  • 174
  • 11
    This is a trick that everyone knows, including the compiler. In other cases, using bitwise operations can help. It depends on what the alternative is. – harold Dec 05 '13 at 07:31
  • 10
    Why does the first condition include a postfix increment of `i` while the second one doesn't? It is rather meaningless to compare two pieces of code for performance when they don't do the same thing. – AnT stands with Russia Dec 05 '13 at 07:38
  • Integral division/modulo involving at least one compile-time constant will be optimised by the compiler into a series of multiplications and bit shifts. – Simple Dec 05 '13 at 08:11
  • 2
    Personally I like (i & 0xFF) vs (i % 0x100). In base ten if I said "what are the right two digits of your birth year" you wouldn't think, "oh, the telephone operator wants to know my birth year modulo 100..." You'd just spit out the digits and forget about it. Why do people think mod is easier to read? Every time I see it I have to check if the variable is signed. It's not easier to read. Using (&) is not premature optimization at all. – Samuel Danielson Jan 21 '16 at 02:44

8 Answers

62

Unless you're using an ancient compiler, it can already handle this level of conversion on its own. That is to say, a modern compiler can and will implement i % 2 using a bitwise AND instruction, provided it makes sense to do so on the target CPU (which, in fairness, it usually will).

In other words, don't expect to see any difference in performance between these, at least with a reasonably modern compiler with a reasonably competent optimizer. In this case, "reasonably" has a pretty broad definition too--even quite a few compilers that are decades old can handle this sort of micro-optimization with no difficulty at all.
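
If you want to verify this for yourself, here is a minimal sketch (the function names are my own, purely for illustration). Compiling it with optimization enabled, e.g. g++ -O2 -S, or pasting it into a compiler explorer, will typically show the same AND instruction emitted for both functions:

#include <cstdio>

// Both functions test the low bit of an unsigned value; a modern
// optimizer typically compiles them to identical machine code.
bool is_odd_mod(unsigned i) { return i % 2 != 0; }
bool is_odd_and(unsigned i) { return (i & 1) != 0; }

int main() {
    for (unsigned i = 0; i < 4; ++i)
        std::printf("%u: mod=%d and=%d\n", i, is_odd_mod(i), is_odd_and(i));
    return 0;
}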

Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
  • 13
    If all values are signed integer types, the expression `a=b & 1;` will on every standards-compliant implementation I know evaluate faster than `a=b % 2;`, since the latter expression is equivalent to `a= b < 0 ? -(b & 1) : b & 1;`. If the only thing done with the result is testing for zero, an optimizer may be able to recognize that the `b<0` and `b>=0` cases are equivalent, but I wouldn't particularly expect that optimization. – supercat Dec 13 '13 at 22:00
  • 1
    @supercat There is a way to avoid branching here, just add the sign bit to the mask: `a = b & 0b1000....0001` – potato May 21 '20 at 10:06
  • 1
    @potato: That wouldn't yield correct results on a two's-complement machine. There are branchless ways to achieve the result, and compilers in fact generate them, but *they're more complicated than the code for `b & 1`.* – supercat May 21 '20 at 14:27
43

TL;DR Write for semantics first, optimize measured hot-spots second.

At the CPU level, integer modulus and division are among the slowest operations. But you are not writing at the CPU level; instead, you write in C++, which your compiler translates to an Intermediate Representation, which is finally translated into assembly according to the model of CPU for which you are compiling.

In this process, the compiler will apply Peephole Optimizations, among them Strength Reduction Optimizations such as these (courtesy of Wikipedia):

Original Calculation  Replacement Calculation
y = x / 8             y = x >> 3
y = x * 64            y = x << 6
y = x * 2             y = x << 1
y = x * 15            y = (x << 4) - x

The last example is perhaps the most interesting one. Whilst multiplying or dividing by powers of 2 is easily converted (manually) into bit-shift operations, the compiler is generally taught to perform even smarter transformations that you would probably not think of on your own and which are not as easily recognized (at the very least, I do not personally immediately recognize that (x << 4) - x means x * 15).
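
As a quick sanity check of that last identity, here is a small sketch (mine, not the answer's) that exercises x * 15 == (x << 4) - x over a range of values:

#include <cassert>
#include <cstdint>

int main() {
    // x * 15 == x * 16 - x == (x << 4) - x
    for (std::uint32_t x = 0; x < 100000; ++x)
        assert(x * 15 == (x << 4) - x);
    return 0;
}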

Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • 4
    Explanation of the last example: `x * 15` == `x * (16 - 1)` == `(x * 16) - (x * 1)` == `(x << 4) - x`. – Yay295 Aug 22 '22 at 19:21
16

This is obviously CPU dependent, but you can expect that bitwise operations will never take more, and typically take fewer, CPU cycles to complete. In general, integer / and % are famously slow, as CPU instructions go. That said, with modern CPU pipelines, having a specific instruction complete earlier doesn't mean your program necessarily runs faster.

Best practice is to write code that's understandable, maintainable, and expressive of the logic it implements. It's extremely rare that this kind of micro-optimisation makes a tangible difference, so it should only be used if profiling has indicated a critical bottleneck and this is proven to make a significant difference. Moreover, if on some specific platform it did make a significant difference, your compiler optimiser may already be substituting a bitwise operation when it can see that's equivalent (this usually requires that you're /-ing or %-ing by a constant).

For whatever it's worth, on x86 specifically - when the divisor is a runtime-variable value and so can't be trivially optimised into e.g. bit-shifts or bitwise-ANDs - the time taken by / and % operations in CPU cycles can be looked up in Agner Fog's instruction tables. There are too many x86-compatible chips to list here, but as an arbitrary example of recent CPUs - if we take Agner's "Sunny Cove (Ice Lake)" (i.e. 10th gen Intel Core) data - DIV and IDIV instructions have a latency between 12 and 19 cycles, whereas bitwise-AND has 1 cycle. On many older CPUs DIV can be 40-60x worse.
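
To illustrate that constant-vs-runtime distinction, a small sketch (the identifiers are mine): with a compile-time-constant power-of-two divisor the optimizer can strength-reduce the %, but with a divisor known only at runtime it generally has to emit a real division instruction.

// Usually compiled to a single AND (equivalent to x & 15) at -O2:
unsigned mod_const(unsigned x) { return x % 16; }

// Usually compiled to an actual DIV, since the divisor
// is not known until runtime:
unsigned mod_runtime(unsigned x, unsigned d) { return x % d; }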

Tony Delroy
  • 102,968
  • 15
  • 177
  • 252
  • 4
    +1 for stressing readability; usually optimization comes in the last phase and at best we can try not to make our code extremely alien. – legends2k Dec 05 '13 at 07:04
5

By default you should use the operation that best expresses your intended meaning, because you should optimize for readable code. (Today most of the time the scarcest resource is the human programmer.)

So use & if you extract bits, and use % if you test for divisibility, i.e. whether the value is even or odd.

For unsigned values both operations have exactly the same effect, and your compiler should be smart enough to replace the division by the corresponding bit operation. If you are worried you can check the assembly code it generates.

Unfortunately integer division is slightly irregular on signed values, as it rounds towards zero and the result of % changes sign depending on the first operand. Bit operations, on the other hand, always round down. So the compiler cannot just replace the division with a simple bit operation. Instead it may either call a routine for integer division, or replace it with bit operations plus additional logic to handle the irregularity. This may depend on the optimization level and on which of the operands are constants.
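
A small demonstration of the mismatch on negative operands (a sketch of mine, assuming two's-complement representation, which C++20 in fact guarantees):

#include <cstdio>

int main() {
    std::printf("-3 %% 2 = %d\n", -3 % 2); // -1: '%' rounds toward zero
    std::printf("-3 & 1 = %d\n", -3 & 1);  //  1: '&' just keeps the low bit
    return 0;
}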

This irregularity at zero may even be a bad thing, because it is a nonlinearity. For example, I recently had a case where we used division on signed values from an ADC, which had to be very fast on an ARM Cortex M0. In this case it was better to replace it with a right shift, both for performance and to get rid of the nonlinearity.

starblue
  • 55,348
  • 14
  • 97
  • 151
  • 1
    I wonder how much existing code relies upon which aspects of the signed-number behavior of `/` and `%`? The only use I know of for negative results from `%` is for code that wants to subtract one from the result of `/` when `%` yields a negative number. In multiple decades of programming I think I've once encountered a case where the symmetry of truncate-toward-zero was actually useful, and many where periodic behavior would have been better. What's especially ironic is that ANSI started mandating truncate-toward-zero around the time when... – supercat Dec 13 '13 at 22:05
  • ...it probably ceased being the faster behavior in the majority of common cases (since newer processors can perform long multiplications quickly, and floored division by constants can be evaluated using long multiplication more efficiently than truncated division). – supercat Dec 13 '13 at 22:06
  • It is unfortunate that, though this is the only correct answer, it appears at the very bottom of the page. Too bad my single +1 won't solve that problem. The "always trust your compiler to do the most optimal thing" mantra is not altogether wrong, but it can be very misleading when followed blindly. – Cody Gray - on strike Jun 01 '16 at 09:25
3

C operators cannot be meaningfully compared in terms of "performance". There's no such thing as "faster" or "slower" operators at the language level. Only the resultant compiled machine code can be analyzed for performance. In your specific example the resultant machine code will normally be exactly the same (if we ignore the fact that the first condition includes a postfix increment for some reason), meaning that there won't be any difference in performance whatsoever.

AnT stands with Russia
  • 312,472
  • 42
  • 525
  • 765
1

Here is the optimized (-O3) code generated by GCC 4.6 for both options:

int i = 34567;
int opt1 = i++ & 1;
int opt2 = i % 2;

Generated code for opt1:

l     %r1,520(%r11)
nilf  %r1,1
st    %r1,516(%r11)
asi   520(%r11),1

Generated code for opt2:

l     %r1,520(%r11)
nilf  %r1,2147483649
ltr   %r1,%r1
jhe  .L14
ahi   %r1,-1
oilf  %r1,4294967294
ahi   %r1,1
.L14: st %r1,512(%r11)

So 4 extra instructions... which are nothing in a production environment. Applying the bitwise trick by hand would be a premature optimization and would just introduce complexity.

  • Which architecture is this? And did you compile with optimizations on? – phuclv Jan 03 '18 at 01:56
  • This was generated on IBM z/OS using the GCC 4.6 compiler, at the -O3 optimization level. – user9164692 Jan 04 '18 at 18:29
  • 2
    I have no idea what those instructions are. They're nowhere near as common as x86 or even PowerPC, and the mnemonics aren't even readable. Besides, counting the number of instructions is not a good metric, because a snippet with more instructions might run faster than a shorter one. What matters is how heavy an instruction is, e.g. how `div`'s throughput and latency compare to shifts. This is not a good answer: it lacks explanation and doesn't provide any information to readers. – phuclv Jan 05 '18 at 01:08
1

Always these answers about how clever compilers are, that people should not even think about the performance of their code, that they should not dare to question Her Cleverness The Compiler, that bla bla bla… and the result is that people get convinced that every time they use % [SOME POWER OF TWO] the compiler magically converts their code into & ([SOME POWER OF TWO] - 1). This is simply not true. If a shared library has this function:

int modulus (int a, int b) {
    return a % b;
}

and a program calls modulus(135, 16), there will be no trace of bitwise magic anywhere in the compiled code. The reason? The compiler is clever, but it did not have a crystal ball when it compiled the library. It sees a generic modulus calculation with no information whatsoever about the fact that only powers of two will be involved, and it leaves it as such.

But you may know that only powers of two will be passed to a function. And if that is the case, the only way to optimize your code is to rewrite your function as

unsigned int modulus_2 (unsigned int a, unsigned int b) {
    return a & (b - 1); // valid only when b is a power of two
}

The compiler cannot do that for you.
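
For illustration, here is a usage sketch with the power-of-two precondition made explicit (the assert and the main function are my additions, not part of the answer):

#include <cassert>

unsigned int modulus_2 (unsigned int a, unsigned int b) {
    assert(b != 0 && (b & (b - 1)) == 0); // b must be a power of two
    return a & (b - 1);
}

int main() {
    assert(modulus_2(135, 16) == 135 % 16); // both evaluate to 7
    return 0;
}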

madmurphy
  • 1,451
  • 11
  • 20
-4

Bitwise operations are much faster. This is why the compiler will use bitwise operations for you. Actually, I think it will be faster to implement it as:

~i & 1

Similarly, if you look at the assembly code your compiler generates, you may see things like x ^= x instead of x = 0. But (I hope) you are not going to use this in your C++ code.

In summary, do yourself, and whoever will need to maintain your code, a favor. Make your code readable, and let the compiler do these micro optimizations. It will do it better.

LJ Germain
  • 467
  • 1
  • 6
  • 14
yosim
  • 503
  • 2
  • 8
  • 2
    ! has higher precedence than &, and Boolean values are numbers equal to zero or one, so `(! i & 1)` is just the same as `!i`. – Potatoswatter Dec 05 '13 at 07:37
  • The (!i) will invert all bits. The bitwise &1 will check if the least significant bit is on. Therefore if i was even, its least significant bit will be 0, the least significant bit of (!i) will be 1, and therefore performing bitwise (&1) will be true. – yosim Dec 05 '13 at 07:43
  • You're thinking of `~`, not `!`. – Potatoswatter Dec 05 '13 at 07:52
  • 2
    @yosim: !i is bool, returning 0 or 1. It's not bitwise. You may be thinking of ~. – rici Dec 05 '13 at 07:58