40

I saw the chosen answer to this post.

I was surprised that `(x & 255) == (x % 256)` if x is an unsigned integer, and I wondered whether it always makes sense to replace `%` with `&` in `x % n` for n = 2^a (a = 1, 2, ...) and x being a positive integer.

This is a special case in which I, as a human, can decide, because I know which values the program will deal with, while the compiler does not. Can I gain a significant performance boost if my program uses a lot of modulo operations?

Sure, I could just compile and look at the disassembly. But that would only answer my question for one compiler/architecture. I would like to know whether this is faster in principle.

Willi Mentzel
  • 12
    You mean `x & 255 == x % 256`. And for unsigned arithmetic any compiler worth its salt will produce the same code for both. – StoryTeller - Unslander Monica Nov 23 '16 at 08:56
  • Nope. You mean "x % 256". – Werner Henze Nov 23 '16 at 08:56
  • I think to a certain degree, your question is contradicting itself. You assume that checking the output of a compiler *might* only give you insights for **one** platform/architecture; but at the same time you expect that somebody can give you an answer *in principle* here. And that doesn't really match up. Either this is platform dependent, or it is not. – GhostCat Nov 23 '16 at 08:56
  • Without compiler optimization I would think so, since one is a logical bitwise operation and the other requires an integer division, which used to be relatively slow. Suggestion: measure instead of using this answer blindly. – gast128 Nov 23 '16 at 08:57
  • Judging from [this answer](http://stackoverflow.com/questions/20393373/performance-wise-how-fast-are-bitwise-operator-vs-normal-modulus) I'd say it does make a speed difference, since compilers do it if they can (if n is static). Doesn't say much for how significant it is though – Felk Nov 23 '16 at 08:58
  • 31
    Never write "optimized" code. Write code that reflects what you want to do, and compiler optimizations will do their job. Writing such code decreases readability and portability, and often breaks compiler optimizations – Garf365 Nov 23 '16 at 08:59
  • 1
    @Garf365 I have a soon-to-be PhD friend who makes DBMSs more energy efficient... he says exactly the opposite... although I have to say that I am on your side! :) – Willi Mentzel Nov 23 '16 at 09:01
  • 13
    Your PhD friend sounds like someone who never had to write code in a group and maintain it for long – StoryTeller - Unslander Monica Nov 23 '16 at 09:03
  • 2
    @Garf365 for _most_ applications, that rule is true. For applications with performance concerns, optimization is a higher priority, though, and could be considered more important than that extra 5% readability. It's just a rule of thumb after all – Felk Nov 23 '16 at 09:04
  • 11
    @Felk, another rule of thumb is to get correct code first, and do benchmarks with optimizations later. It's harder when the code isn't clear. – StoryTeller - Unslander Monica Nov 23 '16 at 09:05
  • 2
    @Garf365: this is a good general guideline, but if speed matters one can sacrifice readability for performance. Suggestion not to scatter this in code then, but use a readable inline function like 'size_t ModuloFast(size_t n)' – gast128 Nov 23 '16 at 09:06
  • @Felk, gast128, you're right, but this kind of application is a really special case, in which portability, maintainability and readability come after performance. And in most cases only a small part of the complete software has to be optimized, so only a small part of the code has to be "optimized", embedded in a well-named function (as gast128 said) and after benchmarks (as StoryTeller said) – Garf365 Nov 23 '16 at 09:09
  • 1
    @StoryTeller sounds like you know him!! :D – Willi Mentzel Nov 23 '16 at 09:09
  • 3
    More like I know the type :) Brilliant people, but they seem to misunderstand collaboration in software development is just as important as writing kickass code. – StoryTeller - Unslander Monica Nov 23 '16 at 09:10
  • 1
    @StoryTeller sad but true... it is easier to read his disassembly than to read his actual C++ code... xD he is a genius, though :) – Willi Mentzel Nov 23 '16 at 09:13
  • 4
    As you saw, this should make no difference for performance. Personally I usually prefer the bitwise AND: for me it's easier to immediately visualize a bitstring being truncated than to think about modular arithmetic and remember that remainder by a power of two is the special nice case that I can then again visualize as a bitstring being truncated. – harold Nov 23 '16 at 10:37
  • How likely is it that you have a system where you do a lot of divisions with an integer constant that's a multiple of 8? That's the only case where this makes sense. It would seem to me that divisions with a variable value would be far more common. – Lundin Nov 23 '16 at 10:49
  • Sometimes you don’t want to rely on the compiler doing the optimization, but be absolutely sure that the code has the right shape. Besides, while I definitely second that readability is more important than performance, I have to ask why `x % 256` is deemed more readable than `x & 255`. All I see is a variable, a programming-language-specific operator, and a constant. Neither `%` nor `&` are common mathematical operators outside the programming language’s world. Whether the reader understands either depends on the reader’s knowledge of the programming language. – Holger Nov 23 '16 at 11:09
  • @Holger I see your point, but since & is far more universal and powerful than %, I bet most people know what % will do but not what & will... :( – Willi Mentzel Nov 23 '16 at 11:10
  • 2
    Well, I think programming with care for the hypothetical least educated reader leads to nowhere. There is a difference between code relying on some tricky side effect and code using functions or operators, which might be lesser known than others, right in the intended way. If you stumble over the `&` operator and don’t know what it does, you can read about it and every documentation there is will be sufficient for understanding the meaning of `x & 255`. This is not a special corner case of the `&` operator. – Holger Nov 23 '16 at 11:23
  • I’ve also seen the opposite, where the reader had to understand that `%` or `/` with a power of two is actually meaning masking or right shifting… – Holger Nov 23 '16 at 11:23
  • *it is easier to read his disassembly than to read his actual C++ code*: if that was actually true for you, you would already know that [AND is much faster than DIV](http://stackoverflow.com/questions/40354978/why-is-this-c-code-faster-than-my-hand-written-assembly-for-testing-the-collat/40355466#40355466), and that any compiler worth its salt would try hard to avoid using an actual DIV instruction for `%` or `/` operations. :P (% or / by a compile-time constant compiles to a shift or AND, or for non-power-of-2 to a modular multiplicative inverse). – Peter Cordes Nov 23 '16 at 12:52
  • 1
    @PeterCordes I never said that it is true for me and this was a joke.. no reason to be mean – Willi Mentzel Nov 23 '16 at 12:56
  • I was also trying to be funny, sorry it didn't work out :/ But mostly I just wanted to post a link where I explained *exactly* how slow a DIV instruction is on Intel Haswell, compared to AND or SHR. Not necessarily relevant here, since an important part of this question is whether the compiler needs this help, not what's fast in asm. – Peter Cordes Nov 23 '16 at 12:59
  • 1
    You should change `x & 255 == x % 256` to `(x & 255) == x % 256` because `x & 255 == x % 256` means `x & (255 == x % 256)`. – v7d8dpo4 Nov 23 '16 at 13:04
  • Note that intent of the code is important: if you need to get mathematical result, then prefer `%`, but if you need to have 8 lowest bits, then use always `&`. – user694733 Nov 23 '16 at 13:19

2 Answers

46

If your integral type is unsigned, the compiler will optimize it, and the result will be the same. If it's signed, something is different...

This program:

int mod_signed(int i) {
  return i % 256;
}
int and_signed(int i) {
  return i & 255;
}
unsigned mod_unsigned(unsigned int i) {
  return i % 256;
}
unsigned and_unsigned(unsigned int i) {
  return i & 255;
}

will be compiled (by GCC 6.2 with -O3; Clang 3.9 produces very similar code) into:

mod_signed(int):
        mov     edx, edi
        sar     edx, 31
        shr     edx, 24
        lea     eax, [rdi+rdx]
        movzx   eax, al
        sub     eax, edx
        ret
and_signed(int):
        movzx   eax, dil
        ret
mod_unsigned(unsigned int):
        movzx   eax, dil
        ret
and_unsigned(unsigned int):
        movzx   eax, dil
        ret

The resulting assembly of mod_signed is different because

If both operands to a multiplication, division, or modulus expression have the same sign, the result is positive. Otherwise, the result is negative. The result of a modulus operation's sign is implementation-defined.

and AFAICT, most implementations decided that the sign of the result of a modulus expression is always the same as the sign of the first operand. See this documentation.

Hence, mod_signed is optimized to (from nwellnhof's comment):

int d = i < 0 ? 255 : 0;
return ((i + d) & 255) - d;

Logically, we can prove that `i % 256 == (i & 255)` for all unsigned integers; hence, we can trust the compiler to do its job.

Danh
  • 3
    If it's **signed**, something is difference. – Jonas Nov 23 '16 at 09:01
  • 2
    The compiler can't optimize it with `n` not being static though – Felk Nov 23 '16 at 09:02
  • 2
    @Jonas Fixed that typo – Danh Nov 23 '16 at 09:03
  • 1
    @Felk If we don't know it statically, we can't change it to a bitwise and anyway! And if we choose comparison and branching, I don't think it's good – Danh Nov 23 '16 at 09:03
  • @Danh yes, but OP said in his case n will be suitable for this optimization at runtime – Felk Nov 23 '16 at 09:08
  • Even if this optimization would not happen, I would consider the CARP rule (the computer is really powerful). If you don't know the speed differences between an & and a %, then you shouldn't really switch one for the other. – Nonanon Nov 23 '16 at 09:27
  • MSVC also makes the optimizations as depicted here. (Although it does emit slightly different code for `mod_signed`, it is logically equivalent and would have no significant differences in performance.) I point this out because MSVC's output is not available through Godbolt's Compiler Explorer. – Cody Gray - on strike Nov 23 '16 at 12:10
  • @felk I don't understand what you are trying to say. First of all, I don't see the identifier `n` anywhere here, so I assume you mean the input to the function, `i`. And `i` absolutely does not have to be "static" (by which I suppose you mean a compile-time constant?) in order for the compiler to optimize it. Danh has shown the object code that the compiler will emit for these four functions. If they get inlined, then that code will be inlined. There may be *further* optimizations made in that case if the function input is a compile-time constant, but you'll *always* get these transformations. – Cody Gray - on strike Nov 23 '16 at 12:14
  • I did some [further testing](https://godbolt.org/g/ocAjsY) based on your example, and it looks like both GCC and Clang are also smart enough to optimize `i % (1 << k)` into the equivalent of `i & ((1 << k) - 1)` if `i` is unsigned. For signed `i`, however, both fall back to using `idiv` for `%` even if the modulus is guaranteed to be a power of 2! – Ilmari Karonen Nov 23 '16 at 14:05
  • 2
    For anyone who's curious, the code generated for `mod_signed` is equivalent to: `int d = i < 0 ? 255 : 0; return ((i + d) & 255) - d;` – nwellnhof Nov 23 '16 at 14:49
2

I did some measurements with gcc, and if the argument of a / or % is a compile-time constant that's a power of 2, gcc can turn it into the corresponding bit operation.

Here are some of my benchmarks for divisions: *What has a better performance: multiplication or division?* As you can see, the running times with divisors that are statically known powers of two are noticeably lower than with other statically known divisors.

So if / and % with statically known power-of-two arguments describe your algorithm better than bit ops, feel free to prefer / and %. You shouldn't lose any performance with a decent compiler.

Petr Skocik
  • 3
    For / and %, this is only true with unsigned. You won't lose *much* performance, but implementing signed division by a constant power of 2 takes more than just an arithmetic right shift (to handle the difference in rounding for negative numbers). Similarly, it takes extra work to use a modular multiplicative inverse for div or mod by a non-power-of-2 on a signed integer. – Peter Cordes Nov 23 '16 at 12:57
  • @PeterCordes However, the compiler is clever enough to know that `(a % b + b) % b` can be turned into a bit operation. I haven't found an optimizable expression that can turn division into shifting though. – Random832 Nov 23 '16 at 14:14
  • @Random832: I meant for division by a compile-time constant. – Peter Cordes Nov 23 '16 at 22:06