46

Consider the following code:

unsigned idx;
//.. some work with idx
if( idx >= idx_max )
    idx %= idx_max;

This could be simplified to just the second line:

idx %= idx_max;

and it will achieve the same result.


Several times I have come across code like this:

unsigned x;
//... some work with x
if( x!=0 )
  x=0;

This could be simplified to:

x=0;

The questions:

  • Does it make sense to use the if here, and why? Especially with the ARM Thumb instruction set.
  • Could these ifs be omitted?
  • What optimizations does the compiler perform?
kyb

4 Answers

66

If you want to understand what the compiler is doing, you'll need to pull up some assembly. I recommend this site (I've already entered the code from the question): https://godbolt.org/g/FwZZOb.

The first example is more interesting.

int div(unsigned int num, unsigned int num2) {
    if( num >= num2 ) return num % num2;
    return num;
}

int div2(unsigned int num, unsigned int num2) {
    return num % num2;
}

Generates:

div(unsigned int, unsigned int):          # @div(unsigned int, unsigned int)
        mov     eax, edi
        cmp     eax, esi
        jb      .LBB0_2
        xor     edx, edx
        div     esi
        mov     eax, edx
.LBB0_2:
        ret

div2(unsigned int, unsigned int):         # @div2(unsigned int, unsigned int)
        xor     edx, edx
        mov     eax, edi
        div     esi
        mov     eax, edx
        ret

Basically, the compiler will not optimize away the branch, for very specific and logical reasons. If integer division were about the same cost as comparison, then the branch would be pretty pointless. But integer division (which is how modulus is typically computed) is actually very expensive: http://www.agner.org/optimize/instruction_tables.pdf. The numbers vary greatly by architecture and integer size, but the latency is typically anywhere from 15 to close to 100 cycles.

By taking a branch before performing the modulus, you can actually save yourself a lot of work. Note, though, that the compiler also does not transform the branch-free code into a branch at the assembly level. That's because the branch has a downside too: if the modulus ends up being necessary anyway, you've just wasted a bit of time.

There's no way to make a reasonable determination about the correct optimization without knowing the relative frequency with which idx < idx_max will be true. So the compilers (gcc and clang do the same thing) opt to map the code in a relatively transparent way, leaving this choice in the hands of the developer.

So that branch might have been a very reasonable choice.

The second branch should be completely pointless, because comparison and assignment are of comparable cost. That said, you can see in the link that compilers will still not perform this optimization if they have a reference to the variable. If the value is a local variable (as in your demonstrated code) then the compiler will optimize the branch away.
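
For illustration, here is a minimal sketch of those two cases; the function names are made up, and the comments describe the behaviour the answer attributes to gcc and clang in the linked Godbolt session:

// Through a reference (or pointer), the compilers keep the compare-and-branch:
// an unconditional store could introduce a write that another thread might
// observe, so it is not a legal transformation in general.
void reset_ref(unsigned &x) {
    if (x != 0)
        x = 0;
}

// With a plain local copy, the branch folds away: both paths leave x == 0,
// so the function compiles down to simply returning 0.
unsigned reset_local(unsigned x) {
    if (x != 0)
        x = 0;
    return x;
}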

In sum, the first piece of code is perhaps a reasonable optimization; the second is probably just a tired programmer.

Nir Friedman
  • And the ARM-32 that the OP is using has skip instructions (though not in Thumb mode IIRC), which will mean that there is no branch prediction failure – Antti Haapala -- Слава Україні May 02 '17 at 07:46
  • @AnttiHaapala Taking your word on that, I'm not familiar with ARM at all, clearly you are. Feel free to edit my answer as well. – Nir Friedman May 02 '17 at 08:20
  • @kyb No problem. If you think the answer resolves things for you, feel free to mark it accepted so people know this question is resolved. Cheers. – Nir Friedman May 02 '17 at 09:23
  • 2
@NirFriedman, your answer clarifies even more than I could hope. Thank you once more for your time. – kyb May 02 '17 at 09:38
  • 2
    @NirFriedman: Note that you changed the meaning of original post, and so you added possible bugs. Original question used "unsigned", you used "signed int", so div and div2 are different functions: different results with negative numbers. – Giacomo Catenazzi May 02 '17 at 11:31
  • @GiacomoCatenazzi Sure, good point, changing the types to unsigned does not change assembly in any significant way though. – Nir Friedman May 02 '17 at 11:48
  • Note that many ARM architectures don't include a divide instruction, and rely on compiler-generated (and hopefully highly optimised) instruction sequences. – Jeremy May 02 '17 at 11:56
  • Nevertheless, I think you should edit the question to use the correct types; your argument that "the compiler doesn't elide the branch, this must be related to optimization" doesn't actually make sense when you're showing code that's *not equivalent* with the branch elided. Also, I'm not quite sold on the argument that the branch is likely to waste time if *inserted* as an optimization; don't compilers targeting modern architectures typically assume that branch prediction will be pretty good? – Kyle Strand May 02 '17 at 16:57
  • @KyleStrand Very well, wish granted. There's not much to argue about. If you do have a branch, and then the modulus ends up being necessary anyway, then now you computed a branch *and* did the modulus. That is strictly slower than just doing the modulus. The point is that you are trading performance in one case, for performance in another; it's not a strict win. Whether it's a win on average to do the branch depends on the distribution. And many people don't care about the average, they just want the best worst case. For those people the branch makes no sense. – Nir Friedman May 02 '17 at 17:09
  • 1
    Fair enough. Thank you. – Kyle Strand May 02 '17 at 17:10
  • Branch misprediction is still more costly than division in general, so if the branch is not well-predicted, this is a pessimization. – Cody Gray - on strike May 03 '17 at 07:59
  • @CodyGray Citation needed. – Nir Friedman May 03 '17 at 08:00
  • Agner Fog's resources that you cite are a perfectly adequate source. As is any benchmark you could run. I've run plenty of them with real-world code. – Cody Gray - on strike May 03 '17 at 08:01
  • 1
    @CodyGray There's literally dozens if not hundreds of lines that have numbers we could compare. Doing the arithmetic from this answer: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array showed about a 7.5 cycle difference between perfect prediction and random (50-50) prediction (i.e. 15 cycles for a miss). For every architecture I've seen listed, integer division costs significantly more than 15 cycles. – Nir Friedman May 03 '17 at 08:09
  • Lots of things come into play, and microbenchmarks are often rather misleading. For one thing, there's the difference between latency and throughput that you seem to be ignoring in the numbers that you're citing. Also, there's the fact that in real-world code, you aren't doing just a single division (or branch-and-divide), and then calling it quits. There are other instructions that can be interleaved throughout, or even additional divisions, and that means the latency is essentially free. Not so for a branch misprediction. Also, even correctly-predicted branches have a cost, consuming... – Cody Gray - on strike May 03 '17 at 08:19
  • ...resources and decreasing throughput. Haswell has an 18-20 cycle branch misprediction penalty, but that's the only number I know off the top of my head, and simple cycle counting hasn't been meaningful since the 8086 days. I really didn't want to write up a complete answer to this, nor did I want to have to think through how to write a good benchmark to demonstrate it. I just left a comment stating what I know to be a fact, based on extensive testing and understanding of the microarchitecture. You are free to disagree with me, of course. – Cody Gray - on strike May 03 '17 at 08:21
  • @CodyGray So basically, your response is reference to authority, with yourself as the authority, no less? Whereas here: https://youtu.be/nXaxk27zwlk?t=1h13m8s, the head of one the most important C++ compiler projects on earth, giving a keynote at the most prolific C++ conference on earth, says unequivocally that the branch is faster, even at 50-50 branch distribution? So either you, or Chandler Carruth messed up the benchmarks. His code is out in the open, you aren't willing to share yours... I don't really know what else to say. – Nir Friedman May 03 '17 at 08:34
6

There are a number of situations where writing a variable with a value it already holds may be slower than reading it, finding out that it already holds the desired value, and skipping the write. Some systems have a processor cache which sends all write requests to memory immediately. While such designs aren't commonplace today, they used to be quite common, since they can offer a substantial fraction of the performance boost that full read/write caching can offer, but at a small fraction of the cost.

Code like the above can also be relevant in some multi-CPU situations. The most common such situation would be when code running simultaneously on two or more CPU cores will be repeatedly hitting the variable. In a multi-core caching system with a strong memory model, a core that wants to write a variable must first negotiate with other cores to acquire exclusive ownership of the cache line containing it, and must then negotiate again to relinquish such control the next time any other core wants to read or write it. Such operations are apt to be very expensive, and the costs will have to be borne even if every write is simply storing the value the storage already held. If the location becomes zero and is never written again, however, both cores can hold the cache line simultaneously for non-exclusive read-only access and never have to negotiate further for it.

In almost all situations where multiple CPUs could be hitting a variable, the variable should at minimum be declared volatile. The one exception, which might be applicable here, would be in cases where all writes to a variable that occur after the start of main() will store the same value, and code would behave correctly whether or not any store by one CPU was visible in another. If doing some operation multiple times would be wasteful but otherwise harmless, and the purpose of the variable is to say whether it needs to be done, then many implementations may be able to generate better code without the volatile qualifier than with, provided that they don't try to improve efficiency by making the write unconditional.
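
As a rough sketch of that pattern on a compiler with C++11 atomics (the flag name and the use of std::atomic rather than volatile are my assumptions, not part of the answer):

#include <atomic>

std::atomic<unsigned> work_pending{1};   // hypothetical flag shared between cores

void clear_if_needed() {
    // Read first; only perform the store (and thus request exclusive
    // ownership of the cache line) if the value actually needs to change.
    if (work_pending.load(std::memory_order_relaxed) != 0)
        work_pending.store(0, std::memory_order_relaxed);
}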

Incidentally, if the object were accessed via pointer, there would be another possible reason for the above code: if a function is designed to accept either a const object where a certain field is zero, or a non-const object which should have that field set to zero, code like the above might be necessary to ensure defined behavior in both cases.
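
A sketch of that last case, with a hypothetical struct and function name; the point is that the guarded write never stores into an object that may actually be const:

struct Record { unsigned dirty; };

// May be handed a pointer derived from a const Record whose dirty field is
// already 0. Writing unconditionally would be undefined behavior in that
// case; the conditional write leaves the const object untouched.
void ensure_clean(Record *r) {
    if (r->dirty != 0)
        r->dirty = 0;
}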

supercat
  • And then there are load-linked conditional stores, which unlike compare-exchange, will fail following a "useless" write. – Ben Voigt May 02 '17 at 21:49
  • @BenVoigt: Load-linked stores offer a stronger guarantee than compare-exchange or so-called compare-and-swap in cases where it succeeds, but is generally allowed to fail arbitrarily so long as there are at least some useful cases where repeated retries are likely to work. In many cases, the usable semantics end up being comparable to cx/cas. – supercat May 02 '17 at 22:17
  • But in the context of this question, the useless write can be expected to break the load-link, and fail the conditional store. Just offering one more example of "situations where writing a variable with a value it already holds may be slower than reading it, finding out already holds the desired value, and skipping the write". Although to be pedantic, while the LL/SC case is merely slowed down by the useless write if there's a retry loop, in the most general case it can be an altogether different eventual result. – Ben Voigt May 02 '17 at 22:23
  • Presumably hardware breakpoints would be another example where the useless write matters greatly. – Ben Voigt May 02 '17 at 22:24
  • @BenVoigt: If hardware is attached to a memory address, the `volatile` qualifier would often be needed for any kind of useful semantics with any lvalues other than pointer-converted constants. While the Standard may allow an implementation to assume that operations on `(*(uint32_t*)0xF0001208)` can be consolidated absent a `volatile` qualifier, enough hardware vendors omit `volatile` qualifiers from their headers that such "optimizations" should by default be considered ill-advised in most cases. – supercat May 02 '17 at 22:58
  • Agreed, unless `volatile` is in effect, the "useless" write might not actually happen. – Ben Voigt May 02 '17 at 23:08
  • 2
    "In almost all situations where multiple CPUs could be hitting a variable, the variable should at minimum be declared volatile" I think this is completely incorrect: volatile should *only* be used when said variable can be written to by a non C++ process. – Nir Friedman May 03 '17 at 01:32
  • @NirFriedman: I said *at least*. Quality implementations of newer standards for multi-core machines will support proper atomic types which should be used instead of `volatile`, but some implementations which predate those standards process `volatile` in a way that would achieve the required semantics. Absent `volatile`, it would be unlikely that any implementation would achieve required semantics *unless* the only consequence of seeing old rather than new data would be the performance of some redundant work. I would suggest that in applications where extra work would be... – supercat May 03 '17 at 03:03
  • ...the only consequence of having code see old data, a quality implementation should not force a programmer to do redundant work to avoid more severe consequences. Unfortunately, it is fashionable for compilers to assume UB doesn't mean "Do what makes sense for an implementation's target and purpose", but rather "Assume code will never receive input that would invoke UB, whether or not such an assumption would make sense". – supercat May 03 '17 at 03:05
  • @supercat I can't argue with you about "some implementations" because there are so many architectures and implementations out there I have no doubt there are some for which you are correct. But you did not restrict yourself to a specific set of implementations. As general advice, this is still poor because volatile is not required (and on many/most platforms, has never) to give anything in the way of thread safety. Many older implementations, volatile is not thread safe, but there are third party thread safety libraries, which you should be using. – Nir Friedman May 03 '17 at 03:11
  • As for what compilers assume, without getting into an argument over whether it's unfortunate or not, the reality is that they do perform crazy optimizations on UB, so telling people to do things that may result in UB (like, trying to use volatile in multi-threaded situations) is not good advice. – Nir Friedman May 03 '17 at 03:12
  • Declaring the variable as `volatile` doesn't fix a data race if there is one; and it's unnecessary if there isn't a data race. To solve data races you can make the variable `_Atomic`, or use synch objects etc. – M.M May 03 '17 at 04:58
  • @M.M What happened to your answer? I went to look at the comments to see if it had been resolved whether removing the branch in the second case is technically an illegal optimization, and now it seems to be gone? Do you know whether the optimization is legal or not? I was considering creating a question. – Nir Friedman May 03 '17 at 07:33
  • @NirFriedman I think my answer is mostly wrong , and yours and supercat's cover the main points. Maybe a separate question specifically for case 2 in a multithreaded environment would be good – M.M May 03 '17 at 07:45
  • @M.M: The newer atomic types should be used on compilers that support them. If one is targeting a compiler that doesn't support such types, either `volatile` will work, or it will be impossible to make anything work without system-dependent code. – supercat May 03 '17 at 14:36
2

Regarding the first block of code: this is a micro-optimization based on Chandler Carruth's recommendations for Clang (see here for more info); however, it doesn't necessarily hold that it would be a valid micro-optimization in this form (using if rather than ternary) or on any given compiler.

Modulo is a reasonably expensive operation; if the code is executed often and there is a strong statistical lean to one side or the other of the conditional, the CPU's branch prediction (given a modern CPU) will significantly reduce the cost of the branch instruction.
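
For illustration, here is a sketch of the two forms that distinction refers to, wrapped in hypothetical functions so they are self-contained:

// Statement form, as written in the question:
unsigned wrap_if(unsigned idx, unsigned idx_max) {
    if (idx >= idx_max)
        idx %= idx_max;
    return idx;
}

// Expression (ternary) form, closer to what Carruth benchmarks:
unsigned wrap_ternary(unsigned idx, unsigned idx_max) {
    return idx >= idx_max ? idx % idx_max : idx;
}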

metamorphosis
1

Using the if there seems like a bad idea to me.

You are right. Whether or not idx >= idx_max, it will be under idx_max after idx %= idx_max. If idx < idx_max, it will be unchanged, whether the if is followed or not.

While you might think that branching around the modulo would save time, the real culprit, I'd say, is that when branches are mispredicted, modern pipelined CPUs have to flush their pipeline, and that costs a relatively large amount of time. Better, then, not to risk a branch than to save an integer modulo, which costs roughly as much time as an integer division.

EDIT: It turns out that the modulus is pretty slow vs. the branch, as suggested by others here. Here's a guy examining this exact same question: CppCon 2015: Chandler Carruth "Tuning C++: Benchmarks, and CPUs, and Compilers! Oh My!" (suggested in another SO question linked to in another answer to this question).

This guy writes compilers, and thought it would be faster without the branch; but his benchmarks proved him wrong. Even when the branch was taken only 20% of the time, it tested faster.

Another reason not to have the if: One less line of code to maintain, and for someone else to puzzle out what it means. The guy in the above link actually created a "faster modulus" macro. IMHO, this or an inline function is the way to go for performance-critical applications, because your code will be ever so much more understandable without the branch, but will execute as fast.
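
A hedged sketch of what such a helper might look like; the exact macro from the talk isn't reproduced here, and the names FAST_MOD and fast_mod are made up for illustration:

// Macro form, in the spirit of the talk's "faster modulus":
#define FAST_MOD(x, m) ((x) >= (m) ? (x) % (m) : (x))

// Inline-function form, which keeps the branch out of the calling code
// while remaining type-safe:
static inline unsigned fast_mod(unsigned x, unsigned m) {
    return x >= m ? x % m : x;
}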

Finally, the guy in the above video is planning to make this optimization known to compiler writers. Thus, the if will probably be added for you, even if it isn't in your code. Hence, just the mod alone will do, once this comes about.

CodeLurker
  • However, cache contention can be far more expensive than a pipeline flush. – Ben Voigt May 02 '17 at 06:48
  • An integer modulo, where both inputs are not known at compile time, is staggeringly expensive. Depends on the exact architecture as always, but it could be over 100 cycles. I'm having trouble believing that a simple branch can be that expensive. – Nir Friedman May 02 '17 at 07:01
  • @NirFriedman It *could* be over 100 cycles, but according to [Intel](https://software.intel.com/sites/default/files/managed/9e/bc/64-ia-32-architectures-optimization-manual.pdf), that would need 64-bit values on a Silvermont microarchitecture (38-123 cycles latency). Goldmont microarchitecture on 32-bit values is 12-25 cycles latency (and Silvermont on 32-bits is only 26-38). I think that is easily within the range of a branch misprediction. I don't know about Arm. – Martin Bonner supports Monica May 02 '17 at 07:44
  • @MartinBonner That's a fair point, and thanks for clarifying (I was not doing a good job reading some of those tables) but a) the average values for integer division are still quite a bit higher, and b) you expect the branch predictor to win at least half the time, right? You'd hope ;-). I'm also not sure if for an ultra short branch, you will pay the full penalty. After all, all the subsequent code after just one line is identical. Overall I would say this situation probably calls for a carefully crafted micro-benchmark. – Nir Friedman May 02 '17 at 08:33
  • The branch predictor *can* be wrong every single time, but in reality "about half" is typical worst case. The problem with a short branch miss is that you don't just have to unwind the single wrong instruction, you have to unwind the several instructions after that which have also started executing. See the classic "why is processing a sorted vector faster" answer. – Martin Bonner supports Monica May 02 '17 at 10:16
  • @MartinBonner Interesting, I did the arithmetic from that question and came up with the fact that the sorted data was about 3 ns faster per iteration. Multiplying by 2.5 Ghz for the clock speed, we conclude that perfect prediction vs random prediction is about 7.5 cycles, or that a miss is worth about 15 cycles. It's definitely closer than I thought, at least on some architectures. I might have to run that micro benchmark and update my answer. – Nir Friedman May 02 '17 at 15:42