
When talking about the performance of ifs, we usually talk about how mispredictions can stall the pipeline. The recommended solutions I see are:

  1. Trust the branch predictor for conditions that usually have one result; or
  2. Avoid branching with a little bit of bit-magic if reasonably possible; or
  3. Conditional moves where possible (see the sketch just below this list).
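
Just to be concrete about 2 and 3, here's a rough sketch (made-up function names; whether the ternary actually becomes a conditional move is up to the compiler and flags):

// Option 2: branchless selection with bit tricks (assumes two's-complement int)
int maxBitMagic(int a, int b) {
    int mask = -(int)(a > b);          // all ones if a > b, otherwise zero
    return (a & mask) | (b & ~mask);   // picks a when a > b, otherwise b
}

// Option 3: hope the compiler turns the ternary into a cmov rather than a branch
int maxCmov(int a, int b) {
    return (a > b) ? a : b;
}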

What I couldn't find was whether or not we can calculate the condition early to help where possible. So, instead of:

... work
if (a > b) {
    ... more work
}

Do something like this:

bool aGreaterThanB = a > b;
... work
if (aGreaterThanB) {
    ... more work
}

Could something like this potentially avoid stalls on this conditional altogether (depending on the length of the pipeline and the amount of work we can put between the bool and the if)? It doesn't have to be as I've written it, but is there a way to evaluate conditionals early so the CPU doesn't have to try and predict branches?

Also, if that helps, is it something a compiler is likely to do anyway?

Jibb Smart
  • @MitchWheat -- I don't see how "values are not known until run time" relates to the question. It is my understanding that by the time the conditional is evaluated, the CPU has guessed what comes next, which may or may not be correct. What I'm wondering is if there's a way to calculate that conditional early so that the CPU doesn't have to guess, although I suppose I haven't asked the question very clearly. EDIT: I've edited the question to make my intent more clear – Jibb Smart Apr 20 '18 at 00:54
  • @BenVoigt -- Gotcha. That makes sense. If you made your comments into an answer (and given enough time for other people also more knowledgeable than I am in this area to challenge it if need be), I'll accept it. You've answered the question, and your comments have more than enough info to qualify for an answer, IMHO. Thanks! – Jibb Smart Apr 20 '18 at 01:03
  • There is [a nice paper from MICRO-45](https://people.engr.ncsu.edu/ericro/publications/conference_MICRO-45.pdf) that tries to answer your exact question. They find about 38% of conditional branches from their selection of benchmarks could take advantage of early evaluation (decoupling). It does require ISA modifications however. – hayesti Jun 01 '18 at 17:20
  • @hayesti Wow, that's very cool! That answers the question really well. – Jibb Smart Jun 02 '18 at 02:45

3 Answers


Yes, it can be beneficial to allow the branch condition to be calculated as early as possible, so that any misprediction can be resolved early and the front-end part of the pipeline can start re-filling early. In the best case, the misprediction can be free if there is enough work already in flight to totally hide the front-end bubble.

Unfortunately, on out-of-order CPUs, "early" has a somewhat subtle definition, and so getting the branch to resolve early isn't as simple as just moving lines around in the source - you'll probably have to make a change to the way the condition is calculated.

What doesn't work

"Earlier" doesn't refer to the position of the condition or the branch in the source file, nor does it refer to the positions of the assembly instructions corresponding to the comparison or branch. So at a fundamental level, it mostly7 doesn't work as in your example.

Even if source level positioning mattered, it probably wouldn't work in your example because:

You've moved the evaluation of the condition up and assigned it to a bool, but it's not the test (the > operator) that can mispredict, it's the subsequent conditional branch: after all, it's a branch misprediction. In your example, the branch is in the same place in both versions: its form has simply changed from if (a > b) to if (aGreaterThanB).

Beyond that, the way you've transformed the code isn't likely to fool most compilers. Optimizing compilers don't emit code line-by-line in the order you've written it, but rather schedule things as they see fit based on the source-level dependencies. Pulling the condition up earlier will likely just be ignored, since compilers will want to put the check where it would naturally go: approximately right before the branch on architectures with a flag register1.

For example, consider the following two implementations of a simple function, which follow the pattern you suggested. The second function moves the condition up to the top of the function.

int test1(int a, int b) {
    int result = a * b;
    result *= result;
    if (a > b) {
        return result + a;
    }
    return result + b * 3;
}

int test2(int a, int b) {
    bool aGreaterThanB = a > b;
    int result = a * b;
    result *= result;
    if (aGreaterThanB) {
        return result + a;
    }
    return result + b * 3;
}

I checked gcc, clang2 and MSVC, and all compiled both functions identically (the output differed between compilers, but for each compiler the output was the same for the two functions). For example, compiling test2 with gcc resulted in:

test2(int, int):
  mov eax, edi
  imul eax, esi
  imul eax, eax
  cmp edi, esi
  jg .L4
  lea edi, [rsi+rsi*2]
.L4:
  add eax, edi
  ret

The cmp instruction corresponds to the a > b condition, and gcc has moved it back down past all the "work" and put it right next to the jg which is the conditional branch.

What does work

So if we know that simple manipulation of the order of operations in the source doesn't work, what does work? As it turns out, anything you can do to move the branch condition "up" in the data flow graph might improve performance by allowing the misprediction to be resolved earlier. I'm not going to get deep into how modern CPUs depend on dataflow, but you can find a brief overview here with pointers to further reading at the end.

Traversing a linked list

Here's a real-world example involving linked-list traversal.

Consider the task of summing all the values in a null-terminated linked list which also stores its length3 as a member of the list head structure. The linked list is implemented as one list_head object and zero or more list nodes (with a single int value payload), defined like so:

struct list_node {
    int value;
    list_node* next;
};

struct list_head {
    int size;
    list_node *first;
};

The canonical search loop would use the node->next == nullptr sentinel in the last node to determine that it has reached the end of the list, like this:

long sum_sentinel(list_head list) {
    int sum = 0;
    for (list_node* cur = list.first; cur; cur = cur->next) {
        sum += cur->value;
    }
    return sum;
}

That's about as simple as it gets.

However, this puts the branch that ends the summation (the one that checks for cur == nullptr) at the end of the node-to-node pointer chasing, which is the longest dependency chain in the data flow graph. If this branch mispredicts, the resolution of the mispredict will occur "late" and the front-end bubble will add directly to the runtime.

On the other hand, you could do the summation by explicitly counting nodes, like so:

long sum_counter(list_head list) {
    int sum = 0;
    list_node* cur = list.first;
    for (int i = 0; i < list.size; cur = cur->next, i++) {
        sum += cur->value;
    }
    return sum;
}

Comparing this to the sentinel solution, it seems like we have added extra work: we now need to initialize, track and update the count4. The key, however, is that this counter dependency chain is very short, so it will "run ahead" of the pointer-chasing work and the misprediction will occur early, while there is still valid pointer-chasing work remaining to do, possibly with a large improvement in runtime.

Let's actually try this. First we examine the assembly for the two solutions, so we can verify there isn't anything unexpected going on:

<sum_sentinel(list_head)>:
test   rsi,rsi
je     1fe <sum_sentinel(list_head)+0x1e>
xor    eax,eax
loop:
add    eax,DWORD PTR [rsi]
mov    rsi,QWORD PTR [rsi+0x8]
test   rsi,rsi
jne    loop
cdqe   
ret    


<sum_counter(list_head)>:
test   edi,edi
jle    1d0 <sum_counter(list_head)+0x20>
xor    edx,edx
xor    eax,eax
loop:
add    edx,0x1
add    eax,DWORD PTR [rsi]
mov    rsi,QWORD PTR [rsi+0x8]
cmp    edi,edx
jne    loop
cdqe   
ret    

As expected, the sentinel approach is slightly simpler: one less instruction during setup, and one less instruction in the loop5, but overall the key pointer-chasing and addition steps are identical, and we expect this loop to be dominated by the latency of the successive node-pointer loads.

Indeed, the two loops perform virtually identically when summing short or long lists in cases where the prediction impact is negligible. For long lists, the branch prediction impact is automatically small, since the single misprediction when the end of the list is reached is amortized across many nodes, and the runtime asymptotically reaches almost exactly 4 cycles per node for lists contained in L1, which is what we expect given Intel's best-case 4-cycle load-to-use latency.

For short lists, branch misprediction is negligible if the pattern of list lengths is predictable: either always the same or cycling with some moderate period (which can be 1000 or more with good prediction!). In this case the time per node can be less than 4 cycles when summing many short lists, since multiple lists can be in flight at once (e.g., if summing an array of lists). In any case, both implementations perform almost identically. For example, when lists always have 5 nodes, the time to sum one list is about 12 cycles with either implementation:

** Running benchmark group Tests written in C++ **
                     Benchmark   Cycles   BR_MIS
       Linked-list w/ Sentinel    12.19     0.00
          Linked-list w/ count    12.40     0.00

Let's add branch prediction to the mix, by changing the list generation code to create lists with an average length of 5, but with the actual length uniformly distributed in [0, 10]. The summation code is unchanged: only the input differs.
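
As a rough illustration of the kind of list-generation change that implies (a hypothetical sketch only - the actual harness is in the uarch-bench commit linked in the comments below), using the list_node/list_head types from above:

#include <cstdlib>

// Build one list whose length is drawn uniformly from [0, 10] (average 5).
// Nodes are heap-allocated and deliberately never freed: benchmark sketch only.
list_head make_random_list() {
    int len = std::rand() % 11;              // uniform in [0, 10]
    list_head head{len, nullptr};
    list_node** link = &head.first;
    for (int i = 0; i < len; i++) {
        *link = new list_node{i, nullptr};   // arbitrary payload value
        link = &(*link)->next;
    }
    return head;
}

The results with random list lengths: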

** Running benchmark group Tests written in C++ **
                     Benchmark   Cycles   BR_MIS
       Linked-list w/ Sentinel    43.87     0.88
          Linked-list w/ count    27.48     0.89

The BR_MIS column shows that we get nearly one branch misprediction per list6, as expected, since the loop exit is unpredictable.

However, the sentinel algorithm now takes ~44 cycles versus the ~27.5 cycles of the count algorithm. The count algorithm is about 16.5 cycles faster. You can play with the list lengths and other factors, and change the absolute timings, but the delta is almost always around 16-17 cycles, which, not coincidentally, is about the same as the branch misprediction penalty on recent Intel! By resolving the branch condition early, we avoid the front-end bubble, during which nothing would be happening at all.

Calculating iteration count ahead of time

Another example would be something like a loop which calculates a floating point value, say a Taylor series approximation, where the termination condition depends on some function of the calculated value. This has the same effect as above: the termination condition depends on the slow loop carried dependency, so it is just as slow to resolve as the calculation of the value itself. If the exit is unpredictable, you'll suffer a stall on exit.

If you could change that to calculate the iteration count up front, you could use a decoupled integer counter as the termination condition, avoiding the bubble. Even if the up-front calculation adds some time, it could still provide an overall speedup (and the calculation can run in parallel with the first iterations of the loop anyway, so it may be much less costly than you'd expect by looking at its latency).
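
For instance, here's a hypothetical sketch (not benchmarked here) of that transformation for exp(x) via its Taylor series: the first version's exit branch hangs off the floating-point dependency chain, while the second uses a decoupled integer counter (here just a fixed bound good enough for |x| <= 1; a real version would compute the bound from x up front):

#include <cmath>

// Exit condition depends on the computed term: the branch can only resolve
// after the slow FP chain produces each term, so a mispredicted exit is late.
double exp_data_dependent(double x) {
    double sum = 1.0, term = 1.0;
    for (int n = 1; std::fabs(term) > 1e-12; n++) {
        term *= x / n;
        sum += term;
    }
    return sum;
}

// Decoupled integer counter: the exit branch depends only on the short n
// chain, so even when the trip count varies it resolves early.
double exp_counted(double x) {
    double sum = 1.0, term = 1.0;
    for (int n = 1; n <= 20; n++) {   // bound adequate for |x| <= 1
        term *= x / n;
        sum += term;
    }
    return sum;
}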


1 MIPS is an interesting exception here having no flag registers - test results are stored directly into general purpose registers.

2 Clang compiled this and many other variants in a branch-free manner, but it is still interesting because you still have the same structure of a test instruction and a conditional move (taking the place of the branch).

3 Like the C++11 std::list.

4 As it turns out, on x86, the per-node work is actually very similar between the two approaches because of the way that dec implicitly sets the zero flag, so we don't need an extra test instruction, while the mov used in pointer chasing doesn't set any flags; so the counter approach has an extra dec while the sentinel approach has an extra test, making it about a wash.

5 Although this part is just because gcc didn't manage to transform the incrementing for-loop to a decrementing one to take advantage of dec setting the zero flag, avoiding the cmp. Maybe newer gcc versions do better. See also footnote 4.

6 I guess this is closer to 0.9 than to 1.0 since perhaps the branch predictors still get the length = 10 case correct, since once you've looped 9 times the next iteration will always exit. A less synthetic/exact distribution wouldn't exhibit that.

7 I say mostly because in some cases you might save a cycle or two via such source or assembly-level re-orderings, because such things can have a minor effect on the execution order in out-of-order processors: execution order is also affected by assembly order, but only within the constraints of the data-flow graph. See also this comment.

BeeOnRope
  • Did gcc intentionally place `add edx,0x1` in `sum_counter` at that location? I mean, does it try to place the condition of the branch far from the branch? The body of the loop of `sum_counter` is small, the processor might decode all of its instructions together, so it may make a prediction before executing `add edx,0x1`. How do we know that `sum_counter` is faster than the other function because the condition is calculated early and not because the condition is much faster to calculate? The branch condition in `sum_sentinel` is dependent on a memory access. – Hadi Brais Apr 24 '18 at 15:43
  • What do you mean by "Let's add branch prediction to the mix."? What does the code look like? – Hadi Brais Apr 24 '18 at 15:44
  • @haidi - sorry for the confusion, the code under test didn't change from the above examples, only the lengths of the linked lists did. I'm on the road now but I'll update it later. In the meantime you can see all the code in [this commit](https://github.com/travisdowns/uarch-bench/commit/62e47379e29d64add1d481e645c918772fd7df38). – BeeOnRope Apr 24 '18 at 16:53
  • About your first comment, it's not the test/flag setting instruction like `add` that matters here but the location of the conditional jump like `jne`. The compiler really has very little flexibility in placing that since it needs to go exactly before the code that it will jump over. Once the jump position is set the compiler wants to put the flag setting instruction immediately before to take advantage of macrofusion. – BeeOnRope Apr 24 '18 at 16:56
  • But basically the example shows that if the condition of the branch was *changed* so that it's faster to evaluate (like in `sum_counter`), then the branch result can be resolved faster and the penalty of (nested) branch misprediction is reduced. However, that doesn't show that calculating the *same* condition earlier benefits performance (as shown in the OP's question). It might be hard to give such an example at the source code level since the compiler's instruction scheduler might automatically do that. I think that would particularly be beneficial for OoO pipelines... – Hadi Brais Apr 24 '18 at 18:01
  • ...Yes, the instruction that calculates the flag `cmp edi,edx` itself should be placed before the branch to enable macrofusion. But I'm talking about the instructions that it depends on like `add edx,0x1`. – Hadi Brais Apr 24 '18 at 18:01
  • @HadiBrais - yes, the way the condition was calculated changed. That's kind of the point: you need to affect the _data flow graph_ and that means a change in the source, since re-ordering independent lines (or assembly) doesn't affect the data flow graph. However, I disagree that I changed it to make the calculation _faster_, at least as most people would understand that term: the `sum_counter` variant has _more_ instructions, more total uops, etc. What changed is the position of the branch in the data flow graph: it has moved up (i.e., closer to the root node). – BeeOnRope Apr 24 '18 at 21:42
  • Perhaps that's what you mean by _faster_, but I think a better word is _earlier_. The main point here is that by doing more total work to calculate the branch condition in a different way, you can speed up execution by evaluating the branches earlier. Without that effect, adding work wouldn't really make sense. Note this isn't a typical "dependency chain limited" analysis: both algorithms have the same key loop-carried dependency chain: the difference is in how the branch "hangs off" that dependency chain or doesn't. – BeeOnRope Apr 24 '18 at 21:44
  • I think I was clear enough about that: the specific example the OP gave doesn't work (for many reasons beyond the fact that it doesn't affect the data flow), but the general idea of calculating the branch _earlier_ (in the data flow graph) does work. – BeeOnRope Apr 24 '18 at 21:45
  • At the assembly and probably at the C/C++ level you could probably come up with an example where moving jump instructions alone helps: the order instructions appear in the instruction stream also has an effect on execution order: the order is constrained by the data flow graph, but within those constraints, instructions that appear earlier are likely to be executed earlier (both because CPU schedulers favor them when they have several options, and because they become available earlier for scheduling _at all_). This effect is fairly mild, however - a cycle here or there. – BeeOnRope Apr 24 '18 at 21:48
  • In particular, it isn't even clear that moving branches "up" in this sense (in the assembly listing) is necessarily helpful: it moves up the point in "cycle time" at which the branch misprediction is encountered, but whether that is better or worse than later depends on the instructions in flight: for example, if the code is transitioning from front-end to back-end or dep-chain limited, it is probably better to take the mis-predict later, since it will only make the front-end bottleneck worse. OTOH moving the branch up in the data flow graph is almost always good. – BeeOnRope Apr 24 '18 at 21:51
  • This is one of the most interesting answers I have ever seen on SO. – Nemo Nov 02 '19 at 22:30

Out-of-order execution is definitely a thing (not just compilers; even the processor chips themselves can reorder instructions), but it helps more with pipeline stalls caused by data dependencies than with those caused by mispredictions.

The benefit in control flow scenarios is somewhat limited by the fact that on most architectures, the conditional branch instructions make their decision only based on the flags register, not based on a general-purpose register. It's hard to set up the flags register far in advance unless the intervening "work" is very unusual, because most instructions change the flags register (on most architectures).

Perhaps identifying the combination of

TST (reg)
J(condition)

could be designed to minimize the stall when (reg) is set far enough in advance. This of course requires a large degree of help from the processor, not just the compiler. And the processor designers are likely to optimize for a more general case of early (out of order) execution of the instruction which sets the flags for the branch, with the resulting flags forwarded through the pipeline, ending the stall early.

Ben Voigt
  • Yes, but you can do *most* of the work for a branch ahead of time, leaving only the final `cmp/jcc` (which on modern x86 macro-fuses into a single compare-and-branch uop, so it *does* in fact branch on a register compare directly, as well as producing the flag output.) Actual execution of branch instructions (to check the prediction result) without macro-fusion isn't special; it has a normal data dependency on flags just like `setcc` or add-with-carry. Your description of flags being "forwarded through the pipeline" makes it sound like it's handled specially, but it actually isn't. – Peter Cordes Apr 24 '18 at 22:43
  • @PeterCordes: But what OP was suggesting was putting `cmp` earlier... which would result in the wrong flags visible to the jump. He can put `sub` to do the comparison early, with `tst`+`j(cc)` together, but as you said the OOO execution engine already recognizes `cmp`+`j(cc)` so trying to perform the comparison in advance is pointless. – Ben Voigt Apr 24 '18 at 23:19
  • The OP was talking about reordering the C source in a way that doesn't change the semantics. You're right that doing an early `cmp` wouldn't be a valid implementation in asm in most cases, and doing extra work to compare into a register (cmp/setcc to prepare for a `test/jnz` later) wouldn't make sense. Anyway yes, `a – Peter Cordes Apr 24 '18 at 23:28
  • But the key thing that's wrong with your last paragraph is that `jcc` or fused `cmp/jcc` are both scheduled just like any other instruction, normally in oldest-ready-first order. Branch uops don't get prioritized for early execution, so they only get executed when their inputs are ready (flags or registers) and there's a spare execution port. (Haswell runs predicted-taken branches only on port6, or predicted-not-taken branches on p0 or p6). If there's a *lot* of earlier independent instructions, the `jcc` might not execute early even if its inputs were ready early. (Unlike @Bee's low-ILP) – Peter Cordes Apr 24 '18 at 23:33
  • Also, ARM in ARM-mode can easily avoid flag-setting, it's a per-instruction choice like on SPARC `addcc` vs. `add`. ARM Thumb mode makes `adds` (add and set flags) better than `add`, though. MIPS doesn't even have flags, and you do compare into a register for more complex conditions. But yeah, on x86 it's not worth trying to avoid flag-setting for long (although putting `cmp` a couple instructions ahead of `jcc` was a useful optimization on in-order Pentium). Some other RISCs also have flags that are set by most instructions, like x86, I think. – Peter Cordes Apr 24 '18 at 23:50
  • As usual, PowerPC is complex: its CR (condition register) is divided into 8 4-bit fields, and you [can compare into any of them and branch or cmov on any of them](https://developer.apple.com/library/content/documentation/DeveloperTools/Reference/Assembler/050-PowerPC_Addressing_Modes_and_Assembler_Instructions/ppc_instructions.html), so you can have multiple flag dependency chains in flight at once even without out-of-order exec. It's mostly just CISC like x86 and m68k that make it hard to set flags early. – Peter Cordes Apr 24 '18 at 23:53

The main problem with branch misprediction is not the few cycles it incurs as penalty while flushing younger operations (which is relatively fast), but the fact that it may occur very late along the pipe if there are data dependencies the branch condition has to resolve first.

With branches based on prior calculations, the dependency works just like with other operations. Additionally, the branch passes through prediction very early along the pipe so that the machine can go on fetching and allocating further operations. If the prediction was incorrect (which is more often the case with data-dependent branches, unlike loop controls that usually exhibit more predictable patterns), then the flush would occur only when the dependency was resolved and the prediction was proven to be wrong. The later that happens, the bigger the penalty.

Since out-of-order execution schedules operations as soon as their dependencies are resolved (assuming no port stress), moving the operation ahead is probably not going to help, as it does not change the dependency chain and would not affect the scheduling time too much. The only potential benefit is if you move it far enough up that the OOO window can see it much earlier, but modern CPUs usually run hundreds of instructions ahead, and hoisting instructions that far without breaking the program is hard. If you're running some loop, though, it might be simple to compute the conditions of future iterations ahead of time, if possible.
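
For example, here is a hypothetical sketch of that last idea, computing the next iteration's condition one step ahead so its inputs are ready well before the branch needs them (whether this actually wins anything depends on the surrounding dependency structure; an out-of-order core may already do effectively the same thing):

#include <cstddef>

// 'expensive' stands in for a long-latency loop body (hypothetical).
long expensive(int v) { return static_cast<long>(v) * v; }

long process(const int* data, std::size_t n) {
    long total = 0;
    if (n == 0) return total;
    bool take = data[0] > 0;                            // condition for iteration 0
    for (std::size_t i = 0; i < n; i++) {
        // Compute iteration i+1's condition now, before the heavy work,
        // so the compare is not waiting behind it when the branch is reached.
        bool take_next = (i + 1 < n) && (data[i + 1] > 0);
        if (take) {
            total += expensive(data[i]);
        }
        take = take_next;
    }
    return total;
}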

None of this is going to change the prediction process which is completely orthogonal, but once the branch reaches the OOO part of the machine, it will get resolved immediately, clear if needed, and incur minimal penalty.

Leeor
  • OoO exec does typically run instructions in oldest-ready-first order, so putting the critical path instructions early might matter for avoiding resource conflicts. (Multiple instructions ready, not enough execution units available to run them all). Execution after a cache miss or other back-end stall tends to be somewhat bursty. It's plausible there could be cases where there's something to gain by putting the critical path instructions ahead of other independent work. But still +1, in general OoO exec makes this close to a non-issue. – Peter Cordes Jun 08 '18 at 08:02