
Let's say my code is the following. It's a silly nonsense example, but the point is that there are at least 2 cycles of work before reaching the branch. Maybe more, since the multiply depends on previous values.

Is there any chance of this taking the correct path 100% of the time? Is the instruction a different opcode when you know the path ahead of time? Are there instructions related to prefetching I need to use if I want to tell it which path it will be taking?

int test(int v, int*p) {
    for(;;)
    {
        if (v<100)
        {
            auto nextPath = v>50;
            sum+=p[0]
            sum*=p[1]
            memberVar++
            len+=10;
            sum-=p[3]
            sum*=p[4]
            memberVar++
            len+=5
            if (nextPath)
                break;
            else
                v=p[5]
        }
    }
}
Peter Cordes
Eric Stotch
  • show the disassembly and the branch you are referring to – old_timer Nov 20 '20 at 19:25
  • branch prediction comes in more than one form, and I don't know what Intel supports. One form is what you might think of when hearing the term: the logic looks early in the pipe for a branch while examining other stages of the pipe to determine whether the condition has been resolved, and gets a fetch started a few cycles early. (The not-taken path will normally still fetch as well, and the pipe/core uses whichever.) But another form is a cache: for each of the last n branches, the destination address is cached, and the next time the logic simply fetches that, which helps loops. – old_timer Nov 20 '20 at 19:35
  • but you have not shown the branches in question, because this is high-level code. Pseudo-code. – old_timer Nov 20 '20 at 19:36

1 Answer


Pentium-4 has branch hints (using segment prefixes), but they tended not to be useful in practice, so later CPUs silently ignore them. (Intel x86 0x2E/0x3E Prefix Branch Prediction actually used?) This was x86's only experiment with hardware branch hints. (Some other ISAs have also had them, and may still support them.)

See also Is it possible to tell the branch predictor how likely it is to follow the branch?, including my answer on that question. You can hint the compiler about how to lay out the if bodies, e.g. so the common case is fall-through instead of taken, which does help. That's what __builtin_expect is for, and the C++20 [[likely]] and [[unlikely]] attributes.

But those compiler hints do not interact with CPU hardware branch prediction. (Except in the sense that they may convince the compiler to use branchy code instead of branchless, so there is a branch that needs predicting at all, instead of a cmov or something for cases where branchless was possible.)
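For example (hypothetical code, not from the question), a small two-armed conditional is exactly the case where the compiler might go branchless, and an expect-style hint can tip it back toward a real branch:

```cpp
// Both arms are trivial, so optimizing compilers often emit branchless
// code (cmov / max) for this, leaving nothing for the predictor to predict:
int clamp_nonneg(int x) {
    return x < 0 ? 0 : x;
}

// Claiming the negative case is very rare can make a conditional branch
// look profitable instead. Whether the compiler actually switches depends
// on the target and optimization level:
int clamp_nonneg_hinted(int x) {
    if (__builtin_expect(x < 0, 0))   // GCC/Clang builtin
        return 0;
    return x;
}
```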


If you have a special iteration with known branching, you can peel it and optimize it specially before entering the actual loop.

Or if your branch is always taken for the first part of a loop, then always not-taken, you can split your loop into two halves: one handling the always-true case, the other optimized for the always-false case.

Or if the branch direction is totally loop-invariant, then only one of the loop versions ever needs to run. That's kind of a loop-versioning optimization. Optimizing a branch inside a loop into 2 separate loops costs code-size, but can be a win if it runs frequently, especially if you can optimize the now-unconditional code into the surrounding code.
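A sketch of that loop-versioning idea (hypothetical names, assuming the condition really is loop-invariant):

```cpp
#include <vector>

// Before: the invariant `scale` condition is tested every iteration.
long sum_maybe_scaled(const std::vector<int>& v, bool scale) {
    long total = 0;
    for (int x : v) {
        if (scale) total += 2L * x;
        else       total += x;
    }
    return total;
}

// After versioning: test once, then run a loop with no internal branch.
// Costs code size (two loop bodies), as noted above.
long sum_maybe_scaled_versioned(const std::vector<int>& v, bool scale) {
    long total = 0;
    if (scale) {
        for (int x : v) total += 2L * x;
    } else {
        for (int x : v) total += x;
    }
    return total;
}
```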


Your case appears to be totally loop-invariant after the first iteration, so peel that first iteration to use the initial v (from the function arg), then later iterations can infinite-loop or do a single iteration. Not really much to gain here; hoisting the p[5] load and using a compare-and-branch on it at the bottom of the loop would appear to do the trick.

Your p isn't modified in the loop, and p[5] is read-only. So after the first iteration, v = p[5] is loop-invariant. (Assuming this is C-like pseudocode, given the missing semicolons, and thus that data-race undefined behaviour can be assumed not to happen, i.e. no async modifications to p.)

len, sum, and memberVar aren't declared, but unless one of them is a C++ reference to the same int as p[5], there's no possible aliasing.
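Putting that together, a peeled version of the question's loop might look like this (semicolons restored; sum, len, and memberVar declared as globals purely so the sketch compiles, since in the real code they're presumably members):

```cpp
// Globals stand in for the question's undeclared variables.
int sum = 0, len = 0, memberVar = 0;

// The loop body from the question, semicolons restored.
static void body(const int* p) {
    sum += p[0];
    sum *= p[1];
    memberVar++;
    len += 10;
    sum -= p[3];
    sum *= p[4];
    memberVar++;
    len += 5;
}

// Peeled: run the first iteration with the caller's v, then use the
// hoisted, loop-invariant p[5] to pick one of three fixed behaviours.
void test_peeled(int v, const int* p) {
    if (v >= 100) { for (;;) {} }     // original spins forever here too
    bool firstNext = v > 50;
    body(p);                          // peeled first iteration
    if (firstNext) return;
    int w = p[5];                     // loop-invariant from here on
    if (w >= 100)    { for (;;) {} }        // spin, as the original would
    else if (w > 50) { body(p); }           // exactly one more iteration
    else             { for (;;) body(p); }  // body repeats forever
}
```

The infinite branches are kept only to stay faithful to the (admittedly nonsense) original; on the terminating inputs, the observable effects match the original loop.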

Peter Cordes
  • Hmm... The 2E/3E prefix does in fact seem useless, because I'd need to give it a bool; otherwise I could do what you said and generate an optimized path. This current code is already very fast; I just feel like optimizing while the code is still in my head. I think I tried likely/unlikely and got a compile error. __builtin_expect with a bool gave me the code generation I wanted in a few spots, which was nice. The switch statement I'm optimizing usually has a `&1` at the start of each case, but only a few cases use the result in an if. Those were the ones I was wondering whether I can further optimize – Eric Stotch Nov 20 '20 at 21:09
  • Something that's always puzzled me slightly: on a machine that predicts branches, why does it help to have the compiler arrange the code so that "not taken" is the common case? If the CPU incorrectly predicts the branch and thinks that it *will* be taken, won't the performance be just as bad as if it were the other way around? – Nate Eldredge Nov 20 '20 at 21:17
  • @NateEldredge: fetch happens in contiguous blocks of machine code. Or in contiguous uop-cache entries. Taken branches result in shorter useful fetches, unless they happen to be right at the end of a fetch-group anyway. If you're close to a front-end bottleneck, the bubble / short-fetch from a taken branch can hurt. – Peter Cordes Nov 20 '20 at 21:18
  • Ah, I see. Thanks! – Nate Eldredge Nov 20 '20 at 21:20
  • @NateEldredge: Also, if a branch is *never* taken, on some CPUs it could just get ignored by the branch predictors and never take up a predictor entry. (If the CPU statically predicts not-taken for "unknown" branches where there's no dynamic prediction.) Less pollution = more room for other branches. – Peter Cordes Nov 20 '20 at 21:20
  • @NateEldredge: Also, on Haswell and later, predicted-not-taken branch uops can run on port 0 or 6 to verify that prediction. Predicted-taken uops can only run on port 6. So back-end throughput is 2 branches per clock, with up to one of them being a *taken* branch. (Front and back end are decoupled by buffers and OoO exec, and it's not like you'd always need to sustain that throughput by running nothing but a stupid tiny loop with a taken branch + loop branch. But the front-end probably can't sustain 2 taken branches per clock from the uop cache either. Maybe from the loop buffer (LSD)) – Peter Cordes Nov 20 '20 at 21:22