
I'm targeting Skylake hardware and compiling using Clang.

Say we have some code structured like this:

#include <array>

int always_runs(int i, int acc); // ~30 cycles
int sometimes_runs(int i, int acc); // ~20 cycles

int foo(const std::array<bool, 1000000>& should_branch) {
    int acc = 0;

    for (int i = 0; i < (int)should_branch.size(); i++) {
        acc += always_runs(i, acc); // takes in acc, loop carried dependency
        // unpredictable branch based on constant data
        if (should_branch[i]) acc += sometimes_runs(i, acc); // takes in acc, loop carried dependency
    }

    return acc;
}

The data in should_branch follows no particular pattern, and the branch predictor has a bad time with it.

One option is to remove the branch and replace it with a conditional move like this:

acc += should_branch[i] ? sometimes_runs(i, acc) : 0;

That removes the branch misprediction, but means that unnecessary work is being done. This is the fastest solution that I've found so far, and what I currently have implemented.

Is there any way to exploit the fact that should_branch is known and constant, so that the predictor can somehow look ahead and predict correctly based on the data? From a human perspective this is a trivial prediction.

One option that I've looked at is manually unrolling the prediction, so that I have one path for a run of {true, true}, one for {true, false}, one for {false, true}, one for {false, false}, etc. This was not helpful: the mispredictions caused by branching between the static paths still result in slower code than simply eliminating the branch as above.

Is there anything else that can be done to optimise code following this pattern? Changing the algorithm to avoid the problem is not an option.

Sam
  • Have you measured your "One option"? There's some amount (unknown to us) of _unnecessary work_, but real-world projects have improved their speed by choosing unnecessary work over a conditional branch. Also, consider `acc += should_branch[i] * extra;` since the ternary operator can compile to a conditional branch. – Drew Dormann Mar 12 '21 at 19:01
  • @DrewDormann The "one option" is what I have currently implemented. It's the fastest option that I've found so far, but I'm interested in improving on it. I've checked the assembly and the ternary is compiling to a conditional move as expected. I'm monitoring the assembly carefully for this whole tight loop. – Sam Mar 12 '21 at 19:06
  • I believe that the definitive answer to this question is "No. There is no way to hint prediction based on data on Skylake hardware." I'm asking because I've not found a definitive source saying that it's impossible - I've only found a lack of people discussing how to do it. – Sam Mar 12 '21 at 19:08
  • I've also measured performance with perfect prediction by artificially setting `should_branch` to perfectly alternate, which the predictor handles fine. Perfect prediction would result in an approximately 20% speedup compared to the branchless conditional move. – Sam Mar 12 '21 at 19:13
  • How much work is done in `sometimes_runs()`, and does it have side effects? You might be able to not have any branch at all - e.g. `acc += always_runs(i, acc) + should_branch[i] * sometimes_runs(i, acc);`. – Brendan Mar 12 '21 at 19:30
  • @Brendan I discussed that in the question - see the second code sample. That's the option that I currently have implemented, and what I'm hoping to improve upon. – Sam Mar 12 '21 at 19:33
  • *so that the predictor can somehow look ahead and predict correctly based on the data* - by the time data is available, prediction isn't needed. At that point the CPU is just *checking* the earlier prediction. Prediction is needed very early in the pipeline, before the instructions are even decoded, let alone have their operands fetched. What can help is having the branching be able to *exec* early, to check the prediction while other work is still in the pipeline. [Avoid stalling pipeline by calculating conditional early](https://stackoverflow.com/q/49932119). – Peter Cordes Mar 13 '21 at 05:20
  • Perhaps peel the first couple iterations and do them branchlessly, then in the loop load the condition for the *next* iteration, then check *this* iteration's condition and do the work or not. So that maybe hides L1d load-use latency in recovering from a mispredict? And/or peeling the first couple iterations branchlessly puts some work in the pipeline for the CPU to be chewing on while execution of the later branches hopefully manages to run early, although uops are scheduled oldest-ready-first so it will have to compete with others for port 6 (or port 1 for predicted-not-taken) – Peter Cordes Mar 13 '21 at 05:24
  • @PeterCordes What you and the answers on that linked question say matches my initial intuition, which was that the lack of a data dependency for the loop counter means that it should be able to predict ahead. Measurements persuaded me otherwise, although I'm now having doubts about something else explaining that. I won't be able to measure your suggestions until later, but could you explain how manually prefetching would help? I would have thought that the CPU would automatically be doing that as soon as its dependency resolved, and its dependency is only the loop counter. – Sam Mar 13 '21 at 08:35
  • Everything after a mispredict has to get thrown away, even if it would also happen on the correct path. (It would be too much work for the CPU to figure out what it can/can't keep from the wrong path.) So putting the load *before* a branch means that's not part of the latency from (a) resuming instruction-issue after a mispredict to (b) executing the branch uop to confirm correct prediction or detect misprediction and start recovering. – Peter Cordes Mar 13 '21 at 09:09
  • You'd hope the "always runs" part could overlap with checking `should_branch[i]`, but it may be that overlapping execution between iterations is actually important to keep the back-end busy (requiring correct predictions). And you still lose some cycles of front-end throughput on a branch miss, as well as having wasted any work beyond the branch. Also, the always_runs part may have some uops for ports 0 or 6 that get executed before the branch uop; branches don't have priority. Hyperthreading could be good here; it's a good way to keep the CPU doing something useful during branch recovery. – Peter Cordes Mar 13 '21 at 09:12

0 Answers