I'm targeting Skylake hardware and compiling with Clang.
Say we have some code structured like this:
    #include <array>

    int always_runs(int i, int acc);    // ~30 cycles
    int sometimes_runs(int i, int acc); // ~20 cycles

    int foo(const std::array<bool, 1000000>& should_branch) {
        int acc = 0;
        for (int i = 0; i < should_branch.size(); i++) {
            acc += always_runs(i, acc); // takes in acc, loop-carried dependency
            // unpredictable branch based on constant data
            if (should_branch[i]) acc += sometimes_runs(i, acc); // takes in acc, loop-carried dependency
        }
        return acc;
    }
The data in should_branch follows no particular pattern, and the branch predictor has a bad time with it.
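For concreteness, data like this can be generated as follows (a sketch only: fill_unpredictable is a placeholder name, not how my real input is produced; the point is just that the taken/not-taken sequence is effectively random):

    #include <array>
    #include <random>

    // Illustrative only: fill should_branch with ~50% taken / not taken,
    // which is the worst case for the branch predictor.
    void fill_unpredictable(std::array<bool, 1000000>& should_branch) {
        std::mt19937 rng(42);                  // fixed seed for repeatable benchmarks
        std::bernoulli_distribution coin(0.5); // no exploitable pattern
        for (bool& b : should_branch) b = coin(rng);
    }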
One option is to remove the branch and replace it with a conditional move like this:
    acc += should_branch[i] ? sometimes_runs(i, acc) : 0;
That removes the branch mispredictions, but it means unnecessary work is done: sometimes_runs is evaluated even when its result is thrown away. This is the fastest solution I've found so far, and it's what I currently have implemented.
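Written out explicitly, that version looks roughly like this (a sketch reusing the declarations above; foo_branchless is a placeholder name, it is only equivalent to the one-liner if sometimes_runs has no side effects, and whether Clang really emits a cmov here is worth confirming in the generated assembly):

    int foo_branchless(const std::array<bool, 1000000>& should_branch) {
        int acc = 0;
        for (int i = 0; i < should_branch.size(); i++) {
            acc += always_runs(i, acc);
            // Evaluate unconditionally, then select the result; the select
            // becomes a conditional move, so there is no branch to mispredict.
            int extra = sometimes_runs(i, acc);
            acc += should_branch[i] ? extra : 0;
        }
        return acc;
    }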
Is there any way to exploit the fact that should_branch is known and constant, so that the predictor can somehow look ahead and predict correctly based on the data? From a human perspective this is a trivial prediction.
One option that I've looked at is manually unrolling the prediction, so that I have one path for a run of {true, true}, one for {true, false}, one for {false, true}, one for {false, false}, and so on; a sketch of what I mean follows below. This was not helpful: the mispredictions caused by branching between the static paths still result in slower code than simply eliminating the branch as above.
Is there anything else that can be done to optimise code following this pattern? Changing the algorithm to avoid the problem is not an option.