About reducing the branch miss prediciton

Question

I saw a sentence in a paper "Transforming branches into data dependencies to avoid mispredicted branches." (Page 6)

I wonder how to change the code from branches into data dependencies.

This is the paper: http://www.adms-conf.org/p1-SCHLEGEL.pdf

Update: How to transform branches in the binary search?

(While I commend the nomination of some heads of government for "Miss Management", it is _mis-prediction_ (even for titles).) — greybeard, Nov 04 '16 at 07:12

Jerry Coffin · Accepted Answer · 2021-02-10T21:03:05.297

The basic idea (I would presume) would be to change something like:

if (a>b)
    return "A is greater than B";
else
    return "A is less than or equal to B";

into:

static char const *strings[] = {
    "A is less than or equal to B", 
    "A is greater than B"
};

return strings[a>b];

For branches in a binary search, let's consider the basic idea of the "normal" binary search, which typically looks (at least vaguely) like this:

while (left < right) {
    middle = (left + right)/2;
    if (search_for < array[middle])
        right = middle;
    else
        left = middle;
}

We could get rid of most of the branching here in pretty much the same way:

size_t positions[] = {left, right};

while (left < right) {
    size_t middle = (left + right)/2;

    positions[search_for >= array[middle]] = middle;
}

[For general purpose code use left + (right-left)/2 instead of (left+right)/2.]

We do still have the branching for the loop itself, of course, but we're generally a lot less concerned about that--that branch is extremely amenable to prediction, so even if we did eliminate it, doing so would gain little as a rule.

score 1 · Answer 2 · answered Nov 03 '16 at 17:58

Removing branches is not always optimal, even (especially) with simple binary conditions like this. I have looked into removing branches in a similar manner in various circumstances before. After running into an instance where code with a conditional branch in a loop ran faster than equivalent, branch-less code, I did some research on processor execution strategies.

I knew that the ARM instruction set had conditional instructions, which could make a conditional branch faster than the kind of branch-less code in the paper, but I was working on an intel (and the branch was not one that cmove could take care of). It turns out that modern CPUs, including intel, will sometimes turn an ordinary instruction into a conditional one: they use eager execution if the two end points of the branch are sufficiently short. That is, the CPU will put both possible paths in the pipeline and execute them both and only keep the correct result once the condition is known. This avoids the possibility of a mis-prediction without having to index an array. See https://en.wikipedia.org/wiki/Speculative_execution#Variants for more information.

It takes a lot of detailed knowledge of a processor to write optimal code for it, even in assembly. This makes it possible for "naive" implementations to sometimes be faster than hand-optimized ones that overlook certain CPU characteristics. The same thing can happen with optimizing compilers: sometimes writing more complicated "optimized" code breaks some of the compiler's optimizations and results in a slower executable than a naive implementation that the compiler can optimize fully.

When in doubt and performance is critical and there is time, it is usually best to try it both ways and see which is faster!

About reducing the branch miss prediciton

2 Answers2

Linked