197

Specifically, if I have a series of if...else if statements, and I somehow know beforehand the relative probability that each statement will evaluate to true, how much difference in execution time does it make to sort them in order of probability? For example, should I prefer this:

if (highly_likely)
  //do something
else if (somewhat_likely)
  //do something
else if (unlikely)
  //do something

to this?:

if (unlikely)
  //do something
else if (somewhat_likely)
  //do something
else if (highly_likely)
  //do something

It seems obvious that the sorted version would be faster, however for readability or the existence of side-effects, we might want to order them non-optimally. It's also hard to tell how well the CPU will do with branch prediction until you actually run the code.

So, in the course of experimenting with this, I ended up answering my own question for a specific case, however I'd like to hear other opinions/insights as well.

Important: this question assumes that the if statements can be arbitrarily reordered without having any other effects on the behavior of the program. In my answer, the three conditional tests are mutually exclusive and produce no side effects. Certainly, if the statements must be evaluated in a certain order to achieve some desired behavior, then the issue of efficiency is moot.

hippietrail
Carlton
  • 37
    you might want to add a note that the conditions are mutually exclusive, otherwise the two versions are not equivalent – 463035818_is_not_an_ai Oct 19 '17 at 15:26
  • 28
    It's pretty interesting how a self answered question got 20+ upvotes with rather poor answer, in an hour. Not calling anything on OP but upvoters should beware of jumping on band wagon. The question might be interesting, but results are doubtful. – luk32 Oct 19 '17 at 16:46
  • 3
    I believe this can be described as a form of [short-circuit evaluation](https://en.wikipedia.org/wiki/Short-circuit_evaluation) because hitting one comparison denies hitting a different comparison. I personally favor an implementation like this when one fast comparison, let's say boolean, can prevent me from going into a different comparison which might involve a resource-heavy string manipulation, regex, or database interaction. – MonkeyZeus Oct 19 '17 at 17:16
  • 1
    I changed the title of the question slightly because I'm getting a lot of answers and comments to the effect of "write for readability, not speed". The question was meant to ask how much of a difference if-statement order makes, not whether or not we should compromise readability for it. The code that prompted this question solves engineering equations and runs for several minutes or hours at a time. Speeding up a critical if-statement just by rearranging it is very low-hanging fruit - worthy of some loss of readability. – Carlton Oct 19 '17 at 17:18
  • 11
    Some compilers offer the ability to gather statistics on branches taken and feed these back into the compiler to allow it to do better optimizations. –  Oct 19 '17 at 17:55
  • 11
    If performance like this matter to you, you should probably try Profile Guided Optimization and compare your manual result with the compiler's result – Justin Oct 19 '17 at 18:51
  • https://cs.stackexchange.com/q/35502/755, https://cs.stackexchange.com/q/66894/755 – D.W. Oct 20 '17 at 19:24
  • 1
    I've seen a lot of questions like this, but this is the only one where every answer isn't 'Don't micro-optimize.' I wonder what makes this one different... – kingfrito_5005 Oct 20 '17 at 22:03
  • 1
    @kingfrito_5005 I'd say it's probably because even the benchmark by the OP shows that there's no reliable answer. Whether sorted `if`s is actually a performance improvement is very situational and target dependent, so there may be no good rule-of-thumb. Furthermore, other easier optimization techniques are likely to have a greater effect (such as PGO). I don't think anyone is saying you shouldn't ever reorder `if`s to find the optimal ordering, but that you should profile the different arrangements if that little extra bit of performance actually matters to you. – Justin Oct 20 '17 at 23:51
  • Here and in your answer you refer to the alternative as "unsorted". Both options are sorted, one desc and the other asc. – Luke Sawczak Oct 21 '17 at 04:01
  • each `if()` statement takes some finite amount of time to evaluate. So the most probable `if()` should be placed first. However, given the speed of today's CPUs, the effect on overall execution time is minimal, i.e. not worth doing unless that sequence of statements is going to be executed thousands or millions of times. It would also be worth doing when performing some (severely) time-constrained function, like deciding which chess move to make next. However, that was before 'caching' CPUs, so which is quicker can now go either way, depending on what is currently in the cache – user3629249 Oct 24 '17 at 22:09
  • You do not say if you expect the **evaluation** of the conditions (as opposed to subsequent branching) to have a significant cost, though obviously (as per at least one answer) is highly relevant. If they are all equally and highly expensive, frequency ordering seems desirable. If one is much cheaper, do that first unless extremely unlikely. **Do the calculations and the benchmarks.** (_I see now that_ @psmears _has added this as a comment to [@Yakk’s answer](https://stackoverflow.com/a/46835888/4847772)._) – PJTraill Oct 25 '17 at 16:29
  • 1
    Keep in mind that the assumption that the compiler will _even care_ about the order you put your checks in may not be correct. Of course if the evaluation of the conditional has a side effect or the conditions aren't disjoint it can't change the order, but then the different orders aren't semantically equivalent anyways. If the conditions are disjoint and side-effect free, the compiler is perfectly free to reorder them, as [clang does](https://godbolt.org/g/DJ7j6m) - it always puts the `(p > 0)` check first regardless of source order: it has heuristics that positive values are more likely. – BeeOnRope Nov 04 '17 at 00:28

10 Answers

106

As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them. See [Matt Godbolt's work](https://xania.org/201602/bpu-part-two).

After that, the branch goes into a branch prediction cache, and past behavior is used to inform future branch prediction.

So in a tight loop, the effect of misordering is going to be relatively small. The branch predictor is going to learn which set of branches is most likely, and if you have a non-trivial amount of work in the loop the small differences won't add up to much.

In general code, most compilers by default (lacking another reason) will order the produced machine code roughly the way you ordered it in your source. Thus `if` statements are forward branches when they fail.

So you should order your branches in the order of decreasing likelihood to get the best branch prediction from a "first encounter".

A microbenchmark that loops tightly many times over a set of conditions and does trivial work is going to be dominated by tiny effects of instruction count and the like, with little in the way of relative branch prediction issues. So in this case you must profile, as rules of thumb won't be reliable.

On top of that, vectorization and many other optimizations apply to tiny tight loops.

So in general code, put the most likely code within the `if` block, and that will result in the fewest un-cached branch prediction misses. In tight loops, follow the general rule to start, and if you need to know more you have little choice but to profile.

Naturally this all goes out the window if some tests are far cheaper than others.

Yakk - Adam Nevraumont
  • 23
    It's also worth considering how expensive the tests themselves are: if one test is only slightly more likely, but a *lot* more expensive, then it may be worth putting the other test first, because the savings from not making the expensive test will likely outweigh the savings from branch prediction etc. – psmears Oct 20 '17 at 16:38
  • The [link](https://xania.org/201602/bpu-part-two) you provided doesn't support your conclusion _As a general rule, most if not all Intel CPUs assume forward branches are not taken the first time they see them_. In fact that is only true for the relatively obscure Arrendale CPU whose results are shown first. The mainstream Ivy Bridge and Haswell results don't support that at all. Haswell looks very close to "always predict fall through" for unseen branches, and Ivy Bridge is not clear at all. – BeeOnRope Nov 04 '17 at 00:13
  • It is generally understood that CPUs are not really using static predictions as they did in the past. Indeed modern Intel is probably using something like a probabilistic TAGE predictor. You just hash branch history into various history tables and take one that matches with the longest history. It uses a "tag" to try to avoid aliasing, but the tag only has a few bits. If you miss at all history lengths, some default prediction is probably made which doesn't necessarily depend on branch direction (on Haswell we can say it clearly doesn't). – BeeOnRope Nov 04 '17 at 00:17
44

I made up the following test to time the execution of two different if...else if blocks, one sorted in order of probability, the other sorted in reverse order:

#include <chrono>
#include <iostream>
#include <random>
#include <algorithm>
#include <iterator>
#include <functional>
#include <vector>

using namespace std;

int main()
{
    long long sortedTime = 0;
    long long reverseTime = 0;

    for (int n = 0; n != 500; ++n)
    {
        //Generate a vector of 5000 random integers from 1 to 100
        random_device rnd_device;
        mt19937 rnd_engine(rnd_device());
        uniform_int_distribution<int> rnd_dist(1, 100);
        auto gen = std::bind(rnd_dist, rnd_engine);
        vector<int> rand_vec(5000);
        generate(begin(rand_vec), end(rand_vec), gen);

        volatile int nLow, nMid, nHigh;
        chrono::time_point<chrono::high_resolution_clock> start, end;

        //Sort the conditional statements in order of increasing likelihood
        nLow = nMid = nHigh = 0;
        start = chrono::high_resolution_clock::now();
        for (int& i : rand_vec) {
            if (i >= 95) ++nHigh;               //Least likely branch
            else if (i < 20) ++nLow;
            else if (i >= 20 && i < 95) ++nMid; //Most likely branch
        }
        end = chrono::high_resolution_clock::now();
        reverseTime += chrono::duration_cast<chrono::nanoseconds>(end-start).count();

        //Sort the conditional statements in order of decreasing likelihood
        nLow = nMid = nHigh = 0;
        start = chrono::high_resolution_clock::now();
        for (int& i : rand_vec) {
            if (i >= 20 && i < 95) ++nMid;  //Most likely branch
            else if (i < 20) ++nLow;
            else if (i >= 95) ++nHigh;      //Least likely branch
        }
        end = chrono::high_resolution_clock::now();
        sortedTime += chrono::duration_cast<chrono::nanoseconds>(end-start).count();

    }

    cout << "Percentage difference: " << 100 * (double(reverseTime) - double(sortedTime)) / double(sortedTime) << endl << endl;
}

Using MSVC2017 with /O2, the results show that the version sorted in decreasing order of likelihood is consistently about 28% faster than the reverse-sorted version. Per luk32's comment, I also switched the order of the two tests, which makes a noticeable difference (22% vs 28%). The code was run under Windows 7 on an Intel Xeon E5-2697 v2. This is, of course, very problem-specific and should not be interpreted as a conclusive answer.

Tim
Carlton
  • 9
    OP should be careful though, as changing an `if... else if` statement could have a substantial effect on how logic flows through the code. The `unlikely` check may not come up often, but there could be a business need to check for the `unlikely` condition first before checking for others. – Luke T Brooks Oct 19 '17 at 15:20
  • @tobi303 Yes, I specifically used a random vector to thwart branch prediction. I also ran it with a sorted vector which, not surprisingly, resulted in virtually no difference in performance. – Carlton Oct 19 '17 at 15:27
  • 21
    30% faster? You mean it was faster by roughly the % of extra if statements it didn't have to perform? Seems a pretty reasonable result. – UKMonkey Oct 19 '17 at 15:30
  • 5
    How did you benchmark it? Which compiler, cpu, etc.? I am pretty sure this result is not portable. – luk32 Oct 19 '17 at 16:22
  • 2
    Out of curiosity, have you tried reversing the benchmark order? I doubt it'll make a difference, but it's interesting. – luk32 Oct 19 '17 at 16:55
  • 12
    A problem with this microbenchmark is that the CPU is going to work out which of the branches is most likely and cache it when you repeatedly loop over it. If the branches where not examined in a small tight loop, the branch prediction cache might not have them in it, and costs could be much higher if the CPU guesses wrong with zero branch prediction cache guidance. – Yakk - Adam Nevraumont Oct 19 '17 at 17:31
  • 1
    Yep this is at once very pessimistic (we're doing basically no work in the branches so the extra trivial cmp will be highly noticable) and at the same time incredibly optimistic (the branch predictor in any modern x86 CPU is going to make sure that misprediction penalties are identical between both pieces of code). You can basically make that benchmark return anything you want depending on how you tweak two parameters.. – Voo Oct 19 '17 at 18:07
  • 1
    @LukeTBrooks actually I can think of an unlikely scenario that you need to check first. Checking for nulls. It may be unlikely, but you absolutely must check it first or else your program will crash. – Nelson Oct 19 '17 at 18:55
  • 6
    This benchmark isn't too reliable. Compiling with **gcc 6.3.0**: `g++ -O2 -march=native -std=c++14` does give a slight edge to the sorted conditional statements, but most of the time, the percent difference between the two runs was ~5%. Several times, it was actually slower (due to variances). I'm fairly sure that ordering the `if`s like this is not worth worrying over; PGO will probably completely handle any such cases – Justin Oct 19 '17 at 19:04
  • @Nelson That is a great real world example. I think there are plenty of better ways to optimize your code that don't involve moving around your `if` statement checks. – Luke T Brooks Oct 19 '17 at 19:07
  • 3
    Interestingly, once I turned on PGO, the sorted version tended to run slower. The compiler was able to generate smaller assembly for the reverse-sorted version. That's not to say you should write the code in the reverse-sorted order of likeliness, but that if you are thinking about worrying about this, you should first turn on your compiler optimizations (set the target arch, LTO if it applies, and use PGO) – Justin Oct 19 '17 at 19:25
  • I updated the test code to better represent my actual use case. The major difference is that I generate a new random vector every time the comparison test is repeated. The old code reused the same sequence of random numbers for every test. – Carlton Oct 19 '17 at 19:30
  • With clang on MacOS I see a difference of about 17%, but only for a debug build. For a release build, the difference goes down to 1-2%. I think that your compiler just does not recognise that the `(i >= 20 && i < 95)` is always true in the first loop. Remove that unnecessary condition, and the difference should essentially go away. – Carsten S Oct 19 '17 at 23:06
  • I just ran the same test on Ubuntu using GCC 5.4, on a Core i7 processor. The results are quite different from the Windows/MSVC version. Now I'm seeing the reverse-sorted `if` statements running faster than the sorted one (by about 3%). Not sure why there's such a big difference between the two. – Carlton Oct 19 '17 at 23:38
  • 4
    "...that the sorted version..." They're *both* sorted. That statement is ambiguous to the point of confusion. – jpmc26 Oct 20 '17 at 01:03
  • 2
    This is an example why you shouldn't optimize something your compiler is design to do (PGO) by yourself. The optimal solution here is to use a Hu–Tucker shaped branch tree (an order preserving optimal depth tree). In this simple example only 2 compares are necessary (instead of the 4 your code has). Moreover this code could easily be written branchless (and some compilers might actually do that). PGO will do all that for you. – Christoph Diegelmann Oct 20 '17 at 07:34
  • 1
    @Christoph - any decent optimizer will hoist `i < 20` and `i < 95` (or equivalent). So it's likely all down to branch placement. But broad statements such as OP's, based on a single compiler and target architecture are, at least, *unwise*. – Toby Speight Oct 20 '17 at 15:12
  • 1
    @UKMonkey Naturally that's why it's faster; minimizing the number of operations to be performed is the point of doing this optimization. I'd say this is still a useful result, if only because a lot of people seem to assume that compilers will magically find all possible optimizations, and thus there's no point in optimizing manually. This is a solid example of a situation where that's not the case. – Ray Oct 21 '17 at 19:34
  • "28% faster than the unsorted version" - Do you mean "than the reverse-sorted version"? As a previous commenter points out, they are *both* sorted. – AJM Jun 30 '21 at 20:11
31

Just my 5 cents. It seems the effect of ordering if statements should depend on:

  1. Probability of each if statement.

  2. Number of iterations, so the branch predictor could kick in.

  3. Likely/unlikely compiler hints, i.e. code layout.

To explore those factors, I benchmarked the following functions:

ordered_ifs()

for (i = 0; i < data_sz * 1024; i++) {
    if (data[i] < check_point) // highly likely
        s += 3;
    else if (data[i] > check_point) // somewhat likely
        s += 2;
    else if (data[i] == check_point) // very unlikely
        s += 1;
}

reversed_ifs()

for (i = 0; i < data_sz * 1024; i++) {
    if (data[i] == check_point) // very unlikely
        s += 1;
    else if (data[i] > check_point) // somewhat likely
        s += 2;
    else if (data[i] < check_point) // highly likely
        s += 3;
}

ordered_ifs_with_hints()

for (i = 0; i < data_sz * 1024; i++) {
    if (likely(data[i] < check_point)) // highly likely
        s += 3;
    else if (data[i] > check_point) // somewhat likely
        s += 2;
    else if (unlikely(data[i] == check_point)) // very unlikely
        s += 1;
}

reversed_ifs_with_hints()

for (i = 0; i < data_sz * 1024; i++) {
    if (unlikely(data[i] == check_point)) // very unlikely
        s += 1;
    else if (data[i] > check_point) // somewhat likely
        s += 2;
    else if (likely(data[i] < check_point)) // highly likely
        s += 3;
}

data

The data array contains random numbers between 0 and 100:

const int RANGE_MAX = 100;
uint8_t data[DATA_MAX * 1024];

static void data_init(int data_sz)
{
    int i;

    srand(0);
    for (i = 0; i < data_sz * 1024; i++)
        data[i] = rand() % RANGE_MAX;
}

The Results

The following results are for an Intel i5 @ 3.2 GHz and G++ 6.3.0. The first argument is the check_point (i.e. the probability in % of the highly likely if statement), the second argument is data_sz (i.e. the number of iterations).

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/50/4                    4660 ns       4658 ns     150948
ordered_ifs/50/8                   25636 ns      25635 ns      27852
ordered_ifs/75/4                    4326 ns       4325 ns     162613
ordered_ifs/75/8                   18242 ns      18242 ns      37931
ordered_ifs/100/4                   1673 ns       1673 ns     417073
ordered_ifs/100/8                   3381 ns       3381 ns     207612
reversed_ifs/50/4                   5342 ns       5341 ns     126800
reversed_ifs/50/8                  26050 ns      26050 ns      26894
reversed_ifs/75/4                   3616 ns       3616 ns     193130
reversed_ifs/75/8                  15697 ns      15696 ns      44618
reversed_ifs/100/4                  3738 ns       3738 ns     188087
reversed_ifs/100/8                  7476 ns       7476 ns      93752
ordered_ifs_with_hints/50/4         5551 ns       5551 ns     125160
ordered_ifs_with_hints/50/8        23191 ns      23190 ns      30028
ordered_ifs_with_hints/75/4         3165 ns       3165 ns     218492
ordered_ifs_with_hints/75/8        13785 ns      13785 ns      50574
ordered_ifs_with_hints/100/4        1575 ns       1575 ns     437687
ordered_ifs_with_hints/100/8        3130 ns       3130 ns     221205
reversed_ifs_with_hints/50/4        6573 ns       6572 ns     105629
reversed_ifs_with_hints/50/8       27351 ns      27351 ns      25568
reversed_ifs_with_hints/75/4        3537 ns       3537 ns     197470
reversed_ifs_with_hints/75/8       16130 ns      16130 ns      43279
reversed_ifs_with_hints/100/4       3737 ns       3737 ns     187583
reversed_ifs_with_hints/100/8       7446 ns       7446 ns      93782

Analysis

1. The Ordering Does Matter

For 4K iterations and (almost) 100% probability of the highly likely statement, the difference is huge: the reversed version takes about 223% of the ordered version's time:

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/100/4                   1673 ns       1673 ns     417073
reversed_ifs/100/4                  3738 ns       3738 ns     188087

For 4K iterations and 50% probability of the highly likely statement, the difference is about 14%:

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/50/4                    4660 ns       4658 ns     150948
reversed_ifs/50/4                   5342 ns       5341 ns     126800

2. Number of Iterations Does Matter

The difference between 4K and 8K iterations for (almost) 100% probability of the highly likely statement is about two times (as expected):

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/100/4                   1673 ns       1673 ns     417073
ordered_ifs/100/8                   3381 ns       3381 ns     207612

But the difference between 4K and 8K iterations for 50% probability of the highly likely statement is 5.5 times:

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/50/4                    4660 ns       4658 ns     150948
ordered_ifs/50/8                   25636 ns      25635 ns      27852

Why is that so? Because of branch predictor misses. Here are the branch misses for each case mentioned above:

ordered_ifs/100/4    0.01% of branch-misses
ordered_ifs/100/8    0.01% of branch-misses
ordered_ifs/50/4     3.18% of branch-misses
ordered_ifs/50/8     15.22% of branch-misses

So on my i5 the branch predictor fails spectacularly for not-so-likely branches and large data sets.

3. Hints Help a Bit

For 4K iterations the results are somewhat worse for 50% probability and somewhat better for close to 100% probability:

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/50/4                    4660 ns       4658 ns     150948
ordered_ifs/100/4                   1673 ns       1673 ns     417073
ordered_ifs_with_hints/50/4         5551 ns       5551 ns     125160
ordered_ifs_with_hints/100/4        1575 ns       1575 ns     437687

But for 8K iterations the results are always a bit better:

---------------------------------------------------------------------
Benchmark                              Time           CPU Iterations
---------------------------------------------------------------------
ordered_ifs/50/8                   25636 ns      25635 ns      27852
ordered_ifs/100/8                   3381 ns       3381 ns     207612
ordered_ifs_with_hints/50/8        23191 ns      23190 ns      30028
ordered_ifs_with_hints/100/8        3130 ns       3130 ns     221205

So, the hints also help, but just a tiny bit.

The overall conclusion is: always benchmark the code, because the results may surprise you.

Hope that helps.

Andriy Berestovskyy
  • 2
    i5 Nehalem? i5 Skylake? Just saying "i5" is not very specific. Also, I assume you used `g++ -O2` or `-O3 -fno-tree-vectorize`, but you should say so. – Peter Cordes Nov 03 '17 at 20:00
  • Interesting that with_hints is still different for ordered vs. reversed. It would be good if you linked to the source somewhere. (e.g. a Godbolt link, preferably a full-link so link-shortening can't rot.) – Peter Cordes Nov 03 '17 at 20:25
  • 1
    The fact that the branch predictor is able to predict well even at the 4K input data size, i.e., is able to "break" the benchmark by remembering branch outcomes across a loop with a period in the _thousands_ is a testament to the power of modern branch predictors. Keep in mind that predictors are quite sensitive in some cases to things like alignment, so it's hard to draw strong conclusions about some changes. E.g., you noticed opposite behavior for the hint in different cases but it could be explained by the hint randomly changing code layout which affected the predictor. – BeeOnRope Nov 03 '17 at 23:55
  • 1
    @PeterCordes my main point is while we can try to predict outcomes of a change, still we better measure the performance before and after the change... And you are right, I should have mentioned that it was optimized with -O3 and the processor is i5-4460 @ 3.20GHz – Andriy Berestovskyy Nov 06 '17 at 16:14
30

No, you should not, unless you are really sure that the target system is affected. By default, go by readability.

I highly doubt your results. I've modified your example a bit, so reversing the execution order is easier. Ideone rather consistently shows that the reverse order is faster, though not by much. On certain runs even this occasionally flipped. I'd say the results are inconclusive. coliru reports no real difference either. I can check the Exynos 5422 CPU on my Odroid XU4 later on.

The thing is that modern CPUs have branch predictors. There is much, much logic dedicated to pre-fetching both data and instructions, and modern x86 CPUs are rather smart when it comes to this. Some slimmer architectures like ARMs or GPUs might be vulnerable to this. But it is really highly dependent on both the compiler and the target system.

I would say that branch ordering optimization is pretty fragile and ephemeral. Do it only as some really fine-tuning step.

Code:

#include <chrono>
#include <iostream>
#include <random>
#include <algorithm>
#include <iterator>
#include <functional>
#include <vector>

using namespace std;

int main()
{
    //Generate a vector of random integers from 1 to 100
    random_device rnd_device;
    mt19937 rnd_engine(rnd_device());
    uniform_int_distribution<int> rnd_dist(1, 100);
    auto gen = std::bind(rnd_dist, rnd_engine);
    vector<int> rand_vec(5000);
    generate(begin(rand_vec), end(rand_vec), gen);
    volatile int nLow, nMid, nHigh;

    //Count the number of values in each of three different ranges
    //Run the test a few times
    for (int n = 0; n != 10; ++n) {

        //Run the test again, but now sort the conditional statements in reverse-order of likelihood
        {
          nLow = nMid = nHigh = 0;
          auto start = chrono::high_resolution_clock::now();
          for (int& i : rand_vec) {
              if (i >= 95) ++nHigh;               //Least likely branch
              else if (i < 20) ++nLow;
              else if (i >= 20 && i < 95) ++nMid; //Most likely branch
          }
          auto end = chrono::high_resolution_clock::now();
          cout << "Reverse-sorted: \t" << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << "ns" << endl;
        }

        {
          //Sort the conditional statements in order of likelihood
          nLow = nMid = nHigh = 0;
          auto start = chrono::high_resolution_clock::now();
          for (int& i : rand_vec) {
              if (i >= 20 && i < 95) ++nMid;  //Most likely branch
              else if (i < 20) ++nLow;
              else if (i >= 95) ++nHigh;      //Least likely branch
          }
          auto end = chrono::high_resolution_clock::now();
          cout << "Sorted:\t\t\t" << chrono::duration_cast<chrono::nanoseconds>(end-start).count() << "ns" << endl;
        }
        cout << endl;
    }
}
luk32
  • I get the same ~30% difference in performance when I switch the order of the sorted and reverse-sorted if-blocks, as was done in your code. I'm not sure why Ideone and coliru show no difference. – Carlton Oct 19 '17 at 16:57
  • Certainly interesting. I'll try to get some data for other systems, but it might take up to day till I have to play around with it. The question is interesting, especially in the light of your results, but they are so spectacular I had to cross-check it. – luk32 Oct 19 '17 at 16:59
  • If the question is _What is the effect?_ the answer cannot be **No**! – PJTraill Oct 25 '17 at 16:31
  • Yup. But I don't get notifications for updates to the original question. They made answer formulation obsolete. Sorry. I'll edit the content later, to point out it answered original question and shown some results which proved the original point. – luk32 Oct 25 '17 at 23:37
  • This is worth repeating: "By default go by readability." Writing readable code will often get you better results than trying to eek out a tiny performance boost (in absolute terms) by making your code harder for humans to parse. – Andrew Brēza Jul 05 '19 at 16:38
20

Based on some of the other answers here, it looks like the only real answer is: it depends. It depends on at least the following (though not necessarily in this order of importance):

  • Relative probability of each branch. This is the original question that was asked. Based on the existing answers, there seems to be some conditions under which ordering by probability helps, but it appears to not always be the case. If the relative probabilities are not very different, then it is unlikely to make any difference what order they are in. However, if the first condition happens 99.999% of the time and the next one is a fraction of what is left, then I would assume that putting the most likely one first would be beneficial in terms of timing.
  • Cost of calculating the true/false condition for each branch. If the time cost of testing the conditions is really high for one branch versus another, then this is likely to have a significant impact on the timing and efficiency. For example, consider a condition that takes 1 time unit to calculate (e.g., checking the state of a Boolean variable) versus another condition that takes tens, hundreds, thousands, or even millions of time units to calculate (e.g., checking the contents of a file on disk or performing a complex SQL query against a large database). Assuming the code checks the conditions in order each time, the faster conditions should probably be first (unless they are dependent on other conditions failing first).
  • Compiler/Interpreter Some compilers (or interpreters) may include optimizations of one kind or another that can affect performance (and some of these are only present if certain options are selected during compilation and/or execution). So unless you are benchmarking two compilations and executions of otherwise identical code on the same system using the exact same compiler, where the only difference is the order of the branches in question, you're going to have to give some leeway for compiler variations.
  • Operating System/Hardware As mentioned by luk32 and Yakk, various CPUs have their own optimizations (as do operating systems). So benchmarks are again susceptible to variation here.
  • Frequency of code block execution If the block that includes the branches is rarely accessed (e.g., only once during startup), then it probably matters very little what order you put the branches. On the other hand, if your code is hammering away at this code block during a critical part of your code, then ordering may matter a lot (depending on benchmarks).

The only way to know for certain is to benchmark your specific case, preferably on a system identical to (or very similar to) the intended system on which the code will finally run. If it is intended to run on a set of varying systems with differing hardware, operating system, etc., then it is a good idea to benchmark across multiple variations to see which is best. It may even be a good idea to have the code be compiled with one ordering on one type of system and another ordering on another type of system.

My personal rule of thumb (for most cases, in the absence of a benchmark) is to order based on:

  1. Conditions that rely on the result of prior conditions,
  2. Cost of computing the condition, then
  3. Relative probability of each branch.
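
As a sketch of that rule of thumb (the predicate names and their costs here are hypothetical; with no inter-condition dependencies, the chain is ordered by cost first, probability second):

```cpp
#include <string>

// Hypothetical predicates of very different cost.
bool is_enabled(bool flag) { return flag; }           // ~1 time unit: boolean check
bool name_matches(const std::string& s) {             // somewhat costlier: string compare
    return s == "target";
}
bool exists_on_disk(const std::string&) {             // stub standing in for a very
    return false;                                     // expensive disk or database check
}

// Returns which branch fired: 0 = disabled, 1 = name match, 2 = on disk, -1 = none.
// No condition depends on another here, so the cheapest test goes first and the
// most expensive last; among tests of similar cost, the most probable goes first.
int classify(bool flag, const std::string& s, const std::string& path) {
    if (!is_enabled(flag))         return 0;  // cheapest test first
    else if (name_matches(s))      return 1;  // then the moderately cheap one
    else if (exists_on_disk(path)) return 2;  // expensive check only as a last resort
    return -1;
}
```

The expensive check is only ever paid when both cheap tests have already failed, which is usually a bigger win than any branch-prediction effect.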
Ampersat
  • 421
  • 3
  • 6
13

The way I usually see this solved for high-performance code is keeping the order that is most readable, but providing hints to the compiler. Here is an example from the Linux kernel:

if (likely(access_ok(VERIFY_READ, from, n))) {
    kasan_check_write(to, n);
    res = raw_copy_from_user(to, from, n);
}
if (unlikely(res))
    memset(to + (n - res), 0, res);

Here the assumption is that the access check will pass, and that no error is returned in res. Trying to reorder either of these if clauses would just confuse the code, but the likely() and unlikely() macros actually help readability by pointing out what is the normal case and what is the exception.

The Linux implementation of those macros uses GCC-specific features. It seems that clang and the Intel C compiler support the same syntax, but MSVC has no such feature.
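
For reference, the kernel builds these macros on a GCC/Clang builtin; a minimal sketch (the no-op fallback branch for other compilers is my own addition, not the kernel's):

```cpp
// Essentially the Linux kernel definitions (GCC/Clang __builtin_expect);
// the #else branch is a hypothetical no-op fallback for other compilers.
#if defined(__GNUC__)
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)
#else
#define likely(x)   (x)
#define unlikely(x) (x)
#endif

// The hint never changes the value of the condition, only the code layout
// the compiler chooses, so behavior is identical with or without it.
int sign_of(int x) {
    if (unlikely(x < 0))
        return -1;
    return 1;
}
```

With optimization enabled, GCC will typically lay out the `unlikely` body away from the fall-through path, which is exactly the layout effect being hinted at.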

jpa
  • 10,351
  • 1
  • 28
  • 45
  • 4
    This would be more helpful if you could explain how the `likely()` and `unlikely()` macros are defined, and include some information about the corresponding compiler feature. – Nate Eldredge Oct 20 '17 at 13:39
  • 1
    AFAIK, these hints "only" change the memory layout of the code blocks and whether a yes or no will lead to a jump. This may have performance advantages e.g. for the need (or lack thereof) to read memory pages. But this does not rearrange the order in which conditions within a long list of else-ifs are evaluated – Hagen von Eitzen Oct 22 '17 at 11:11
  • @HagenvonEitzen Hmm yeah, that is a good point, it cannot affect the order of `else if` if the compiler is not smart enough to know that the conditions are mutually exclusive. – jpa Oct 22 '17 at 17:39
7

It also depends on your compiler and the platform you’re compiling for.

In theory, the most likely condition should make control jump as little as possible.

Typically the most likely condition should be first:

if (most_likely) {
     // most likely instructions
} else …

The most popular instruction sets use conditional branches that jump when the condition is true. That C code will likely be translated to pseudo-assembly like this:

jump to ELSE if not(most_likely)
// most likely instructions
jump to end
ELSE:
…

This is because taken jumps can make the CPU flush its execution pipeline and stall while the program counter changes (on pipelined architectures, which are very common). Beyond that it’s up to the compiler, which may or may not apply sophisticated optimizations that place the statistically most probable condition so that control makes the fewest jumps.

NoImaginationGuy
  • 1,795
  • 14
  • 24
  • 2
    You stated that the conditional branch occurs when the condition is true, but the "pseudo asm" example shows the opposite. Also, it cannot be said that conditional jumps (much less all jumps) stall the pipeline because modern CPUs typically have branch prediction. In fact, if the branch is predicted to be taken but then *not* taken, the pipeline will be stalled. I'd still try to sort the conditions in descending order of probability, but what the compiler and CPU make of it is *highly* implementation-dependent. – Arne Vogel Oct 19 '17 at 16:09
  • 1
    I put “not(most_likely)” so if most_likely is true the control will go on without jumping. – NoImaginationGuy Oct 19 '17 at 16:13
  • 2
    "The most popular asm’s are based on conditional branches that jump when condition is true".. which ISAs would that be? It's certainly not true for x86 nor for ARM. Hell for basic ARM CPUs (and very ancient x86 ones, even for complex bps they usually still start with that assumption and then adapt) the branch predictor assumes that a forward branch is *not* taken and backwards branches always are, so the opposite of the claim is true. – Voo Oct 19 '17 at 18:02
  • @Voo I guess, you misunderstood the sentence. I understand it as "usually, a conditional branch works like `if (condition) goto SOMEWHERE`", without any relation to branch prediction. It only explains the next code block. – maaartinus Oct 21 '17 at 04:21
  • @maaartinus - well it's usually the opposite: for C code like `if (condition) {// if code }; // following code` then the natural way to compile it to assembly is effectively `if (!condition) jmp following_code`. That is, the `condition == true` case is the fall-through, and the `jump` occurs if the `if` is _not_ taken. Compiling it the other way (as you suggest) would require _two_ jumps: one for the `goto SOMEWHERE` and then one to jump back from `SOMEWHERE` to the merged control flow after the `if`. This analysis doesn't apply if there is an `else` clause, however. – BeeOnRope Nov 04 '17 at 00:00
  • 1
    The compilers [I tried](https://godbolt.org/g/oVCKe8) mostly all used the approach I mentioned above for a simple test. Note that `clang` actually took a different approach for `test2` and `test3`: because of heuristics that indicate that a `< 0` or `== 0` test is likely to be false, it decided to clone the remainder of the function on both paths, so it is able to make the `condition == false` the fall through path. This is feasible only because the remainder of the function is short: in `test4` I added one more operation and it's back to the approach I outlined above. – BeeOnRope Nov 04 '17 at 00:09
  • @BeeOnRope Sure, I agree with everything what you wrote, but I also insist on what I wrote. You're explaining how a C `if` gets compiled. I was explaining, what NoImaginationGuy probably meant with his sentence. The sentence may be confusing and so may be my explanation, but be assured, we mean the same thing (and I'd bet the poster meant it too). Btw., inverting conditions and other tricks are common in Java JIT as it's profile based and the closer is the more likely path to 100%, the more funny code transformation make sense. – maaartinus Nov 04 '17 at 01:33
  • @maaartinus - OK on rereading I get it, I think. Your `if (condition) goto SOMEWHERE` was talking about how a conditional jump at the asm level works logically, and it's correct. I thought that snippet was "C" and implied at at the asm level when the condition was true there was usually a `jmp` but I see that's not the case. I never really read the answer but now it's clear NoImG is saying the same thing too. I'll at least leave my comments there because maybe the godbolt stuff it interesting (like how clang prefers positive values, etc). – BeeOnRope Nov 04 '17 at 01:38
  • 1
    @ArneVogel - correctly predicted taken branches don't totally stall the pipeline on modern CPUs but they are still often significantly worse than not taken: (1) they mean the control flow is not contiguous so the rest of the instructions after the `jmp` aren't useful so fetch/decode bandwidth is wasted (2) even with prediction modern big cores only do one fetch per cycle so it puts a hard limit of 1 taken branch/cycle (OTOH modern Intel can do 2 not-taken/cycle) (3) it's harder for branch prediction to deal with taken consecutive branches and in the case of fast + slow predictors... – BeeOnRope Nov 04 '17 at 01:43
  • ... the slow one might fall behind and you could suffer "mini" mispredicts (front-end resteers or whatever they are called) even if the predictor ultimately generated the correct prediction (4) "not taken" control flow means all the executed code is contiguous which is best for i-cache usage (including uop cache & loop buffer), instruction prefetching, instruction fetch latency & bandwidth. – BeeOnRope Nov 04 '17 at 01:46
  • @BeeOnRope This is what I meant. Even if branch prediction is the most sophisticated you can have in the world, it’s always better at hardware level that instructions go on in sequence. A not predicted branch taken could take 1 or even 2 clock cycles (depending on the implementation of branches). Then what I meant it’s that usually the compiler does the trick, and I put that pseudo-asm that jumps when the condition is not taken (which is not even that rare as you said tho). Then you put up problems about instruction cache, which is a good topic too. – NoImaginationGuy Nov 04 '17 at 03:50
6

I decided to rerun the test on my own machine using Lik32's code. I had to change it because on my Windows setup the high-resolution clock has a resolution of 1 ms, so I used

mingw32-g++.exe -O3 -Wall -std=c++11 -fexceptions -g

vector<int> rand_vec(10000000);

GCC made the same transformation on both original versions.

Note that only the first two conditions are tested, as the third must always be true; GCC is a bit of a Sherlock here.
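
For context, here is a minimal reconstruction of the kind of loop being timed (the original Lik32 harness isn't shown, so the structure and the uniform 0-100 input distribution are assumptions based on the counts printed at the end):

```cpp
#include <chrono>
#include <random>
#include <vector>

struct Counts { int low = 0, mid = 0, high = 0; };

// The loop under test, in the "reverse" order (high tested first).
Counts count_buckets(const std::vector<int>& v) {
    Counts c;
    for (int i : v) {
        if (i >= 95) ++c.high;
        else if (i < 20) ++c.low;
        else ++c.mid;   // remainder: GCC drops the third test entirely
    }
    return c;
}

// 10M random values; uniform over [0, 100] is a guess that roughly matches
// the ~19% low / ~75% mid / ~6% high counts reported below.
std::vector<int> make_input() {
    std::vector<int> rand_vec(10000000);
    std::mt19937 gen(42);
    std::uniform_int_distribution<int> dist(0, 100);
    for (auto& v : rand_vec) v = dist(gen);
    return rand_vec;
}

// Times one pass over the data in nanoseconds.
long long time_pass_ns(const std::vector<int>& v, Counts& out) {
    auto t0 = std::chrono::system_clock::now();
    out = count_buckets(v);
    auto t1 = std::chrono::system_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0).count();
}
```
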

Reverse

.L233:
        mov     DWORD PTR [rsp+104], 0
        mov     DWORD PTR [rsp+100], 0
        mov     DWORD PTR [rsp+96], 0
        call    std::chrono::_V2::system_clock::now()
        mov     rbp, rax
        mov     rax, QWORD PTR [rsp+8]
        jmp     .L219
.L293:
        mov     edx, DWORD PTR [rsp+104]
        add     edx, 1
        mov     DWORD PTR [rsp+104], edx
.L217:
        add     rax, 4
        cmp     r14, rax
        je      .L292
.L219:
        mov     edx, DWORD PTR [rax]
        cmp     edx, 94
        jg      .L293 // >= 95
        cmp     edx, 19
        jg      .L218 // >= 20
        mov     edx, DWORD PTR [rsp+96]
        add     rax, 4
        add     edx, 1 // < 20 Sherlock
        mov     DWORD PTR [rsp+96], edx
        cmp     r14, rax
        jne     .L219
.L292:
        call    std::chrono::_V2::system_clock::now()

.L218: // further down
        mov     edx, DWORD PTR [rsp+100]
        add     edx, 1
        mov     DWORD PTR [rsp+100], edx
        jmp     .L217

And sorted

        mov     DWORD PTR [rsp+104], 0
        mov     DWORD PTR [rsp+100], 0
        mov     DWORD PTR [rsp+96], 0
        call    std::chrono::_V2::system_clock::now()
        mov     rbp, rax
        mov     rax, QWORD PTR [rsp+8]
        jmp     .L226
.L296:
        mov     edx, DWORD PTR [rsp+100]
        add     edx, 1
        mov     DWORD PTR [rsp+100], edx
.L224:
        add     rax, 4
        cmp     r14, rax
        je      .L295
.L226:
        mov     edx, DWORD PTR [rax]
        lea     ecx, [rdx-20]
        cmp     ecx, 74
        jbe     .L296
        cmp     edx, 19
        jle     .L297
        mov     edx, DWORD PTR [rsp+104]
        add     rax, 4
        add     edx, 1
        mov     DWORD PTR [rsp+104], edx
        cmp     r14, rax
        jne     .L226
.L295:
        call    std::chrono::_V2::system_clock::now()

.L297: // further down
        mov     edx, DWORD PTR [rsp+96]
        add     edx, 1
        mov     DWORD PTR [rsp+96], edx
        jmp     .L224

So this doesn't tell us much, except that the last case doesn't need branch prediction.

I then tried all 6 orderings of the ifs; the top 2 are the original reverse and sorted versions. high is >= 95, low is < 20, mid is 20-94, with 10,000,000 iterations each.

high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 44000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 45000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 46000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 43000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 48000000ns
high, mid, low: 44000000ns
low, mid, high: 44000000ns
mid, high, low: 45000000ns
low, high, mid: 45000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 47000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 46000000ns
low, high, mid: 44000000ns

high, low, mid: 43000000ns
mid, low, high: 46000000ns
high, mid, low: 45000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns

high, low, mid: 42000000ns
mid, low, high: 46000000ns
high, mid, low: 44000000ns
low, mid, high: 45000000ns
mid, high, low: 45000000ns
low, high, mid: 44000000ns

1900020, 7498968, 601012

Process returned 0 (0x0)   execution time : 2.899 s
Press any key to continue.

So why is the order high, low, mid (marginally) faster?

Because the most unpredictable branch is last and therefore never has to go through the branch predictor.

          if (i >= 95) ++nHigh;               // most predictable with 94% taken
          else if (i < 20) ++nLow; // (94-19)/94% taken ~80% taken
          else if (i >= 20 && i < 95) ++nMid; // never tested: it is just the remainder once the other two fail.

So the branches will be predicted taken, taken and remainder with

6%+(0.94*)20% mispredicts.

"Sorted"

          if (i >= 20 && i < 95) ++nMid;  // 75% not taken
          else if (i < 20) ++nLow;        // 19/25 76% not taken
          else if (i >= 95) ++nHigh;      //Least likely branch

The branches will be predicted with not taken, not taken and Sherlock.

25%+(0.75*)24% mispredicts

Giving an 18-23% difference (measured difference of ~9%), but we need to calculate cycles rather than mispredict %.

Let's assume a 17-cycle mispredict penalty on my Nehalem CPU, that each check takes 1 cycle to issue (4-5 instructions), and that the loop takes one cycle too. The data dependencies are the counters and the loop variables, but once the mispredicts are out of the way they shouldn't influence the timing.

So for "reverse", we get the timings (this should be the formula used in Computer Architecture: A Quantitative Approach IIRC).

mispredict*penalty+count+loop
0.06*17+1+1+    (=3.02)
(probability)*(first check+mispredict*penalty+count+loop)
(0.19)*(1+0.20*17+1+1)+  (= 0.19*6.4=1.22)
(probability)*(first check+second check+count+loop)
(0.75)*(1+1+1+1) (=3)
= 7.24 cycles per iteration

and the same for "sorted"

0.25*17+1+1+ (=6.25)
(1-0.75)*(1+0.24*17+1+1)+ (=.25*7.08=1.77)
(1-0.75-0.19)*(1+1+1+1)  (= 0.06*4=0.24)
= 8.26

(8.26-7.24)/8.26 ≈ 12.3% vs. ~9% measured (not far from the measured value!?!).
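
These back-of-the-envelope numbers can be sanity-checked mechanically; the sketch below just re-evaluates the answer's own assumptions (17-cycle penalty, 1 cycle per check, 1 cycle of loop overhead):

```cpp
// Expected cycles per iteration for the "reverse" order:
// high (6% mispredict), then low (reached 19% of the time), then mid.
double reverse_cycles() {
    return 0.06 * 17 + 1 + 1                  // = 3.02
         + 0.19 * (1 + 0.20 * 17 + 1 + 1)     // = 0.19 * 6.4 = 1.216
         + 0.75 * (1 + 1 + 1 + 1);            // = 3.0
}

// Expected cycles per iteration for the "sorted" order.
double sorted_cycles() {
    return 0.25 * 17 + 1 + 1                  // = 6.25
         + 0.25 * (1 + 0.24 * 17 + 1 + 1)     // = 0.25 * 7.08 = 1.77
         + 0.06 * (1 + 1 + 1 + 1);            // = 0.24
}
// Relative difference (sorted - reverse) / sorted works out to roughly 12%.
```
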

So the OP's "obvious" assumption is not so obvious after all.

Other tests with more complicated code or more data dependencies will certainly give different results, so measure your own case.

Changing the order of the tests changed the results, but that could be because of different alignments of the loop start, which ideally should be 16-byte aligned on all newer Intel CPUs but isn't in this case.

Farzad Karimi
  • 770
  • 1
  • 12
  • 31
Surt
  • 15,501
  • 3
  • 23
  • 39
4

Put them in whatever logical order you like. Sure, the branch may be slower, but branching should not be the majority of work your computer is doing.

If you are working on a performance-critical portion of code, then certainly use logical order, profile-guided optimization and other techniques, but for general code, I think it's really more of a stylistic choice.

Jack
  • 94
  • 2
  • 6
    Branch prediction failures are expensive. In microbenchmarks, they are *under costed*, because x86s have a large table of branch predictors. A tight loop over the same conditions results in the CPU knowing better than you do which one is most likely. But if you have branches all over your code, you can have your branch prediction cache run out of slots, and the cpu assumes whatever is default. Knowing what that default guess is can save cycles all over your code base. – Yakk - Adam Nevraumont Oct 19 '17 at 17:29
  • @Yakk Jack's answer is the only correct one here. Do not make optimizations that reduce readability if your compiler is able to do that optimization. You wouldn't do constant folding, dead code elimination, loop unrolling or any other optimization if your compiler does it for you, would you? Write your code, use profile guided optimization (which is design to solve this issue because coders suck at guessing) and then see if your compiler optimizes it or not. In the end you don't want to have any branchess in performance critical code anyway. – Christoph Diegelmann Oct 20 '17 at 07:20
  • @Christoph I wouldn't include code I knew to be dead. I wouldn't use `i++` when `++i` would do, because I'm aware that `i++` for some iterators is hard to optimize down to `++i` and the difference (for me) doesn't matter. This is about avoiding pessimization; putting the most likely block first as a *default habit* won't cause a noticable readability reduction (and might actually help!), while resulting in code that is branch-prediction friendly (and thus giving you a uniform small performance boost that cannot be recaptured by later micro optimization) – Yakk - Adam Nevraumont Oct 20 '17 at 13:28
3

If you already know the relative probabilities of the branches, then for performance purposes it is better to use the sorted order: most of the time only one condition (the most likely, true one) will need to be evaluated.

In the unsorted order, the program will on average evaluate more conditions before reaching the true one, which takes time.
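
This average-evaluation-count argument can be made precise for mutually exclusive, exhaustive branches: the branch in position i (1-based) costs i condition evaluations when it fires, so the expected count is the sum of pᵢ·i. A small sketch with purely illustrative probabilities:

```cpp
#include <cstddef>
#include <vector>

// Expected number of condition evaluations for a chain of mutually exclusive,
// exhaustive if / else-if branches, given each branch's probability in the
// order it appears (probabilities here are purely illustrative).
double expected_checks(const std::vector<double>& p) {
    double e = 0.0;
    for (std::size_t i = 0; i < p.size(); ++i)
        e += p[i] * static_cast<double>(i + 1);  // branch i fires after i+1 tests
    return e;
}
```

For probabilities {0.90, 0.09, 0.01} the sorted order needs about 1.11 evaluations on average versus about 2.89 in the reverse order. Note this counts only evaluations; as the benchmarks above show, branch-prediction behavior can matter as much or more.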

aditya rawat
  • 154
  • 10