
Let's say you have a call to a method that calculates a value and returns it:

double calculate(const double& someArg);

You implement another calculate method that has the same signature as the first one but works differently:

double calculate2(const double& someArg);

You want to be able to switch from one to the other based on a boolean setting, so you rename the original implementation to calculate1 and end up with something like this:

double calculate(const double& someArg)
{
  if (useFirstVersion) // <-- this is a boolean
    return calculate1(someArg); // actual first implementation
  else
    return calculate2(someArg); // second implementation
}

The boolean might change at runtime, but only rarely.

I notice a small but noticeable performance hit, which I suppose is due to either branch misprediction or cache-unfriendly code.

How can I optimize this to get the best runtime performance?


My thoughts and attempts on this issue:

I tried using a pointer to member function to try to avoid branch mispredictions.

The idea was that when the boolean changes, I update the pointer. This way there is no if/else; we call through the pointer directly.

The pointer is defined like this:

double (ClassWeAreIn::*pCalculate)(const double& someArg); // updated whenever the boolean changes

... and the new calculate method becomes:

double calculate(const double& someArg)
{
  return (this->*pCalculate)(someArg);
}
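
For completeness, this is roughly how the pointer is kept in sync when the setting changes (setUseFirstVersion is a hypothetical setter I use for illustration; it is not in the code above):

void ClassWeAreIn::setUseFirstVersion(bool useFirst)
{
  // Hypothetical setter: update the flag and the member-function pointer together,
  // so the hot path can call through pCalculate without any if/else.
  useFirstVersion = useFirst;
  pCalculate = useFirst ? &ClassWeAreIn::calculate1
                        : &ClassWeAreIn::calculate2;
}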

I tried using it in combination with __forceinline and it did make a difference (I am unsure whether that should be expected, as the compiler should have done it already?). Without __forceinline it was the worst performance-wise, and with __forceinline it seemed to be much better.

I thought of making calculate a virtual method with two overrides, but I read that virtual methods are not a good way to optimize code, as we still have to find the right method to call at runtime. I did not try it, though.
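
For reference, a rough sketch of what I had in mind with the virtual-method idea (the class names are made up for illustration, and I did not benchmark this):

#include <cmath>

struct Calculator
{
  virtual ~Calculator() = default;
  virtual double calculate(const double& someArg) const = 0;
};

struct FirstCalculator : Calculator
{
  // stands in for the first implementation
  double calculate(const double& someArg) const override { return std::asin(someArg); }
};

struct SecondCalculator : Calculator
{
  // stands in for the second implementation
  double calculate(const double& someArg) const override { return std::acos(someArg); }
};

The caller would keep a Calculator* and swap the object whenever the setting changes, so the hot path contains only a virtual call instead of an if/else.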

However, whatever modifications I made, I never seemed able to restore the original performance (maybe it is not possible?). Is there a design pattern to deal with this in the most optimal way (the cleaner and easier to maintain, the better)?


A complete example for VS:

main.cpp

#include "stdafx.h"
#include "SomeClass.h"
#include <time.h>
#include <stdlib.h>
#include <chrono>
#include <iostream>

int main()
{
  srand(static_cast<unsigned>(time(NULL)));

  auto start = std::chrono::steady_clock::now();

  SomeClass someClass;

  double result;

  // calculate calls rand() internally, so the compiler cannot fold the loop into a single call
  for (long long i = 0; i < 1000000000; ++i)
    result = someClass.calculate(0.784542);

  auto end = std::chrono::steady_clock::now();

  std::chrono::duration<double> diff = end - start;

  std::cout << diff.count() << std::endl;

  return 0;
}

SomeClass.cpp

#include "stdafx.h"
#include "SomeClass.h"
#include <math.h>
#include <stdlib.h>

double SomeClass::calculate(const double& someArg)
{
  if (useFirstVersion)
    return calculate1(someArg);
  else
    return calculate2(someArg);
}

double SomeClass::calculate1(const double& someArg)
{
  return asinf((rand() % 10 + someArg)/10);
}

double SomeClass::calculate2(const double& someArg)
{    
  return acosf((rand() % 10 + someArg) / 10);
}

SomeClass.h

#pragma once
class SomeClass
{
public:
  bool useFirstVersion = true;

  double calculate(const double& someArg);
  double calculate1(const double& someArg);
  double calculate2(const double& someArg);
};

(I did not include the pointer-to-function version in the example, since it only seemed to make things worse.)


Using the example above, I get an average of 14.61 s when calling calculate1 directly in main, whereas I get an average of 15.00 s when calling calculate (with __forceinline, which seems to make the gap smaller).

Norgannon
  • What is `useFirstVersion`, a `constexpr`? – πάντα ῥεῖ Oct 30 '18 at 21:21
  • No, it is a boolean that can change at runtime (although it rarely does). – Norgannon Oct 30 '18 at 21:23
  • With a function pointer, you trade branch prediction for jump prediction. – Jarod42 Oct 30 '18 at 21:24
  • "How to make sure to avoid branch misprediction when calling a method" - There is *no way* to *guarantee* that the CPU will not mis-predict a branch, ever. – Jesper Juhl Oct 30 '18 at 21:25
  • What I mean is: if there is a way to avoid having two branches using some trick or design pattern, that is what I am looking for. – Norgannon Oct 30 '18 at 21:26
  • @Norgannon It would be cool if you could post a [MCVE] that demonstrates your problem, and can be used by everyone to play around with what you have. – πάντα ῥεῖ Oct 30 '18 at 21:27
  • Also [this](https://stackoverflow.com/questions/1440570/likely-unlikely-equivalent-for-msvc) might help. – πάντα ῥεῖ Oct 30 '18 at 21:29
  • I just added a complete minimal example that should reproduce the issue. – Norgannon Oct 30 '18 at 21:49
  • I think you should move `auto start = std::chrono::steady_clock::now();` from before the `srand(time(NULL))` [which has syscall overhead] to immediately before the `for` loop. The small differences in cache or branch prediction might be swamped. Also, if you vary the `for` loop iteration limit [up or down], what happens? You might be measuring the effect of time slicing rather than your algorithm if the iteration count is too high. – Craig Estey Oct 30 '18 at 21:56
  • I would expect your example to be compiled to something that does not measure what you think it does when optimizations are turned on, and without optimizations it's pointless to compare the performance. Do you enable compiler optimizations? – 463035818_is_not_an_ai Oct 30 '18 at 22:00
  • What I mean is that `for (long long i = 0; i < 1000000000; ++i) result = someClass.calculate(0.784542);` is the same as just `result = someClass.calculate(0.784542);`, and a clever compiler might realize that. – 463035818_is_not_an_ai Oct 30 '18 at 22:04
  • On x86, you may want to use the hardware's `PEBS` performance monitoring. See: https://stackoverflow.com/questions/44218275/good-resources-on-how-to-program-pebs-precise-event-based-sampling-counters – Craig Estey Oct 30 '18 at 22:05
  • Yes, I had issues at first because the compiler wasn't actually recomputing it, which is why calculate does call rand(). Since I added that to the example, it does seem to work fine. As for the clock, I will update the code in the example, but I don't think it changes much. – Norgannon Oct 30 '18 at 22:06
  • ah ok, just wanted to be sure – 463035818_is_not_an_ai Oct 30 '18 at 22:07
  • I was looking up some other question more or less related to this one and I found [this question](https://softwareengineering.stackexchange.com/questions/301510/in-general-is-it-worth-using-virtual-functions-to-avoid-branching) on the software engineering stack exchange which does address some of my concerns, although it is slightly less specific than my own question – Norgannon Oct 31 '18 at 10:36
  • According to [this question](https://stackoverflow.com/questions/10757167/do-function-pointers-force-an-instruction-pipeline-to-clear) (and its answers), using a function pointer is quite similar to, if not the same as, a virtual method performance-wise (and both seem like a terrible solution for small functions). – Norgannon Oct 31 '18 at 13:27

2 Answers


Since useFirstVersion rarely changes, the execution path of calculate is very easily predictable by most branch prediction techniques. Performance degrades a bit because of the extra code needed to implement the if/else logic, and it also depends on whether the compiler inlines calculate, calculate1, or calculate2. Ideally, all of them should be inlined, although this is less likely to happen than when calling calculate1 or calculate2 directly, because the code size is larger.

Note that I have not tried to reproduce your results, but there is nothing particularly suspicious about a 3% performance degradation.

If you can arrange for useFirstVersion to never change dynamically, then you can turn it into a macro. Otherwise, the idea of calling calculate through a function pointer would eliminate most of the performance overhead. By the way, I don't think MSVC can inline calls through function pointers, yet these functions are good candidates for inlining.
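
For example, if the setting can be fixed at build time, something along these lines removes the branch entirely (USE_FIRST_VERSION is a hypothetical compile-time macro, not part of the question's code):

#ifdef USE_FIRST_VERSION
double SomeClass::calculate(const double& someArg)
{
  return calculate1(someArg); // selected at compile time, no runtime branch
}
#else
double SomeClass::calculate(const double& someArg)
{
  return calculate2(someArg);
}
#endif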

Hadi Brais

In the end, if you are in the same situation as I was, I would advise the following:

  • Don't worry about branch misprediction if the branch outcome rarely ever changes.

The cost seems to be marginal, although I can't really provide exact figures to back that up.

  • The overhead of the new intermediary methods can be mitigated with __forceinline in VC++.

I am able to notice the difference, and it was in the end the best way to avoid degrading performance. Only go this way if the methods you are inlining are small, like simple getters and such. I don't know why my compiler wouldn't choose to inline the methods by itself, but __forceinline actually did the trick (even though you cannot be sure the compiler will inline the methods, since it can still refuse in some cases). A sketch of how I applied it is below.
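
For illustration, here is a sketch of how the keyword can be applied to the header from the question (__forceinline is MSVC-specific, and the method bodies have to be visible at the call site for inlining to happen):

#pragma once
#include <math.h>
#include <stdlib.h>

class SomeClass
{
public:
  bool useFirstVersion = true;

  // Definitions live in the header so the compiler can actually inline them.
  __forceinline double calculate1(const double& someArg)
  {
    return asinf((rand() % 10 + someArg) / 10);
  }

  __forceinline double calculate2(const double& someArg)
  {
    return acosf((rand() % 10 + someArg) / 10);
  }

  double calculate(const double& someArg)
  {
    return useFirstVersion ? calculate1(someArg) : calculate2(someArg);
  }
};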

Norgannon