Is there any template usage can optimize out the branch?

Question

The class A has a template member function A::runImpl. The function A::run calls a specialized implementation based on the value of A::m_case, which is set within the class constructor.

In my project, the run function is called very frequently. It will speedup over 5% if the branch within it can be eliminated. Is there any template usage can do this? My project is compiled by GCC 7.3.1 with C++14.

#include <iostream>
#include <cstdlib>
#include <cassert>
using namespace std;

class A {
public:
  A (uint32_t * arr, size_t len) : m_case(0) {
    for (size_t i = 0; i < len; ++i) {
      m_case += arr[i];
    }
  }

  template <size_t> void runImpl() { assert(0); };
  void run();

private:
  size_t m_case;
};

template <>
inline void A::runImpl<0>() {
  cout << "Default execution path." << endl;
}

template <>
inline void A::runImpl<1>() {
  cout << "Optimized execution path 1." << endl;
}

template <>
inline void A::runImpl<2>() {
  cout << "Optimized execution path 2." << endl;
}

void A::run() {
  switch (m_case) {
    case 1:
      runImpl<1>();
      break;
    case 2:
      runImpl<2>();
      break;
    default:
      runImpl<0>();
      break;
  }
}

int main() {
  uint32_t arr[] = {1, 1};
  A a(arr, 2);
  a.run();
  return 0;
}

Is `m_case` something that will be known at compile time or can it be set at runtime? — NathanOliver, Feb 23 '23 at 16:59
Some template trick would require compile time constants. Is the array provided to the constructor `constexpr`? If not, you are caught in runtime already and templates are out. You *could* then, though, do the switch-case already in the constructor and store the address of the function to be called in a member function pointer. Then `run` would just call the function stored in there. — Aconcagua, Feb 23 '23 at 16:59
@NathanOliver The `m_case` is the sum of an arrary, which is an arg. of the constructor. — Hovin, Feb 23 '23 at 17:02
Yes, but is that a value you will know at compile time? It is in your example but that might just be because it's example code. — NathanOliver, Feb 23 '23 at 17:03
You might trade switch case with function pointer, but branch prediction is generally better than jump predictor... — Jarod42, Feb 23 '23 at 17:06
@Aconcagua In the way you sugguested, the branch will goes to a function call. In my example, all implementations of `runImpl` are `inline`. I have tested that a function call within `run` is more expensive then branch presented in my example. — Hovin, Feb 23 '23 at 17:10
Okay. I will throw my hat in with the other and suggest you use a function pointer and call the function through the pointer. It may or may not give you an improvement. — NathanOliver, Feb 23 '23 at 17:10
[This one](https://stackoverflow.com/questions/938518/how-to-store-goto-labels-in-an-array-and-then-jump-to-them) might be pretty interesting for you – though you rely on a non-standard solution then. Have an array with three labels to jump to and add `m_state = m_state > 3 ? 0 : m_state;` to your constructor... — Aconcagua, Feb 23 '23 at 17:22
Side note: About [`using namespace std`](https://stackoverflow.com/questions/1452721/why-is-using-namespace-std-considered-bad-practice)... — Aconcagua, Feb 23 '23 at 17:24

Aconcagua · Answer 1 · 2023-02-23T17:51:22.993

Non-portable solution!!!

You can try to optimise with your own jump table (adopted from this answer); this relies on a non-standard feature of GCC (and clang) that allows to take the address of a label; to be able to profit from you need to make sure that m_state is within proper range of [0; 2], so you would add

m_state = m_state < 3 ? m_state : 0;

to your constructor (or use a separate variable, if you still need the original value for other purposes).

Then your run function can look like this:

void A::run()
{
    static void* levels[] = { &&LEVEL0, &&LEVEL1, &&LEVEL2 };
    goto *levels[m_state];

    LEVEL0:
    runImpl<0>();
    return;
    LEVEL1:
    runImpl<1>();
    return;
    LEVEL2:
    runImpl<2>();
    return;
}

If this is indeed faster than your current switch/case solution you need to measure on your own...

Demonstration on godbolt.

To still retain portability you might detect the compiler according to this site and enable this optimisation for those compilers you know support it and fall back to the switch statement for the others.

score 0 · Answer 2 · answered Feb 23 '23 at 21:40

You could try using cmov instruction to avoid using branches. However, cmov doesn't necessarily make your code run faster. I tried running your code with/without the cmove instruction and interestingly, the use of cmov degrades performance.

Without cmov: Optimized execution path 2. ->Time: 0.000029

With cmov: Optimized execution path 2. ->Time: 0.000038

Try compiling your code with/without these flags to enable/disable using cmov:

-O3 -g -fno-if-conversion -fno-if-conversion2 -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -fno-inline

Is there any template usage can optimize out the branch?

2 Answers2

Non-portable solution!!!