Why does a preassigned function pointer perform worse than a branch?

Question

I have a class with an enum member variable. One of the member functions bases its behavior on this enum so as a "possible" optimization, I have the two different behaviors as two different functions and I give the class a member function pointer which is set at construction. I simulated this situation like this:

enum catMode {MODE_A, MODE_B};

struct cat
{
    cat(catMode mode) : stamp_(0), mode_(mode) {}

    void
    update()
    {
        stamp_ = (mode_ == MODE_A) ? funcA() : funcB();
    }

    uint64_t stamp_;
    catMode  mode_;
};

struct cat2
{
    cat2(catMode mode) : stamp_(0), mode_(mode)
    {
        if (mode_ = MODE_A)
            func_ = funcA;
        else
            func_ = funcB;
    }

    void
    update()
    {
        stamp_ = func_();
    }

    uint64_t stamp_;
    catMode  mode_;
    uint64_t (*func_)(void);
};

And then I create a cat object and an array of length 32. I traverse the array to bring it into cache, then I call cats update method 32 times and store the latency using rdtsc in the array...

Then I call a function which loops several hundred times using rand(), ulseep(), and some arbitrary strcmp()..come back and I do the 32 thing again.

The result is that the method with the branch seems to always be around 44 +/- 10 cycles whereas the one with the function pointer tends to be around 130. I'm curious as to why this would be the case?

If anything, I would have expected similar performance. Also, templating is hardly an option because full specialization of the real cat class for that one function would be overkill.

@sixlettervariables - possible more that the CPU has branch prediction while a function pointer call involves clearing stack/register. You wouldn't necessarily see this from the assem — Martin Beckett, Jul 25 '12 at 14:38
depending on `funcA` and `funcB`s complexities, they could have been inlined. I agree with @sixlettervariables that checking the assembly is a good idea. It's easy to do, and if nothing else, at least you'll rule out some possible explanations. — Sander De Dycker, Jul 25 '12 at 14:49
You can avoid full specialization when you use a templated struct to wrap the specific function. Then you only need to specialize the struct and the cat class needs to be specified only once. — tokage, Jul 25 '12 at 14:56
What processor is this? Also, can you provide a complete SSCCE so I can run this myself? — Mysticial, Jul 25 '12 at 15:07
@tokage wait that is interesting, where can I read about this or how does this work? — Palace Chan, Jul 25 '12 at 22:09
@Palace Chan [example at ideone](http://ideone.com/GnWRX), of course you can use this only if you can template the Cat class — tokage, Jul 26 '12 at 09:37

score 4 · Accepted Answer · edited May 23 '17 at 12:13

4

Without a complete SSCCE I can't approach this the same way that I usually do with such questions.
So the best I can do is speculate:

The core difference between your two cases is that you have a branch vs. a function pointer. The fact that you are seeing a difference at all strongly hints that funcA() and funcB() are very small functions.

Possibility #1:

In the branch version of the code, funcA() and funcB() are being inlined by the compiler. Not only does that skip the function call overhead, but if the functions are trivial enough, the branch could also be completely optimized out as well.

Function pointers, on the other hand, cannot be inlined unless the compiler can resolve them at compile-time.

Possibility #2:

By comparing a branch against a function-pointer, you are putting the branch-predictor against the branch target predictor.

Branch target prediction is not the same as branch prediction. In the branch case, the processor needs to predict which way to branch. In the function pointer case, it needs to predict where to branch to.

It's very likely that your processor's branch predictor is much more accurate than its branch target predictor. But then again, this is all guesswork...

edited May 23 '17 at 12:13

Community

1
1

answered Jul 25 '12 at 15:31

Mysticial

464,885
45
335
332

Thanks! I wish I had seen this before. What is a complete SSCCE? Yea funcA and funcB are relatively small. It is a 2 cpu intel Xeon with 6 cores per cpu. – Palace Chan Jul 25 '12 at 22:06
SSCCE = Short Self Contained Compilable Example, so basically a small snippet that reproduces the problem that you're having. – Mysticial Jul 25 '12 at 23:05
what is a good way/resource to learn the strong points, latencies, and such of your processor? For example Xeon, or Sandy Bridge? I think it would help make better (albeit nonportable) optimization decisions. – Palace Chan Jul 26 '12 at 21:45
[Agner Fog's blog](http://www.agner.org/optimize/) has some very good resources. He's got large table of latencies of the majority of current x86 processors. But be aware that performance is a lot more complicated than just adding up latencies. So you'll have to do a lot of testing and experimentation. – Mysticial Jul 26 '12 at 21:52

Why does a preassigned function pointer perform worse than a branch?

1 Answers1