5

Suppose you have a very large graph with lots of processing on its nodes (tens of millions of operations per node). The core routine is the same for each node, but there are additional operations based on internal conditions. There can be two such conditions, which produces four cases: (0,0), (1,0), (0,1), (1,1). E.g. (1,1) means that both conditions hold. The conditions are established once per node (independently for each node) and, from then on, never change. Unfortunately, they are determined at runtime and in a fully unpredictable way (based on data received via HTTP from an external server).

What is fastest in such a scenario (taking into account modern compiler optimizations, which I know little about)?

  • simply using "IFs": if (condition X) perform additional operation X;
  • using inheritance to derive four classes from a base class (exposing method OPERATION) to get the proper operation and save tens of millions of "ifs" (but I am not sure whether this is a real saving; inheritance must have its overhead too);
  • using a pointer to function, assigned once based on the conditions.

It would take me a long time to get to the point where I can test this myself (I don't have such big data yet, and this will be integrated into a bigger project, so it would not be easy to test all versions).

Reading the answers: I know that I probably have to experiment with it. But apart from everything else, this is essentially a question of what is faster:

tens of millions of IF statements and normal, statically known function calls VS function pointer calls VS inheritance (which I think is not the best idea in this case, and I am considering eliminating it from further inspection). Thanks for any constructive answers (not saying that I shouldn't care about such minor things ;)

user1314926
  • 180
  • 5
  • 2
    Related http://stackoverflow.com/q/11957632/332733 Also this question is WAY too broad, please narrow it down to something that can actually be answered – Mgetz Nov 14 '13 at 13:47
  • 1
    Take a look here : http://stackoverflow.com/questions/1955074/virtual-methods-or-function-pointers Function pointer should be faster than a virtual method – benjarobin Nov 14 '13 at 13:48
  • 3
    You can get an exact answer by benchmarking your options. – Maxim Egorushkin Nov 14 '13 at 13:49
  • Could you also consider a switch() statement as an option for your case? When you have more than 2 choices, a switch is often faster than a series of IFs. – BlueMonkMN Nov 14 '13 at 13:52
  • 1
    @BlueMonkMN Functions pointers or virtual functions can be slower than `if` statements. With `if` statements the CPU can employ branch prediction and speculative execution. – Maxim Egorushkin Nov 14 '13 at 13:52
  • 1
    @MaximYegorushkin if you're at the point of worrying about pipeline stalls... you should have solved all the algorithmic issues first. This question stinks of prior-optimization. – Mgetz Nov 14 '13 at 13:55
  • It depends on the architecture, and the compiler. Generally (at least in the types of applications I've profiled, on the machines I've used), derivation is faster than using `if` or `case`, but it's certainly not an absolute rule. (I've been told, for example, that HP's PA handles indirect jumps particularly ineffectively.) – James Kanze Nov 14 '13 at 13:55
  • @MaximYegorushkin It depends on the machine, and the context. In the cases I've measured, virtual functions were faster than `if`s. (But from what I've heard about other machines, this isn't always the case.) – James Kanze Nov 14 '13 at 13:57
  • @JamesKanze True, that's why there is no substitute for benchmarking. – Maxim Egorushkin Nov 14 '13 at 14:10
  • It's not prior-optimization. OP asks if an obvious (maybe even not compiler-related) overhead exists in some case. – TT_ stands with Russia Nov 14 '13 at 14:36
  • It's quite possible that you spend weeks benchmarking all possibilities and discover that the performance is exactly the same in all cases! I'd go with whatever was easiest to read and understand. – Aaron McDaid Nov 14 '13 at 15:21

5 Answers

3

There is no real answer except to measure the actual code on the real data. At times, in the past, I've had to deal with such problems, and in the cases I've actually measured, virtual functions were faster than if's. But that doesn't mean much, since the cases I measured were in a different program (and thus a different context) than yours. For example, a virtual function call will generally prevent inlining, whereas an if is inline by nature, and inlining may open up additional optimization possibilities.

Also, the machines I measured on handled virtual functions pretty well; I've heard that some other machines (HP's PA, for example) are very inefficient in their implementation of indirect jumps (including not only virtual function calls, but also the return from the function; again, the lost opportunity to inline costs).

James Kanze
  • 150,581
  • 18
  • 184
  • 329
  • BTW, there is a relatively new gcc command line switch [`-fdevirtualize-speculatively`](http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html#index-fdevirtualize-799). – Maxim Egorushkin Nov 14 '13 at 14:42
  • What do you do if you don't have the resources to measure actual code. Just pick the most readable method until you can test performance? Is it even worth guessing at a solution that might perform better, assuming it's not confusing? – BlueMonkMN Nov 14 '13 at 16:27
  • @BlueMonkMN Use the most natural method until you know you have a performance problem, and the profiler tells you where. (And I know that in this case, that's an awkward answer, since changing the strategy will probably require rewriting large blocks of code. But it is, impossible to say which strategy will be faster without measuring, in real life conditions. Even things like the number of instances of each type can affect the results.) – James Kanze Nov 14 '13 at 17:08
2

If you absolutely must have the fastest way, and the processing order of the nodes is not relevant, make four different types, one for each case, and define a process function for each. Then, in a container class, keep four vectors, one per node type. Upon creation of a new node, gather all the data you need, including the conditions, create a node of the correct type, and push it into the matching vector. Then, when you need to process all your nodes, process them type by type: first all nodes in the first vector, then the second, and so on.

Why would you want to do this:

  • No ifs for the state switching
  • No vtables
  • No function indirection

But much more importantly:

  • No instruction cache thrashing (you're not jumping to a different part of your code for every next node)
  • No branch prediction misses for state switching ifs (since there are none)

Even if you had inheritance with virtual functions, and thus function indirection through vtables, simply sorting the nodes by type in your vector may already make a world of difference in performance: any instruction cache thrashing would essentially be gone, and, depending on the branch prediction method, branch mispredictions could also be reduced.

Also, don't make a vector of pointers; make a vector of objects. With pointers you have an extra addressing indirection, which in itself is not that worrisome, but the problem is that it may lead to data cache thrashing if the objects are spread more or less randomly throughout memory. If, on the other hand, the objects are placed directly in the vector, processing will go through memory linearly, the cache will hit nearly every time, and cache prefetching may actually be able to do a good job.

Note, though, that you would pay heavily in data structure creation if you don't do it correctly: if at all possible, reserve enough capacity in each vector for all your nodes right away, since reallocating and moving every time a vector runs out of space can become expensive.

Oh, and yes, as James mentioned: always, always measure! What you think may be the fastest way may not be; things can be very counterintuitive, depending on all kinds of factors such as optimizations, pipelining, branch prediction, cache hits/misses, and data structure layout. What I wrote above is a fairly general approach, but it is not guaranteed to be the fastest, and there are definitely ways to do it wrong. Measure, measure, measure.

P.S. Inheritance with virtual functions is roughly equivalent to using function pointers. Virtual functions are usually implemented via a vtable pointer at the head of the object, pointing to a table of function pointers to the implementations of the virtuals for the object's actual type. Whether `if`s are faster than virtuals, or the other way around, is a very difficult question whose answer depends completely on the implementation, compiler, and platform used.

wich
  • 16,709
  • 6
  • 47
  • 72
1

I'm actually quite impressed by how effective branch prediction can be, and only the `if` solution allows inlining, which can also make a dramatic difference. Virtual functions and pointers to functions also involve loading an address from memory, which can cause cache misses.

But you have four cases here, so branch misses can be expensive.

Without the ability to test and verify, the question really can't be answered, especially since it's not even clear that this would be a performance bottleneck significant enough to warrant optimization efforts.

In cases like this, I would err on the side of readability and ease of debugging, and go with `if`.

Glenn Teitelbaum
  • 10,108
  • 3
  • 36
  • 80
1

Many programmers have taken classes and read books that go on about certain favorite subjects: pipelining, caching, branch prediction, virtual functions, compiler optimizations, big-O algorithms, etc., and the performance of those.

If I could make an analogy to boating, these are things like trimming weight, tuning power, adjusting balance and streamlining, assuming you are starting from some speedboat that's already close to optimal.

Never mind you may actually be starting from the Queen Mary, and you're assuming it's a speedboat.

It may very well be that there are ways to speed up the code by large factors, just by cutting away fat (masquerading as good design), if only you knew where it was. Well, you don't know where it is, and guessing is a waste of time, unless you value being wrong.

When people say "measure and use a profiler" they are pointing in the right direction, but not far enough. Here's an example of how I do it, and I made a crude video of it, FWIW.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
0

Unless there's a clear pattern to these attributes, no branch predictor can effectively predict this data-dependent condition for you. Under such circumstances, you may be better off avoiding control speculation (and the penalty of branch mispredictions) and just waiting for the actual data to arrive and resolve the control flow (which is more likely to happen with virtual functions). You'll have to benchmark, of course, to verify this, as it depends on the actual pattern (for example, whether you have even small groups of similarly "tagged" elements).

The sorting suggested above is nice and all, but note that it converts a problem that's plain O(n) into an O(n log n) one, so for large sizes you'll lose unless you can sort once and traverse many times, or otherwise cheaply maintain the sorted state.

Note that some predictors may also attempt to predict the target address of an indirect call, so you might face the same problem there.

However, I must agree with the comments regarding early optimization: do you know for sure that the control flow is your bottleneck? What if fetching the actual data from memory takes longer? In general, it seems your elements can be processed in parallel, so even on a single thread (and much more so with multiple cores) you should be bandwidth-bound rather than latency-bound.

Leeor
  • 19,260
  • 5
  • 56
  • 87