I am analyzing the difference between two designs that process millions of messages. One design uses polymorphism and the other does not: in the polymorphic design, each message is represented by a polymorphic subtype.
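For reference, the two designs are shaped roughly like this (a minimal sketch; the message and handler names are illustrative, not my actual code):

    #include <cstdint>

    // Polymorphic design: one subtype per message kind,
    // dispatched through a virtual call.
    struct Message {
        virtual ~Message() = default;
        virtual void process() = 0;
    };

    struct QuoteMessage : Message {
        void process() override { /* handle a quote */ }
    };

    struct TradeMessage : Message {
        void process() override { /* handle a trade */ }
    };

    // Non-polymorphic design: a type tag tested with if statements.
    enum class MessageKind : std::uint8_t { Quote, Trade };

    struct FlatMessage {
        MessageKind kind;
        // ... payload ...
    };

    void processFlat(const FlatMessage& m) {
        if (m.kind == MessageKind::Quote) {
            // handle a quote
        } else if (m.kind == MessageKind::Trade) {
            // handle a trade
        }
    }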
I have profiled both designs using VTune. The high-level summary data seems to make sense: the polymorphic design has a higher branch-mispredict rate, a higher CPI, and a higher ICache-miss rate than the non-polymorphic version implemented with if statements.
The polymorphic design has a line of source code like this:
    object->virtualFunction();
and this line is called millions of times, with the subtype changing from call to call. I expected the polymorphic design to be slower because of branch target mispredictions and instruction cache misses, and as noted above, the VTune Summary tab seems to confirm this. However, when I look at the metrics next to that line of source code, there are absolutely no metrics except for:
- Filled pipeline slots total -> Retiring -> General retirement
- Filled pipeline slots self -> Retiring -> General retirement
- Unfilled pipeline slots total -> Front end bound -> Front end bandwidth -> Front end bandwidth MITE
- Unfilled pipeline slots self -> Front end bound -> Front end bandwidth -> Front end bandwidth MITE
None of the branch prediction columns have any data, and neither do the instruction cache miss columns.
Could somebody please comment on whether this seems sensible? To me it doesn't: how can there be no branch misprediction or instruction cache miss statistics for a line of polymorphic code where the branch target changes constantly from message to message?
This cannot be due to compiler optimizations or inlining, because the compiler cannot know the dynamic subtype of the object at the call site, so it cannot devirtualize the call.
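To make that concrete: building on the sketch above, the dynamic type is chosen from a tag read off the wire at run time, roughly like this (the factory is illustrative):

    #include <memory>

    // The concrete subtype depends on data that only exists at run
    // time, so the compiler cannot prove the dynamic type at the
    // call site and cannot devirtualize or inline the virtual call.
    std::unique_ptr<Message> makeMessage(std::uint8_t wireTag) {
        if (wireTag == 0)
            return std::make_unique<QuoteMessage>();
        return std::make_unique<TradeMessage>();
    }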
How should I profile the overhead of the polymorphism using VTune?