
Background

The following critical loop of a piece of numerical software, written in C++, basically compares two objects by one of their members:

for(int j=n;--j>0;)
    asd[j%16]=a.e<b.e;

a and b are of class ASD:

struct ASD  {
    float e;
    ...
};

I was investigating the effect of putting this comparison in a lightweight member function:

bool test(const ASD& y)const {
    return e<y.e;
}

and using it like this:

for(int j=n;--j>0;)
    asd[j%16]=a.test(b);
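
For reference, a minimal self-contained reproduction of the two variants (a sketch, not the original benchmark; the int element type of asd matches the declaration quoted in the accepted answer, and the timing scaffold is added only for illustration):

#include <cstdio>
#include <ctime>

struct ASD {
    float e;
    bool test(const ASD& y) const { return e < y.e; }
};

int main() {
    const int n = 100000000;
    ASD a = {0.5f};
    ASD b = {1.5f};
    int* asd = new int[16]();   // int elements, as in the real code

    std::clock_t t0 = std::clock();
    for (int j = n; --j > 0;)
        asd[j % 16] = a.e < b.e;     // variant 1: comparison written out
    std::clock_t t1 = std::clock();
    for (int j = n; --j > 0;)
        asd[j % 16] = a.test(b);     // variant 2: lightweight member function
    std::clock_t t2 = std::clock();

    std::printf("direct: %.0f ms, member: %.0f ms (asd[0]=%d)\n",
                1000.0 * (t1 - t0) / CLOCKS_PER_SEC,
                1000.0 * (t2 - t1) / CLOCKS_PER_SEC,
                asd[0]);             // read the result so the loops aren't dead code
    delete[] asd;
    return 0;
}

Note that an aggressive optimizer is free to hoist the loop-invariant comparison out of either loop entirely (see the comments below), so any such harness needs a check of the generated assembly before its numbers are trusted.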

The compiler is inlining this function, but the problem is that the generated assembly is different and causes more than 10% runtime overhead. I have two questions:

Questions

  1. Why is the compiler producing different assembly code?

  2. Why is the produced assembly slower?

EDIT: The second question has been answered by implementing @KamyarSouri's suggestion (j%16). The assembly code now looks almost identical (see http://pastebin.com/diff.php?i=yqXedtPm). The only remaining differences are lines 18, 33, and 48 of the diff:

000646F9  movzx       edx,dl 
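
(For context on why the j%16 suggestion removes noise: when the index is known to be non-negative, a modulo by a power of two compiles down to a single bitwise AND, the and edx,0Fh visible in the listings below, whereas j%10 requires a multiply-by-reciprocal and shift sequence. A hedged sketch, using unsigned so the mask form is guaranteed:)

// Power-of-two modulus: one AND instruction.
unsigned slot16(unsigned j) { return j % 16u; }  // -> and eax,0Fh

// Non-power-of-two modulus: magic-number multiply plus shifts.
unsigned slot10(unsigned j) { return j % 10u; }

In the original loop j is a signed int, but the --j > 0 condition lets the compiler prove it non-negative inside the body, which is why the posted assembly also gets away with a plain AND.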

Material

This chart shows the FLOP/s (up to a scaling factor) for 50 test runs of my code.


The gnuplot script to generate the plot: http://pastebin.com/8amNqya7

Compiler Options:

/Zi /W3 /WX- /MP /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /MT /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Gd /analyze-

Linker Options: /INCREMENTAL:NO "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE

Johannes Gerer
  • Nice question. It might be instructive to post the optimization and other relevant compiler settings used to produce this result. – Ted Hopp Dec 21 '11 at 01:08
  • @user578832: not an answer to your question but wouldn't *"loop unrolling"* be useful in your case? Wouldn't it help get more precise readings between the two methods you're comparing? (you may be spending less time in the *for* comparison and you may notice even bigger differences) – TacticalCoder Dec 21 '11 at 01:09
  • I haven't read the assembler, but won't the compiler pull the comparison out of the loop, because it's the same thing on each iteration? – Oliver Charlesworth Dec 21 '11 at 01:11
  • @user988052: See the assembly. The loop has been unrolled fivefold. – Johannes Gerer Dec 21 '11 at 01:17
  • Oh man... Somebody seems to have learned the art of asking stellar questions... Only this time, I don't have the answer off the top of my head... – Mysticial Dec 21 '11 at 01:18
  • Maybe try some other compilers? I've never had much luck getting really good code out of MSVC. Try Borland and gcc. – wallyk Dec 21 '11 at 01:23
  • @Johannes Gerer: `j%10` can take quite some time compared to the `a.e<b.e` comparison. – Kamyar Souri Dec 21 '11 at 01:24
  • @wallyk: Yes, I might do that. But it's also interesting to note that one of the major C++ compilers (MSVC) does not inline for free! This kind of problem forces you to perform micro-optimization, which in turn is prone to premature optimization, and that's bad! – Johannes Gerer Dec 21 '11 at 01:28
  • @JohannesGerer: I wouldn't jump to conclusions just yet. There may be a good reason it's generating poorer code that doesn't apply in other situations. – Ron Warholic Dec 21 '11 at 01:30
  • @KamyarSouri: Thanks for that hint: As expected the difference gets bigger: http://i.stack.imgur.com/BlGeJ.png – Johannes Gerer Dec 21 '11 at 01:31
  • Frankly, it would be a pretty poor compiler that couldn't out-inline a programmer doing it manually, in most cases. But two things to keep in mind: 1) The compiler may be limited by some perceived "threat" that, eg, one of the variables can be modified asynchronously or be aliased somehow or some such. 2) Sometimes it's just "bad luck" that code falls on a bad cache boundary or the memory queue gets bottlenecked, etc. A good scheduler can fix this, but it would help both compiler and hand-optimizer equally. – Hot Licks Dec 21 '11 at 01:45
  • @HotLicks: You are completely right, in general. Yet it seems that this is an example of such "poor" performance. What "threat" can you spot in this example? And 2): I didn't write assembly by hand, I just inlined the function by hand. It's still the compiler doing the compilation. It's not "hand-optimized". The compiler is losing against itself. – Johannes Gerer Dec 21 '11 at 01:53
  • Well, in this case it's apparently a M$ compiler, so that may be a big chunk of the problems. But it may be, as someone else hinted, that the compiler at least partially thinks it's in debug mode, or otherwise has optimization "dialed down". – Hot Licks Dec 21 '11 at 02:02
  • @HotLicks: The assembly suggests otherwise. Both versions are heavily optimized. – Johannes Gerer Dec 21 '11 at 02:06
  • Yeah, from Mystical's post it appears to be mostly "bad luck". There's a certain statistical nature to optimization -- the same optimization that "wins" 99% of the time can bite you 1% of the time. And being a little careless with that xor implementation usually wouldn't matter, but maybe does in this case (or the problem may be unrelated and due to slight cache boundary differences or some such). I've even seen cases where a program would run at different speeds when recompiled multiple times, just based on how it got mapped to memory. – Hot Licks Dec 21 '11 at 02:18
  • The question remains: after the compiler performs the inlining, it should have the exact same code in its internal representation. Then it should optimize this "identical code". Instead it inserts a (redundant) assembly instruction.
  • @JohannesGerer One thing for sure is that getting rid of that extra instruction will require a very specific [instruction-combining](http://www.compileroptimizations.com/category/instruction_combining.htm) pass in the compiler. [Dead-code Elimination](http://en.wikipedia.org/wiki/Dead_code_elimination) won't get rid of it. As for why that zero-extend instruction was generated in the first place, you'd have to ask the MSVC developers. And they aren't likely to tell you because it's proprietary. – Mysticial Dec 21 '11 at 03:06
  • Has anyone checked gcc's output yet? – Johannes Gerer Dec 21 '11 at 03:18
  • @JohannesGerer I think I figured it out. Updating answer in a bit. – Mysticial Dec 21 '11 at 03:21

2 Answers


Short Answer:

Your asd array is declared as this:

int *asd=new int[16];

Therefore, use int as the return type rather than bool.
Alternatively, change the array type to bool.

In any case, make the return type of the test function match the type of the array.
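
For instance, a sketch of the int-returning variant (the rest of the struct as in the question):

struct ASD {
    float e;
    // The return type now matches the int elements of asd, so there is
    // no 8-bit bool left for the compiler to widen on the store:
    int test(const ASD& y) const { return e < y.e; }
};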

Skip to bottom for more details.

Long Answer:

In the manually inlined version, the "core" of one iteration looks like this:

xor         eax,eax  
 
mov         edx,ecx  
and         edx,0Fh  
mov         dword ptr [ebp+edx*4],eax  
mov         eax,dword ptr [esp+1Ch]  
movss       xmm0,dword ptr [eax]  
movss       xmm1,dword ptr [edi]  
cvtps2pd    xmm0,xmm0  
cvtps2pd    xmm1,xmm1  
comisd      xmm1,xmm0  

The compiler-inlined version is completely identical except for the first instruction.

Where instead of:

xor         eax,eax

it has:

xor         eax,eax  
movzx       edx,al

Okay, so it's one extra instruction. Both do the same thing: zero a register. This is the only difference that I see...

The movzx instruction has a single-cycle latency and 0.33 cycle reciprocal throughput on all the newer architectures. So I can't imagine how this could make a 10% difference.

In both cases, the result of the zeroing is used only 3 instructions later. So it's very possible that this could be on the critical path of execution.


While I'm not an Intel engineer, here's my guess:

Most modern processors deal with zeroing operations (such as xor eax,eax) via register renaming to a bank of zero registers. It completely bypasses the execution units. However, it's possible that this special handling could cause a pipeline bubble when the (partial) register is accessed via movzx edx,al.

Furthermore, there's also a false dependency on eax in the compiler-inlined version:

movzx       edx,al  
mov         eax,ecx  //  False dependency on "eax".

Whether or not the out-of-order execution is able to resolve this is beyond me.


Okay, this is basically turning into a question of reverse-engineering the MSVC compiler...

Here I'll try to explain why that extra movzx is generated, as well as why it stays.

The key here is the bool return value. Apparently, bool datatypes are stored as 8-bit values inside MSVC's internal representation. Therefore, when you implicitly convert from bool to int here:

asd[j%16] = a.test(b);
^^^^^^^^^   ^^^^^^^^^
 type int   type bool

there is an 8-bit -> 32-bit integer promotion. This is the reason why MSVC generates the movzx instruction.

When the inlining is done manually, the compiler has enough information to optimize out this conversion and keep everything as a 32-bit datatype in its IR.

However, when the code is put into its own function with a bool return value, the compiler is not able to optimize out the 8-bit intermediate datatype. Therefore, the movzx stays.

When you make both datatypes the same (either int or bool), no conversion is needed. Hence the problem is avoided altogether.
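
To see the mechanism in isolation, here is a minimal reduction of the two cases (my own sketch, not the OP's code; the function names are made up):

struct ASD { float e; };

bool test_bool(const ASD& a, const ASD& b) { return a.e < b.e; }
int  test_int (const ASD& a, const ASD& b) { return a.e < b.e; }

void fill_bool(int* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = test_bool(a, b);  // 8-bit bool widened to a 32-bit int:
                                        // the conversion the movzx implements
}

void fill_int(int* asd, const ASD& a, const ASD& b, int n) {
    for (int j = n; --j > 0;)
        asd[j % 16] = test_int(a, b);   // 32-bit end to end: nothing to widen
}

Whether a particular MSVC version drops the movzx after inlining in the second case is exactly what's described above; the sketch only shows where the 8-bit intermediate comes from.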

Mysticial
  • Sorry, I just found the same instruction after updating the program. Well, as you know the art of writing stellar answers, this should hurt your answer. – Johannes Gerer Dec 21 '11 at 02:07
  • Actually, the same thing holds. Replacing the `%10` with `%16` only gets rid of the multiply and shift logic. The `movzx` is still there - only in compiler inlined version. – Mysticial Dec 21 '11 at 02:10
  • Come to think of it, I think your edit has strengthened my answer... It removes the noise of the `%10` and adds a new reason for a stall. – Mysticial Dec 21 '11 at 02:17
  • You solved it! You might call this a higher form of return value optimization (which MSVC is missing here). Before I posted this strange code, the only cause I could actually imagine was the `bool` -> `int` conversion, but I immediately discarded this idea. – Johannes Gerer Dec 21 '11 at 04:11
  • Yeah, that was another fun question to answer. Hacking into the compiler and assembly... – Mysticial Dec 21 '11 at 04:21
  • Those extra totally-redundant `movzx` instructions change the alignment of the rest of the code. Since this loop is more than 28 uops, it doesn't fit in the loop buffer on Nehalem or SnB. (I count it at more like 47 uops, assuming the stores with two-register addressing modes can't micro-fuse. So it could fit in the loop buffer of a Haswell with hyperthreading disabled.) On Nehalem, that means the decoders come into play. On Intel SnB and later, alignment affects the uop cache. The OP didn't say what CPU this was tested on, but 10% is easily believable. It's a frontend bottleneck. – Peter Cordes Nov 16 '15 at 02:37
  • And just for the record, that's a *really* bad implementation of this function. The `movzx` into a register that's about to be overwritten without being read is really dumb. Using `jbe / mov reg, 1 / jmp / xor reg,reg` instead of `setbe dl` is also *really* dumb. If it's going to unroll, it should drop some of the per-iteration increment/decrement and use the same addresses with different offsets. Err, actually it's the same scalar test every time? Pull it out of the loop... IDK what it's doing reloading addresses *from* the stack either. – Peter Cordes Nov 16 '15 at 02:41
  • And since nobody linked it on this question, **essential reading**: http://agner.org/optimize/ – Peter Cordes Nov 16 '15 at 02:49

lea esp,[esp] occupies 7 bytes of i-cache and it's inside the loop. A few other clues make it look like the compiler isn't sure if this is a release build or a debug build.

Edit:

The lea esp,[esp] isn't in the loop. Its position among the surrounding instructions misled me. Now it looks like it intentionally wasted 7 bytes, followed by another wasted 2 bytes, in order to start the actual loop at a 16-byte boundary. Which means that this actually speeds things up, as observed by Johannes Gerer.

The compiler still seems to be uncertain whether this is a debug or release build though.

Another edit:

The pastebin diff is different from the pastebin diff that I saw earlier. This answer could be deleted now, but it already has comments so I'll leave it.

Windows programmer