Background
The following critical loop in a piece of numerical software, written in C++, compares two objects by one of their members:
for(int j=n;--j>0;)
asd[j%16]=a.e<b.e;
a and b are of class ASD:
struct ASD {
float e;
...
};
I was investigating the effect of putting this comparison in a lightweight member function:
bool test(const ASD& y)const {
return e<y.e;
}
and using it like this:
for(int j=n;--j>0;)
asd[j%16]=a.test(b);
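For reference, the two variants can be placed side by side in one self-contained sketch. This is not the original test code (that is linked under Material). The function names run_manual and run_member, the bool output buffer, and the reduction of ASD to its single member e are assumptions made for illustration.

```cpp
// Hypothetical stand-in for ASD: only the member e and test() come from the question.
struct ASD {
    float e;

    // Lightweight member comparison, as in the question.
    bool test(const ASD& y) const {
        return e < y.e;
    }
};

// Variant 1: the comparison is written out directly in the loop.
// out must point to at least 16 elements (assumed to be bool here).
void run_manual(const ASD& a, const ASD& b, bool* out, int n) {
    for (int j = n; --j > 0;)
        out[j % 16] = a.e < b.e;
}

// Variant 2: the same comparison goes through the member function,
// which the compiler is expected to inline.
void run_member(const ASD& a, const ASD& b, bool* out, int n) {
    for (int j = n; --j > 0;)
        out[j % 16] = a.test(b);
}
```

Both functions should produce identical results. Any difference in timing would then come from the generated code, not from the semantics.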
The compiler inlines this function, but the resulting assembly code is different and causes more than 10% runtime overhead. I have two questions:
Questions
Why is the compiler producing different assembly code?
Why is the produced assembly slower?
EDIT: The second question has been answered by implementing @KamyarSouri's suggestion (j%16). The assembly code now looks almost identical (see http://pastebin.com/diff.php?i=yqXedtPm). The only differences are on lines 18, 33, and 48:
000646F9 movzx edx,dl
Material
- The test code: http://pastebin.com/03s3Kvry
- The assembly output on MSVC10 with /Ox /Ob2 /Ot /arch:SSE2:
- Compiler inlined version: http://pastebin.com/yqXedtPm
- Manually inlined version: http://pastebin.com/pYSXL77f
- Difference: http://pastebin.com/diff.php?i=yqXedtPm
This chart shows the FLOP/s (up to a scaling factor) for 50 test runs of my code.
The gnuplot script to generate the plot: http://pastebin.com/8amNqya7
Compiler Options:
/Zi /W3 /WX- /MP /Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /MT /GS- /Gy /arch:SSE2 /fp:precise /Zc:wchar_t /Zc:forScope /Gd /analyze-
Linker Options: /INCREMENTAL:NO "kernel32.lib" "user32.lib" "gdi32.lib" "winspool.lib" "comdlg32.lib" "advapi32.lib" "shell32.lib" "ole32.lib" "oleaut32.lib" "uuid.lib" "odbc32.lib" "odbccp32.lib" /ALLOWISOLATION /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /SUBSYSTEM:CONSOLE /OPT:REF /OPT:ICF /LTCG /TLBID:1 /DYNAMICBASE /NXCOMPAT /MACHINE:X86 /ERRORREPORT:QUEUE