Visual C++ Math functions run faster in Win32

Question

There's something I've noticed when profiling a VS2013 program with VTune that I haven't been able to find a proper explanation for.

Essentially, some of the non-negligible hotspots in the program are the mathematical functions: cos, sin, sqrt, etc. We do a lot of math calculations.

However, in the Win32 build, the functions that get called are libm_sse2_cosprecise, libm_sse2_sqrtprecise, etc., whereas in the x64 build, cos, sin, etc. are called. It turns out that all the mathematical functions are slower in x64 than in Win32.

Compiler settings are nearly identical between the 32-bit and 64-bit builds; the only differences should have no impact on math functions. We use double precision most of the time.

So why are the 64-bit math functions slower? Why isn't this documented somewhere? Did I miss anything? Am I not understanding something properly?

The only information I could really find on intrinsics for math functions in Microsoft's documentation (which is rather lacking on this topic, I feel), is this page

Since we use the default floating-point accuracy (which should be fp-precise), this wouldn't apply in our case, even with /O2 enabled, if I understood correctly. (Enabling intrinsic functions is off in our latest tests, btw). Besides, it wouldn't explain why 32 and 64-bit outputs differ...

Any information or pointers would be appreciated.
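
For anyone who wants to reproduce the comparison outside VTune, a minimal timing sketch could look like the following. It is only a sketch under stated assumptions: the helper name `time_calls`, the starting argument, and the step size are illustrative choices, not taken from the project described above. The same source would need to be built once as Win32 and once as x64, with the same /fp and /O settings.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical helper (not from the original project): times `iterations`
// calls of `f` and returns the elapsed wall-clock time in seconds.
// The accumulated sum is written to `*sink` so the optimizer cannot
// discard the calls as dead code.
template <typename F>
double time_calls(F f, int iterations, double* sink) {
    assert(iterations > 0 && sink != nullptr);
    typedef std::chrono::steady_clock clock_type;

    double sum = 0.0;
    double x = 0.5;  // arbitrary starting argument (assumption)
    const clock_type::time_point start = clock_type::now();
    for (int i = 0; i < iterations; ++i) {
        sum += f(x);
        x += 1e-6;  // vary the argument so the call cannot be hoisted
    }
    const clock_type::time_point stop = clock_type::now();

    *sink = sum;
    return std::chrono::duration<double>(stop - start).count();
}
```

A call such as `time_calls([](double v) { return std::cos(v); }, 10000000, &sink)` returns the seconds spent in the loop; repeating it for `std::sin` and `std::sqrt` in both builds gives per-function ratios that can be compared with what the profiler reports. Wall-clock timing like this is coarser than core-cycle counters, so it is a sanity check rather than a replacement for the profiler.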

To back up your claim, you will need to be clearer about the exact functions you used. Did you use libm_sse2_cosprecise on both? Also remember SSE2 does not use the 32/64-bit registers. In fact both x86 and x64 processors use a separate Floating Point Unit that does that kind of math for them anyway, so even normal floating-point operations hardly rely on the register width (until it comes to moving). — Nonanon, Nov 23 '16 at 10:15
How exactly did you do your testing? Also, since it might be relevant, what CPU hardware does your test machine have? e.g. Intel Haswell (i3/5/7 - 4xxx)? What performance ratio did you find? (or better, what were the exact CPU cycle counts per call for each case, as measured with the core_cycles performance counter (not reference cycles with RDTSC)?) — Peter Cordes, Nov 23 '16 at 10:17
I can see [VS2015 emits `__libm_sse2_cos_precise` in 32-bit mode and `cos` in 64-bit mode](https://godbolt.org/g/GcUqlZ) — phuclv, Nov 23 '16 at 10:34
try switching to "fast" fp math mode instead of precise to force SSE to be used. also set SSE explicitly as the instruction set to be used. — Sven Nilsson, Nov 23 '16 at 10:47
I seem to be unable to respond directly to some of the questions, so will answer all of them here. math.h is included, and the math functions are called using their common name (cos, sin, sqrt, etc.). No explicit call to libm_xxx was ever made in our code. Profiling was done with VTune amplifier. Performance ratio is approximately 1/2. I have a test case in which .46 seconds are spent in libm_sse2_cosprecise in 32-bit but 1.1 seconds are spent in cos in 64-bit. Switching to fast fp math is not an option for us, and it doesn't provide any explanation for the differences between builds. — spliblib, Nov 23 '16 at 10:49
To notify people when you reply to them, include @PeterCordes in your comment, for example. (You always get notified of comments under your post, so I don't have to @ spliblib you). — Peter Cordes, Nov 23 '16 at 11:00
Do you see the same difference for `sqrt`? It's implemented in hardware in SSE2, so there's nothing for a math-library routine to do except run [`sqrtsd xmm0, xmm0`](http://www.felixcloutier.com/x86/SQRTSS.html) (and check for NaN + set `errno`, stupid standard). — Peter Cordes, Nov 23 '16 at 11:03
Can you confirm that the difference isn't due to cache misses? Performance-counter cycles are usually attributed to the instruction that has to wait for data, not the instruction that was slow to produce it. So instead of seeing a high cycle count on load instructions, you might be seeing a high cycle count on the instructions inside a math library (if your test case loads data from memory and passes it directly to a math function). Larger pointers -> more cache misses is unlikely to explain an exact factor-of-two difference that's consistent across different functions, though. — Peter Cordes, Nov 23 '16 at 11:06
@PeterCordes It is the same for sqrt (see Lưu Vĩnh Phúc's link in his comment as well, can't seem to @ his name). __libm_sse2_sqrt_precise is called. I don't think there are cache misses. No LLC cache misses are reported by VTune for the two functions that call cos the most in 64-bit. They do a few stores, but the parameter that is given to cos will in just about every situation I can think of have gone through at least a couple of function calls beforehand and will not require a cache miss. — spliblib, Nov 23 '16 at 11:24
So the *performance* difference is the same for sqrt? Not just the function name? Is there any obvious difference when you single-step through a call in assembly? I don't have MSVC or Windows; I've only single-stepped into library functions on GNU/Linux. (btw, @ username stuff auto-completes, so just do `@L` and hit tab. The second character of that name is non-ASCII, so only type the `l`.) — Peter Cordes, Nov 23 '16 at 11:38
Very different library implementations, little reason to assume that perf will be comparable. And perhaps the problem is that the SSE2 flavor does not suck enough. You can declare `extern "C" int __use_fma3_lib;` and force it to 0 to get another data point, it won't use FMA3 instructions anymore. But you'll have to compile with /MT to get it to link. — Hans Passant, Nov 23 '16 at 11:56
@HansPassant Could you expand on that? Why does MSVC switch between these implementations for 32-bit and 64-bit builds, without my asking and without any indication in its documentation? I thought SSE2 was enabled by default for x64. I just want the x64 build to run faster and I don't fully understand the differences between the builds here. — spliblib, Nov 23 '16 at 13:11
@PeterCordes Both functions call sqrtsd at some point (the assembly is also in the link in Lưu Vĩnh Phúc's comment). And yes, the performance difference is similar. — spliblib, Nov 23 '16 at 13:11
Oh neat, I didn't know there was a beta version of Godbolt with CL19! But that's just the asm at the call-site, not the asm for the library implementation, or any DLL-call overhead. Possible differences in calling DLL functions in 32 vs. 64-bit is one reason I suggested single-stepping into it, not just looking at disassembly output for the call-site and the library. (But also because it can be hard to find the right place in a library, if it uses any weird symbol tricks to select versions at dynamic-link time.) — Peter Cordes, Nov 23 '16 at 13:17
Nothing much to expand on, these libraries are owned by Intel and Microsoft could not get a source license for them. So it is all a black box with no details beyond what you can disassemble. Typical Intel btw, their business practices kinda suck. — Hans Passant, Nov 23 '16 at 13:19
"Switching to fast fp math is not an option for us"... do you realize that "fast" FP math means SSE math? You seem to be asking why SSE is faster, but you don't want to use SSE. — Sven Nilsson, Nov 23 '16 at 13:28
@SvenNilsson do you realize that [`fp:fast` means a completely different thing](http://stackoverflow.com/q/6889522/995714) and is an equivalent to [`-ffast-math` in gcc](http://stackoverflow.com/q/26450193/995714) https://msdn.microsoft.com/en-us/library/e7s85ffb.aspx [Do You Prefer Fast or Precise?](https://blogs.msdn.microsoft.com/vcblog/2015/10/19/do-you-prefer-fast-or-precise/) — phuclv, Nov 23 '16 at 15:40
don't know why `sqrtsd` is not used. It seems `__libm_sse2_sqrt_precise` first appeared in VS2012 [`sqrt()` code optimization is twice faster on VS2008/2010 than VS2012 (SSE)](https://social.msdn.microsoft.com/Forums/vstudio/en-US/ebab293c-0c85-462e-a352-22ff8ee55c36/sqrt-code-optimization-is-twice-faster-on-vs20082010-than-vs2012-sse?forum=vcgeneral) @spliblib because you can only tag one person in one comment — phuclv, Nov 23 '16 at 15:40
some information on `__libm_sse2_sqrt_precise` implementation http://marc.info/?l=wine-cvs&m=140053068806707&w=2 http://en.stack.aiseen.org/questions/15779156/strange-fp-floating-point-model-flag-behavior — phuclv, Nov 23 '16 at 15:51
@Lưu Vĩnh Phúc thanks for the info. It appears that /fp:fast only increases the chances that SSE will be used. It is still up to the compiler, and VS2010 in particular is a devil when it comes to sneaking in X87 math everywhere, unless you build 64-bit binaries. — Sven Nilsson, Nov 24 '16 at 09:45
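
The `sqrt` point raised in the comments (the hardware does the work in a single `sqrtsd`, so most of the remaining cost would be the library wrapper) can be probed with a small sketch like the one below. It is an illustration under assumptions, not the CRT's implementation: the function name `sqrt_sse2` and the `HAVE_SSE2_SQRT` macro are hypothetical, and unlike the library routine this version never sets `errno`.

```cpp
#include <cmath>

// Detect SSE2 support at compile time for MSVC (x64, or x86 with /arch:SSE2)
// and for GCC/Clang. HAVE_SSE2_SQRT is a hypothetical macro for this sketch.
#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define HAVE_SSE2_SQRT 1
#else
#define HAVE_SSE2_SQRT 0
#endif

// Hypothetical helper: computes sqrt(x) with the SSE2 scalar instruction
// where available, bypassing the math library's sqrt entry point.
// Negative inputs yield NaN; errno is not touched.
inline double sqrt_sse2(double x) {
#if HAVE_SSE2_SQRT
    const __m128d v = _mm_set_sd(x);        // low lane = x, high lane = 0
    const __m128d r = _mm_sqrt_sd(v, v);    // sqrt of the low lane only
    double out;
    _mm_store_sd(&out, r);                  // copy the low lane back out
    return out;
#else
    return std::sqrt(x);                    // fallback on non-SSE2 targets
#endif
}
```

Timing `sqrt_sse2` against `sqrt` in both the Win32 and x64 builds would give one more data point: if the gap between the builds disappears with the intrinsic, the difference is likely in the library wrappers rather than in the instruction itself.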


0 Answers