
I have the following code and expect the intrinsic version of the exp() function to be used. Unfortunately, it is not used in an x64 build, which makes that build slower than a similar Win32 (i.e., 32-bit) build:

#include "stdafx.h"
#include <cmath>
#include <intrin.h>
#include <iostream>

int main()
{
  const int NUM_ITERATIONS=10000000;
  double expNum=0.00001;
  double result=0.0;

  for (double i=0;i<NUM_ITERATIONS;++i)
  {
    result+=exp(expNum); // <-- The code of interest is here
    expNum+=0.00001;
  }

  // To prevent the above from getting optimized out...
  std::cout << result << '\n';
}

I am using the following switches for my build:

/Zi /nologo /W3 /WX-
/Ox /Ob2 /Oi /Ot /Oy /GL /D "WIN32" /D "NDEBUG" 
/D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- 
/EHsc /GS /Gy /arch:SSE2 /fp:fast /Zc:wchar_t /Zc:forScope 
/Yu"StdAfx.h" /Fp"x64\Release\exp.pch" /FAcs /Fa"x64\Release\" 
/Fo"x64\Release\" /Fd"x64\Release\vc100.pdb" /Gd /errorReport:queue 

As you can see, I do have /Oi, /Ox and /fp:fast, as required by the MSDN article on intrinsics. Yet, despite my efforts, a call to the standard library is made, making exp() slower on x64 builds.

Here is the generated assembly:

  for (double i=0;i<NUM_ITERATIONS;++i)
000000013F911030  movsd      xmm10,mmword ptr [__real@3ff0000000000000 (13F912248h)]  
000000013F911039  movapd     xmm8,xmm6  
000000013F91103E  movapd     xmm7,xmm9  
000000013F911043  movaps     xmmword ptr [rsp+20h],xmm11  
000000013F911049  movsd      xmm11,mmword ptr [__real@416312d000000000 (13F912240h)]  
  {
    result+=exp(expNum);
000000013F911052  movapd     xmm0,xmm7  
000000013F911056  call       exp (13F911A98h) // ***** exp lib call is here *****
000000013F91105B  addsd      xmm8,xmm10  
    expNum+=0.00001;
000000013F911060  addsd      xmm7,xmm9  
000000013F911065  comisd     xmm8,xmm11  
000000013F91106A  addsd      xmm6,xmm0  
000000013F91106E  jb         main+52h (13F911052h)  
  }

As you can see in the assembly above, there is a call out to the exp() function. Now, let's look at the code generated for that for loop with a 32-bit build:

  for (double i=0;i<NUM_ITERATIONS;++i)
00101031  xorps       xmm1,xmm1  
00101034  rdtsc  
00101036  push        ebx  
00101037  push        esi  
00101038  movsd       mmword ptr [esp+1Ch],xmm0  
0010103E  movsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101046  push        edi  
00101047  mov         ebx,eax  
00101049  mov         dword ptr [esp+3Ch],edx  
0010104D  movsd       mmword ptr [esp+28h],xmm0  
00101053  movsd       mmword ptr [esp+30h],xmm1  
00101059  lea         esp,[esp]  
  {
    result+=exp(expNum);
00101060  call        __libm_sse2_exp (101EC0h) // <--- Quite different from 64-bit
00101065  addsd       xmm0,mmword ptr [esp+20h]  
0010106B  movsd       xmm1,mmword ptr [esp+30h]  
00101071  addsd       xmm1,mmword ptr [__real@3ff0000000000000 (102180h)]  
00101079  movsd       xmm2,mmword ptr [__real@416312d000000000 (102178h)]  
00101081  comisd      xmm2,xmm1  
00101085  movsd       mmword ptr [esp+20h],xmm0  
    expNum+=0.00001;
0010108B  movsd       xmm0,mmword ptr [esp+28h]  
00101091  addsd       xmm0,mmword ptr [__real@3ee4f8b588e368f1 (102188h)]  
00101099  movsd       mmword ptr [esp+28h],xmm0  
0010109F  movsd       mmword ptr [esp+30h],xmm1  
001010A5  ja          wmain+40h (101060h)  
  }

Much more code there, yet it's faster. A timing test I did on a 3.3 GHz Nehalem-EP host produced the following results:

32-bit:

For loop body average exec time: 34.849229 cycles / 10.560373 ns

64-bit:

For loop body average exec time: 45.845323 cycles / 13.892522 ns
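
The harness that produced these numbers is not shown. As a rough sketch of how such a per-iteration figure could be obtained, the following uses `std::chrono::steady_clock` rather than the `rdtsc` instruction visible in the 32-bit listing, so it reports nanoseconds only, not cycles. The function name `average_ns_per_iteration` and its parameters are hypothetical and not part of the original test.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from the original post): runs the loop from the
// question and returns the average wall-clock time per iteration in ns.
// The accumulated sum is written to *sum_out so that the optimizer cannot
// discard the exp() calls.
double average_ns_per_iteration(std::size_t iterations, double* sum_out)
{
    assert(iterations > 0 && sum_out != nullptr);

    double expNum = 0.00001;
    double result = 0.0;

    const auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i)
    {
        result += std::exp(expNum);  // the call under test
        expNum += 0.00001;
    }
    const auto stop = std::chrono::steady_clock::now();

    *sum_out = result;
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / static_cast<double>(iterations);
}
```

Calling it with the question's 10000000 iterations and printing both the return value and the sum gives a figure comparable in spirit to the ones above, although loop overhead is included in the average.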

Very odd behavior, indeed. Why is it happening?

Update:

I have created a Microsoft Connect bug report. Feel free to upvote it to get an authoritative answer from Microsoft itself on the use of floating point intrinsics, especially in x64 code.

Michael Goldshteyn
  • [This article](http://blogs.msdn.com/b/ricom/archive/2009/06/10/visual-studio-why-is-there-no-64-bit-version.aspx) (explaining why VS does not have a 64-bit version) points out that a 64-bit build can be slower than a 32-bit one. I do not know whether that explanation applies to your specific case, though. – Attila Apr 10 '12 at 20:01
  • That article is about a 64-bit version of Visual Studio itself; it has nothing to do with the question posed. There are many factors that can make a 64-bit application slower than a 32-bit one. Unless I am missing something, however, none of these factors has anything to do with my question about floating point computation. – Michael Goldshteyn Apr 10 '12 at 20:04
  • GregC, removing /D "WIN32" had no effect on the generated code. – Michael Goldshteyn Apr 10 '12 at 20:11
  • @GregC, regarding your link to software.intel.com..., we are not using the SVML library in our projects, so no, I haven't. I am just trying to get the build to live up to Microsoft's "guarantees" based on MSDN. – Michael Goldshteyn Apr 10 '12 at 20:12
  • Have you tried putting `#pragma intrinsic(exp)` after your `#include`s? Also, try including `math.h` rather than `cmath`. – ildjarn Apr 10 '12 at 20:54
  • Changing the include file made no difference. Adding `#pragma intrinsic(exp)` only gave me the warning: `exp.cpp(7): warning C4164: 'exp' : intrinsic function not declared` – Michael Goldshteyn Apr 10 '12 at 20:58
  • That warning may be key here, and if you can get the right configuration/set of includes to make that warning go away you'll be on the right track. I'll investigate later tonight if you can't find anything. – ildjarn Apr 10 '12 at 21:15
  • @ildjarn, sadly, the thing it's key to is that the MSDN article entry for intrinsic functions using SSE2 is "full of it." – Michael Goldshteyn Apr 10 '12 at 21:21
  • I'm not ruling out a compiler/stdlib bug, but it's possible that function is not eligible to be an intrinsic due to some weird configuration issue. :-] – ildjarn Apr 10 '12 at 21:26
  • In the future, include StackOverflow links in Connect bug reports. Many Microsoft compiler engineers like seeing and participating in the existing discussion. – Ben Voigt Apr 11 '12 at 15:24
  • @BenVoigt Cross-linked, too easy. Wish this was too easy :) – GregC Apr 11 '12 at 16:26

3 Answers


On x64, floating point arithmetic is performed using SSE, which has no built-in instruction for exp(). A call to the standard library is therefore inevitable unless you write your own inline, manually vectorized __m128d exp(__m128d) (see "Fastest Implementation of Exponential Function Using SSE").
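
To make the idea of a hand-written `__m128d exp(__m128d)` concrete, here is a minimal sketch under stated assumptions. It is not Microsoft's implementation and not the code from the linked question. The name `exp_sse2_sketch`, the degree-11 Taylor polynomial and the single-constant range reduction are illustrative choices, and the function assumes finite inputs in roughly [-700, 700] with the default round-to-nearest mode. Accuracy is noticeably below that of a tuned library routine for large |x|.

```cpp
#include <cassert>
#include <emmintrin.h>  // SSE2 intrinsics

// Illustrative sketch only: exp() for two doubles at once.
// Assumes finite inputs in about [-700, 700]; no NaN/Inf/overflow handling.
inline __m128d exp_sse2_sketch(__m128d x)
{
    // Debug-only range check on both lanes.
    assert(_mm_movemask_pd(_mm_or_pd(
               _mm_cmpgt_pd(x, _mm_set1_pd(700.0)),
               _mm_cmplt_pd(x, _mm_set1_pd(-700.0)))) == 0);

    const __m128d log2e = _mm_set1_pd(1.4426950408889634);  // 1 / ln(2)
    const __m128d ln2   = _mm_set1_pd(0.6931471805599453);

    // Range reduction: x = k*ln(2) + r with |r| <= ln(2)/2 (approximately).
    const __m128i k  = _mm_cvtpd_epi32(_mm_mul_pd(x, log2e));  // rounds to nearest
    const __m128d kd = _mm_cvtepi32_pd(k);
    const __m128d r  = _mm_sub_pd(x, _mm_mul_pd(kd, ln2));

    // exp(r) as a degree-11 Taylor polynomial, evaluated with Horner's rule.
    // Coefficients are 1/11!, 1/10!, ..., 1/1!, 1/0!.
    const double c[12] = { 1.0 / 39916800, 1.0 / 3628800, 1.0 / 362880,
                           1.0 / 40320,    1.0 / 5040,    1.0 / 720,
                           1.0 / 120,      1.0 / 24,      1.0 / 6,
                           0.5,            1.0,           1.0 };
    __m128d p = _mm_set1_pd(c[0]);
    for (int i = 1; i < 12; ++i)
        p = _mm_add_pd(_mm_mul_pd(p, r), _mm_set1_pd(c[i]));

    // Build 2^k by placing (k + 1023) in the exponent field of each double.
    const __m128i biased = _mm_add_epi32(k, _mm_set1_epi32(1023));
    const __m128i wide   = _mm_unpacklo_epi32(biased, _mm_setzero_si128());
    const __m128d two_k  = _mm_castsi128_pd(_mm_slli_epi64(wide, 52));

    return _mm_mul_pd(p, two_k);  // exp(x) = exp(r) * 2^k
}
```

Because the function is `inline` and takes its operands in SSE registers, the compiler can fold it into the loop body instead of emitting a `call`. Whether that beats the library routine is something only a measurement can show.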

I imagine that the MSDN article you are referring to was written with 32-bit code that uses 8087 FP in mind.

Peter Cordes
David Heffernan
  • Please see my edited question which includes the code generated by a 32-bit build and a timing comparison of 32-bit vs. 64-bit. Neither build is using a "true" intrinsic, but there are differences in the function that is called and the 32-bit build is significantly faster. – Michael Goldshteyn Apr 10 '12 at 20:36
  • Well, maybe, but the fact remains that there is no exp instruction among the SSE opcodes. – David Heffernan Apr 10 '12 at 20:39
  • That is true, but per the MSDN documentation I was expecting an intrinsic implementation of exp() to be inlined in my (assembly) code. – Michael Goldshteyn Apr 10 '12 at 20:40
  • I bet that documentation simply has not been updated to account for SSE codegen. And I suspect that if you remove /arch:sse2 from your options and target 8087 FPU then you will see the intrinsic call being made. – David Heffernan Apr 10 '12 at 20:43
  • Removing SSE2 from the 32-bit build does produce completely different code, which uses 8087 "f" instructions and I do not see any `exp()` lib call. The code is almost three times slower, though. It does seem that you are on to something there, however. For 64-bit builds, it is impossible to disable the use of SSE2 in the compiler, since all 64-bit processors must support it. Therefore, there is no change in the generated (assembly) code. – Michael Goldshteyn Apr 10 '12 at 20:48
  • Then you have these sorts of statements on MSDN (See the Remarks section), that totally contradict the inability to insert an SSE2 enhanced version of `exp()`: http://msdn.microsoft.com/en-us/library/bthd138d.aspx – Michael Goldshteyn Apr 10 '12 at 20:55
  • Yes, that's exactly what I would expect. And I'm here referring to the 8087 comment. – David Heffernan Apr 10 '12 at 20:56

I think the only reason that Microsoft provides an intrinsic version of 32-bit SSE2 exp() is the standard calling conventions. The 32-bit calling conventions require the operand to be pushed on the main stack, and the result to be returned in the top register of the FPU stack. If you have SSE2 code generation enabled, then the return value is likely to be popped from the FPU stack into memory, then loaded from that location into an SSE2 register for whatever maths you want to do on the result. Clearly, it is faster to pass the operand in an SSE2 register and return the result in an SSE2 register. This is what __libm_sse2_exp() does. In 64-bit code, the standard calling convention passes the operand and returns the result in SSE2 registers anyway, so there is no advantage in having an intrinsic version.

The reason for the performance difference between the 32-bit SSE2 and 64-bit implementations of exp() is that Microsoft uses different algorithms in the two implementations. I have no idea why they do this. The two also produce different results (differing by 1 ULP) for some operands.

dc42

EDIT: I'd like to add to this discussion links to AMD's x64 instruction set manuals and Intel's reference.

At first glance, there should be a way to use F2XM1, which computes 2^x - 1, to compute the exponential. However, it belongs to the x87 instruction set, which MSVC does not generate for x64 targets.
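
As a sketch of how F2XM1 leads to exp(), the classic x87 recipe rewrites exp(x) as 2^(x·log2 e), splits the exponent into an integer part n and a fraction f, obtains 2^f from F2XM1, and scales by 2^n with FSCALE. The portable C++ below only emulates those steps with `<cmath>` calls so the arithmetic can be followed. It does not execute any x87 instruction, and the function names `f2xm1_emulated` and `exp_via_f2xm1` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Stand-in for the x87 F2XM1 instruction: returns 2^f - 1.
// The real instruction only accepts -1 <= f <= +1.
double f2xm1_emulated(double f)
{
    assert(f >= -1.0 && f <= 1.0);
    return std::expm1(f * 0.6931471805599453);  // 2^f - 1 = e^(f*ln 2) - 1
}

// exp(x) rebuilt from the steps an x87 sequence would perform.
double exp_via_f2xm1(double x)
{
    const double t = x * 1.4426950408889634;        // FLDL2E, FMUL: x * log2(e)
    const double n = std::nearbyint(t);             // FRNDINT: integer part
    const double f = t - n;                         // fraction, |f| <= 0.5
    const double two_f = f2xm1_emulated(f) + 1.0;   // F2XM1, then add 1
    return std::ldexp(two_f, static_cast<int>(n));  // FSCALE: multiply by 2^n
}
```

Real x87 code carries the intermediate `t` in 80-bit extended precision, so this double-precision emulation loses a little accuracy as |x| grows.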

There's hope in using MMX/x87 explicitly, as described in a posting on VirtualDub discussion boards. And, this is how to actually write asm in VC++.

GregC