
The code:

double Ret_Value=0;

With default settings, VS2012 compiles this to:

10112128  xorps       xmm0,xmm0  
1011212E  movsd       mmword ptr [Ret_Value],xmm0

If SSE2 is disabled in the project settings, it is compiled to:

101102AC  fldz  
101102AE  lea         eax,[Ret_Value]  
101102B1  push        eax  
101102B2  fstp        qword ptr [Ret_Value] 

Edit: I am not sure that `push` and `lea` are related to this initialization. They may belong to code that runs afterwards; the disassembly view just attributes them to this line of C++.

Is the SSE2 version significantly better, apart from being two instructions shorter? What kind of optimization is being done here?

How this was discovered: the app started to fail on an old processor which doesn't support SSE2.
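
A failure like this can be turned into a clear error message by testing for SSE2 before any SSE2 code runs. The following is a minimal sketch, not the code from the app: `cpu_has_sse2` is a hypothetical helper name, and the sketch assumes either MSVC's `__cpuid` intrinsic or the GCC/Clang `__builtin_cpu_supports` builtin on an x86 target.

```cpp
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>  // __cpuid (MSVC)
#endif

// Hypothetical helper: returns true if the CPU reports SSE2 support.
// On non-x86 targets or unknown compilers it conservatively returns false.
bool cpu_has_sse2() {
#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
    int regs[4] = {0, 0, 0, 0};        // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);                  // CPUID leaf 1: feature flags
    return (regs[3] & (1 << 26)) != 0; // EDX bit 26 is the SSE2 flag
#elif defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();              // safe to call more than once
    return __builtin_cpu_supports("sse2") != 0;
#else
    return false;
#endif
}
```

The check only helps if it executes before the first SSE2 instruction, so it would probably have to live in code that is itself built without SSE2.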

phuclv
Andrey
  • I think it is a simple code size optimization. Shorter code is more cache friendly, and so faster. Moreover, not using the stack saves a couple of memory accesses. – rodrigo Jan 23 '15 at 10:53
  • Not sure what the middle 2 instructions are supposed to do. Just to zero the `Ret_Value` all you need is `fldz; fstp [Ret_Value]`. – Jester Jan 23 '15 at 10:54
  • @Jester: Probably it is a pipeline re-arrangement (intermix floating and non-floating instructions) because the `eax` in the stack will be needed later, I guess that for a function call or something like that. – rodrigo Jan 23 '15 at 10:55
  • @Jester yeah indeed lea and push seem unnecessary, I will look for more code around it. – Andrey Jan 23 '15 at 10:56
  • @rodrigo pipeline will rearrange what it needs and intermix what is necessary on the go. – Andrey Jan 23 '15 at 10:56
  • @rodrigo yes but then similar code must exist for the SSE version too, not making it shorter ;) – Jester Jan 23 '15 at 10:56
  • @Andrey is this 32-bit or 64-bit code? By the looks of `push eax` it's 32-bit. Also, SSE code has higher throughput than x87 code. – Iwillnotexist Idonotexist Jan 23 '15 at 10:57
  • @IwillnotexistIdonotexist it is 32bit. – Andrey Jan 23 '15 at 11:02
  • @Andrey then this is somewhat easy to explain. A double is 64 bits, and the widest x86-32 scalar register is, well, 32 bits. It is impossible to clear such a scalar register to 0 and write from it twice to a double's memory location in under 3 instructions, and besides, this would be a non-atomic access. The XMM registers, on the other hand, are 128 bits wide and can be cleared entirely in one instruction (`xorps`), and there is an instruction (`movsd`) which takes the lower 64 bits of the register and stores them to the target memory location in one go (2 instructions in total). And again, SSE has higher throughput. – Iwillnotexist Idonotexist Jan 23 '15 at 11:15
  • Check [this post](http://stackoverflow.com/a/14865279/17034), it explains why the FPU is evil and Intel decided to replace it. It took a while before compiler writers followed suit, Microsoft was the last mainstream one. – Hans Passant Jan 23 '15 at 13:52
  • MSVC with /arch:IA32 will not generate any SSE instructions for 32-bit code. MSVC x64 native codegen uses SSE/SSE2 and no x87 instructions. MSVC 32-bit using /arch:SSE or /arch:SSE2 tries to use SSE instructions and other x64 native codegen machinery. As of VS 2012, the 32-bit compiler defaults to /arch:SSE2; Windows 8.0 explicitly requires SSE/SSE2 support for x86 (32-bit), and it's already required for all Windows x64 processors. – Chuck Walbourn Jan 28 '15 at 08:06
  • SSE2 is baseline for x86-64, regardless of Windows, in case anyone was wondering. (Of course, in 64-bit code, you could also `push qword 0` or `mov qword ptr [rsp+8], 0` to store 8 bytes of zeros in a single instruction.) – Peter Cordes Feb 17 '22 at 20:58

1 Answer


The Intel Optimization Reference Manual, section 3.8.1 (Guidelines for Optimizing Floating-point Code), says:

Enable the compiler’s use of SSE, SSE2 and more advanced SIMD instruction sets (e.g. AVX) with appropriate switches. Favor scalar SIMD code generation to replace x87 code generation.

Section 3.8.5 goes on to explain:

Use Streaming SIMD Extensions 2 or Streaming SIMD Extensions unless you need an x87 feature. Most SSE2 arithmetic operations have shorter latency than their X87 counterpart and they eliminate the overhead associated with the management of the X87 register stack.

Ian Cook