This is a question of curiosity more than anything else. I was looking at this code disassembly (C#, 64 bit, Release mode, VS 2012 RC):

            double a = 10d * Math.Log(20d, 2d);
000000c8  movsd       xmm1,mmword ptr [00000138h] 
000000d0  movsd       xmm0,mmword ptr [00000140h] 
000000d8  call        000000005EDC7F50 
000000dd  movsd       mmword ptr [rsp+58h],xmm0 
000000e3  movsd       xmm0,mmword ptr [rsp+58h] 
000000e9  mulsd       xmm0,mmword ptr [00000148h] 
000000f1  movsd       mmword ptr [rsp+30h],xmm0 
            a = Math.Pow(a, 6d);
000000f7  movsd       xmm1,mmword ptr [00000150h] 
000000ff  movsd       xmm0,mmword ptr [rsp+30h] 
00000105  call        000000005F758220 
0000010a  movsd       mmword ptr [rsp+60h],xmm0 
00000110  movsd       xmm0,mmword ptr [rsp+60h] 
00000116  movsd       mmword ptr [rsp+30h],xmm0 

... and found it odd that the compiler isn't using x87 instructions for the logs here (Pow is computed via logs). Of course, I have no idea what code is at the call targets, but I know that SIMD has no log instruction, which makes this choice all the more odd. Further, nothing is parallelized here, so why SIMD rather than plain x87?

On a lesser note, I also found it odd that the x87 FYL2X instruction, which is designed specifically for the case shown in the first line of code, isn't being used.
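For reference, FYL2X computes y * log2(x), which is exactly the shape of that first line (y = 10, x = 20). Below is a minimal C# sketch of the equivalence using the change-of-base identity; it is my own illustration and not a claim about how Math.Log(a, newBase) is actually implemented:

    using System;

    class Fyl2xShape
    {
        static void Main()
        {
            // What the first line of the snippet above computes.
            double viaBuiltIn = 10d * Math.Log(20d, 2d);

            // The y * log2(x) form that FYL2X evaluates in a single instruction,
            // written out via the change-of-base identity log2(x) = ln(x) / ln(2).
            double viaChangeOfBase = 10d * (Math.Log(20d) / Math.Log(2d));

            Console.WriteLine(viaBuiltIn);       // 43.2192809488736...
            Console.WriteLine(viaChangeOfBase);  // same value, possibly differing in the last bit
        }
    }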

Can anyone shed any light on this?

  • Compiler writers haven't gotten around to using that specific optimization? – nneonneo Sep 12 '12 at 01:19
  • It's pretty obvious why `FYL2X` isn't used in the first case; the instruction is extremely specific in its use-case, and probably doesn't have *exactly* the same semantics. – nneonneo Sep 12 '12 at 01:21
  • Probably, although that op was obviously created because it's such a common requirement. That aside, it doesn't explain why not x87. – IamIC Sep 12 '12 at 01:21
  • Possibly related answer: http://stackoverflow.com/a/8870593/516797 – Sean U Sep 12 '12 at 02:54
  • got no source so I'm going to leave this as a comment: I believe this is for performance reasons. SIMD is generally slightly faster at the cost of accuracy, and it also avoids the awkward shifting of numbers between x87 and SIMD registers in more complex code. x64 also has 16 XMM registers, double the amount of space the FPU co-processor has, which reduces register pressure and means normal register-allocation techniques can be applied instead of trying to coerce the stack-based x87 registers. – Necrolis Sep 12 '12 at 06:54
  • Using the legacy `fyl2x` etc. instructions would be a pessimization. Actually, SSE versions of these functions are more accurate _and_ faster nowadays. Intel has produced some benchmarks on this; I don't have the exact source handy. – Gunther Piez Sep 12 '12 at 08:30
  • The 64bit versions of the CLR never used x87, they've always used scalar SSE for everything. Conversely, the 32bit versions have always used mostly FPU code, using SSE (when available) only for certain casts. – harold Sep 12 '12 at 10:32
  • @harold, why is that? SSE is available to 32 bit code too. – IamIC Sep 12 '12 at 10:42
  • SSE doesn't have Log, Cos, Sin, etc. That would have to be done with Taylor sequences in SSE. Surely that's slower than with x87's built in ops? – IamIC Sep 12 '12 at 10:43
  • @IanC well, maybe not. x87's built-in transcendentals are quite slow, up to a hundred cycles even. And actually, there may be better ways than Taylor sequences: for example, sin and cos can be approximated by averaging a quadratic function and its square (after range reduction, which you'd have to do anyway), and you can play hacks with the exponent field to get a good initial approximation of a log (which you can then improve). – harold Sep 12 '12 at 11:06 (a rough sketch of that exponent-field trick follows this comment thread)
  • @IanC as for why they didn't use SSE as much in 32-bit, I have a theory: at first they didn't want to bother with SSE detection, but all 64-bit CPUs have SSE2, so they could use it without detection there. Later they found that casts were prohibitively slow, so they added SSE (and detection of it) just for those, but didn't want to rewrite all the rest. Just a theory though. – harold Sep 12 '12 at 11:11
  • @harold makes sense. Which casts are you referring to? – IamIC Sep 12 '12 at 11:12
  • @IanC casting from double to int, for example – harold Sep 12 '12 at 11:18
  • I don't know enough about SSE to know why it solves the casting problem. I assume it simply is more efficient at it than x87 (as opposed to avoiding it). – IamIC Sep 12 '12 at 11:20
  • @IanC yes it still has to be done, but it's faster that way – harold Sep 12 '12 at 11:24
  • @harold I believe you answered my question, but only in comment form. – IamIC Sep 12 '12 at 11:46
  • @IanC I don't know, it's mostly conjecture.. – harold Sep 12 '12 at 11:53
  • @harold true, except for the part about SSE being more efficient. Ok, we'll wait and see if a .Net engineer answers the question. I won't hold my breath, though :) – IamIC Sep 12 '12 at 11:55
  • @IanC the edge cases for log exp and pow documented on MSDN are consistent with an implementation based on Taylor sequences, so they probably took that road anyway.. – harold Sep 12 '12 at 12:09
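To make the exponent-field trick harold mentions concrete, here is a minimal sketch; it is my own illustration, not anything the CLR is known to do. A double stores a biased exponent in bits 52..62, so reading that field yields the integer part of log2(x) almost for free, and the mantissa bits can be reused as a crude linear correction. The sketch only handles positive, finite, normal inputs:

    using System;

    class Log2Approximation
    {
        // Crude log2 for positive, normal, finite doubles: exact at powers of two,
        // off by at most roughly 0.086 in between. A real library routine would
        // refine this with a polynomial, but the bit trick supplies the start point.
        static double ApproxLog2(double x)
        {
            long bits = BitConverter.DoubleToInt64Bits(x);
            int exponent = (int)((bits >> 52) & 0x7FF) - 1023;                   // unbiased exponent
            double mantissa = (bits & 0xFFFFFFFFFFFFFL) / (double)(1L << 52);    // fraction in [0, 1)
            return exponent + mantissa; // log2(2^e * (1 + m)) is roughly e + m
        }

        static void Main()
        {
            Console.WriteLine(ApproxLog2(20d));   // ~4.25 (true value is ~4.3219)
            Console.WriteLine(Math.Log(20d, 2d)); // 4.321928...
        }
    }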

1 Answer


There are two separate points here: first, why the compiler is using SSE registers rather than the x87 floating-point stack for function arguments, and second, why it doesn't just use the single instruction that can compute a logarithm.

Not using the logarithm instruction is the easiest to explain: the x87 logarithm instruction works at 80-bit extended precision, whereas you are using a double, which is only 64 bits. Computing a logarithm to 64 bits rather than 80 bits of precision is much faster, and the speed increase more than makes up for having to do it in software rather than in silicon.

The use of SSE registers is more difficult to explain in a way that's satisfactory. The simple answer is that the x64 calling convention requires the first four floating-point arguments to a function to be passed in xmm0 through xmm3.
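As a hedged illustration of that convention (assuming the Windows x64 ABI is what the 64-bit CLR follows for these calls, which is my reading of the answer rather than something shown in the disassembly), a method taking four doubles receives them in xmm0 through xmm3 and returns its result in xmm0:

    using System;

    class CallingConventionSketch
    {
        // At a call site, the four double arguments arrive in xmm0..xmm3,
        // and the double return value comes back in xmm0.
        static double Combine(double a, double b, double c, double d)
        {
            return a * b + c * d;
        }

        static void Main()
        {
            // Mirrors the disassembly above, where Math.Log's two arguments
            // were loaded into xmm0 and xmm1 before the call instruction.
            Console.WriteLine(Combine(1.5, 2.0, 3.0, 4.0)); // 15
        }
    }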

The next question is, of course, why the calling convention tells you to do this rather than use the floating-point stack. The answer is that native x64 code rarely uses the x87 FPU at all, using SSE instead. This is because multiplication and division are faster in SSE (the 80-bit vs 64-bit issue again) and because SSE registers are easier to manipulate (in the FPU you can only access the top of the stack, and rotating the FPU stack is often one of the slowest operations on a modern processor; in fact, some have an extra pipeline stage solely for this purpose).

  • Thanks. Out of interest, XMM registers are 128 bits (or now 256 with YMM) wide. Is it possible to push just one double into the lower 64 bits and have the processor evaluate only that value? If not, it would be wasting memory bandwidth and electricity. – IamIC Sep 18 '12 at 13:18
  • The instruction movsd is doing exactly that: it loads into the low 64 bits of XMMn. Additionally, mulsd multiplies only the low halves of the registers. – jleahy Sep 18 '12 at 15:02
  • Makes sense. It's ironic that 32 bit compilers execute more accurate math than 64 bit ones :) – IamIC Sep 18 '12 at 16:14
  • I assume that all ops have a version that only operates on the lower 64 bits? And I assume that 32 bit singles are converted to 64 for processing? – IamIC Sep 18 '12 at 16:16
  • The first is correct, the second not necessarily so. There are instructions like mulss, which multiplies only the single-precision float in the lowest 32 bits of XMMn. I'm not certain how compilers tend to handle single-precision math. – jleahy Sep 18 '12 at 16:36
  • C and C++ compilers use single-precision instructions for `float` variables and computations; I'd be surprised if .NET compilers would waste instructions on `cvtss2sd` and back, because unlike x87 it's not free to convert. x87 is the weird one (out of FPUs across different architectures); SSE/SSE2 is much more like FP instruction sets for things like ARM, MIPS, etc. (flat register set, no implicit conversion) – Peter Cordes Apr 04 '21 at 02:27
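To illustrate the single-precision point in the last two comments, here is a hedged C# sketch. Whether the JIT actually emits mulss and cvtss2sd for this exact code is my assumption based on the discussion above, not something verified from a disassembly:

    using System;

    class SinglePrecisionSketch
    {
        static void Main()
        {
            float f = 1.5f, g = 2.5f;

            // float * float stays single precision in the C# type system; on x64 this
            // is the kind of operation a scalar mulss would cover.
            float product = f * g;

            // An explicit widening to double is where a cvtss2sd-style conversion
            // would come in; unlike x87, SSE does not get this conversion for free.
            double widened = (double)product;

            Console.WriteLine(product);  // 3.75
            Console.WriteLine(widened);  // 3.75
        }
    }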