28

I know that x87 has higher internal precision, which is probably the biggest difference that people see between it and SSE operations. But I have to wonder, is there any other benefit to using x87? I have a habit of typing -mfpmath=sse automatically in any project, and I wonder if I'm missing anything else that the x87 FPU offers.

John Saunders
Tom

5 Answers

24

For hand-written asm, x87 has some instructions that don't exist in the SSE instruction set.

Off the top of my head, it's all trigonometric stuff like fsin, fcos, fsincos, fptan and fpatan, plus some exponential/logarithm stuff (f2xm1, fyl2x).

With gcc -O3 -ffast-math -mfpmath=387, GCC 9 will still inline sin(x) as an fsin instruction, regardless of what the implementation in libm would have used (https://godbolt.org/z/Euc5gp).

MSVC calls __libm_sse2_sin_precise when compiling for 32-bit x86.


If your code spends most of the time doing trigonometry, you may see a slight performance gain or loss if you use x87, depending on whether your standard math-library implementation using SSE1/SSE2 is faster or slower than the slow microcode for fsin on whatever CPU you're using.

CPU vendors don't put a lot of effort into optimizing the microcode for x87 instructions in the newest generations of CPUs because it's generally considered obsolete and rarely used. (Look at uop counts and throughput for complex x87 instructions in Agner Fog's instruction tables in recent generations of CPUs: more cycles than in older CPUs). The newer the CPU, the more likely x87 will be slower than many SSE or AVX instructions to compute log, exp, pow, or trig functions.

Even when x87 is available, not all math libraries choose to use complex instructions like fsin for implementing functions like sin(), or especially exp/log where integer tricks for manipulating the log-based FP bit-patterns are useful.
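
As a rough illustration of the kind of integer trick meant here (a sketch, not how any particular libm does it): the exponent field of a double already holds the integer part of log2(x), so a log implementation can pull it out with integer operations and only approximate the remaining mantissa part. The function name `exponent_of` is hypothetical, and the code assumes `double` is IEEE-754 binary64:

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <string.h>

/* Unbiased base-2 exponent of a positive, normal double, i.e. floor(log2(x)).
 * Assumes IEEE-754 binary64: 1 sign bit, 11 exponent bits, 52 mantissa bits. */
int exponent_of(double x) {
    assert(isnormal(x) && x > 0.0);
    uint64_t bits;
    memcpy(&bits, &x, sizeof bits);              /* well-defined way to read the bit pattern */
    return (int)((bits >> 52) & 0x7FF) - 1023;   /* strip the mantissa, remove the bias */
}
```

For example, `exponent_of(8.0)` is 3 and `exponent_of(0.75)` is -1; nothing like this shortcut is needed when a single fyl2x does the whole job, but it is cheap with integer/SSE instructions.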

Some DSP algorithms use a lot of trig, but typically benefit a lot from auto-vectorization with SIMD math libraries.

However, for math code where you spend most of your time doing additions, multiplications, etc., SSE is usually faster.


Also related: Intel Underestimates Error Bounds by 1.3 quintillion - the worst case for fsin (catastrophic cancellation for fsin inputs very near pi) is very bad. Software can do better but only with slow extended-precision techniques.

Peter Cordes
Nils Pipenbrinck
  • @LiraNuna really? I'm not aware of any opcode in the SSE instruction set that directly calculates sin or cos. – Nils Pipenbrinck Jan 07 '10 at 11:10
  • 5
    Please provide a source, Quonux. – asdf Jun 17 '11 at 18:45
  • How much faster is SSE, and in what cases would it matter? Proper language support for x87 (which has unfortunately been lacking for quite a while) would allow an expression like `d1=d2+d3+d4;` to be computed within 0.501LSB directly; without such support, computing the value to within even 0.75LSB takes a lot more steps. Unless SSE is a *lot* faster than x87, I would think proper x87 support could improve performance more than having faster ways of doing matched-size arithmetic. – supercat Oct 19 '14 at 18:41
  • Just FYI, those x87 FPU instructions are listed in Section 5.2.4 of the Intel Developers Manual under "Transcendental Instructions" (page 121 in the 4-volume set): fsin for sine, fcos for cosine, and, like @NilsPipenbrinck said, some logarithm stuff is there as well. – Robert Houghton Jul 29 '19 at 19:02
  • @Nils: If you'd rather I posted most of my edit as a separate answer, let me know. Most of what I added was already true in 2009, but x87 is more obsolete in 2019. (And compiler support for auto-vectorizing with SIMD math library implementations of `sin()` and `pow` is much better in 2019, so the DSP advantage is extremely questionable. SIMD is usually ideal for DSP stuff.) – Peter Cordes Jul 30 '19 at 05:42
17
  1. It's present on really old machines.

EOF

Simeon Pilgrim
  • 1
    But not on really *really* old machines - 386 and earlier had the x87 coprocessor as a separate chip which not everyone would have bought, and 486 could be bought with or without the 487 coprocessor on board (486DX vs 486SX). So x87 helps you for the window of time between about 1993 (release of Pentium which always had x87 on board) and 2000 (release of Pentium 4 with SSE2, at which point you could do both single- and double-precision floating point in SSEx.) – Nate Eldredge Dec 21 '21 at 20:24
  • @NateEldredge Given that the question was x87 over SSE, and thus a "to target SSE or not" type of question, I suspect the author was not contemplating in 2009 writing software for pre-2000 computers. Perhaps they did intend to target "retro computers", all the while not understanding any of the nuance of the plan. But then I hope my tongue-in-cheek joke answer was valid for the majority of the prior 16 (now 28) out of 31 (now 43) years the question could maximally apply to. – Simeon Pilgrim Dec 21 '21 at 20:47
10

x87 FPU instructions encode in fewer bytes than SSE instructions, so they are ideal for size-constrained demoscene code.

Quonux
  • 2
    I don't buy this; surely serious demo scene programmers compress their instruction streams; domain-specific compression tools should be able to compress SSE instructions just as well as x87 instructions. – Stephen Canon Mar 03 '13 at 18:33
  • @StephenCanon (uncompressed), but your point is right if you/they use any kind of compression – Quonux Jul 27 '13 at 22:05
  • 1
    @StephenCanon: 1-operand stack instructions (x87) have less entropy than 2-operand SSE instructions where neither operand is implicit. The occasional `fxch` probably doesn't outweigh that. I guess it depends on the compression scheme; I haven't looked at what demos actually do. x87 is great for code-golf, though, [e.g. this](https://codegolf.stackexchange.com/questions/101145/stewies-sequence/102741#102741) – Peter Cordes Jul 30 '19 at 04:39
5
  • There is considerable legacy and small system compatibility with the x87: SSE is a relatively new processor feature. If your code is to run on an embedded microcontroller, there's a good chance it won't support SSE instructions.

  • Even systems which don't have an FPU installed will often provide 80x87 emulators which will make the code run transparently (more or less). I don't know of any SSE emulators; certainly one of my systems doesn't have any, so the newest Adobe Photoshop Elements versions refuse to run.

  • The 80x87 instructions have good parallel operation characteristics which have been thoroughly explored and analyzed since the 80x87's introduction in 1982 or so. Various clones of the x86 might stall on SSE instructions.

wallyk
  • 2
    So your bottom line is: (a) x87 has good legacy support (b) x87 has been well studied. – Nathan Fellman Dec 06 '09 at 07:12
  • 1
    I'm not 100% positive, but I believe that on many 32-bit processors without an FPU, floating-point math could be done more quickly on 80-bit values than 64-bit values [a 53-bit mantissa and 12-bit exponent are no faster to work with than a 64-bit mantissa and 16-bit exponent, but require extra time to pack and unpack]. I'm really puzzled by why the 80-bit format has been languishing for the last couple decades, since as a *computation* format it would seem superior in every way to a 64-bit double. – supercat Oct 19 '14 at 18:47
  • 1
    No CPU in Agner Fog's testing (https://agner.org/optimize) has SSE that is inefficient. If SSE is present, it's always efficient (pipelined add/sub/mul), with SSE division that's no slower than x87 division. Some CPUs break 128-bit SIMD SSE instructions into two 64-bit halves, but scalar SSE/SSE2 is still efficient. So your last point is just being overcautious: nobody bothers to implement *slow* SSE, they just leave it out entirely (e.g. AMD Geode very-low-power CPUs.) – Peter Cordes Jul 30 '19 at 04:51
  • Re: SSE emulators: I'm not sure if https://www.intel.com/content/www/us/en/developer/articles/tool/software-development-emulator.html could emulate SSE on a CPU with only x87. It can emulate newer extensions like AVX on a CPU without them, but it's only fast enough for development / testing, not real use in most cases. [How to test AVX-512 instructions w/o supported hardware?](https://stackoverflow.com/q/51805127) Modern software is expected to detect features and use them if available, so an emulator that advertises support for stuff it can only do slowly isn't what you want. – Peter Cordes Dec 10 '22 at 22:42
2

Conversion between float and double is faster with x87 (usually free) than with SSE. With x87, you can load and store a float, double or long double to or from the register stack and it is converted to or from extended precision without extra cost. With SSE, additional instructions are required to do the type conversion if types are mixed, because the registers contain float or double values. These conversion instructions are fairly fast but do take extra time.

The real fix is to refrain from mixing float and double excessively, not to use x87, of course.
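
To make the mixed-type case concrete, here is a small sketch (the function name `sum_as_double` is made up for this example). Each loop iteration widens a float to double before adding. With SSE2 a compiler would typically emit a separate conversion (cvtss2sd) followed by addsd, whereas x87 can add a 32-bit memory operand directly, converting to extended precision as part of the load; the exact code generated depends on the compiler and options:

```c
/* Accumulate an array of floats into a double.
 * The float -> double widening on every iteration is the "mixed types"
 * situation described above. */
double sum_as_double(const float *values, int count) {
    double total = 0.0;
    for (int i = 0; i < count; i++) {
        total += values[i];   /* implicit float -> double conversion */
    }
    return total;
}
```

Keeping `values` as `double` throughout (or accumulating in `float` when that precision is enough) removes the per-element conversion on either instruction set.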

jilles