14

I notice that compilers generate code targeting SIMD registers whenever double arithmetic is used, in both non-optimized and optimized code. Does this mean the x87 FP unit can be considered obsolete, present only for backward compatibility?

I also notice that other "popular" platforms also rely on their respective SIMD implementations rather than FP designed as a stack.

Also, SIMD implementations tend to be at least 128 bits wide, so does that mean the (internal) precision of operations is higher than that of the x87 FP unit?

I also wonder about performance, throughput and latency: SIMD units were conceived with vector execution in mind, so how do they fare with scalars?

  • Actually the FPU can be more precise, because SIMD stores multiple values in those bits; it currently supports 64-bit doubles and 32-bit floats. There are SSE scalar instructions too. However, the FPU has some functions that SSE doesn't. – Jester Oct 09 '14 at 13:56
  • @Jester - do you imply SIMD units do not use the full 128 bits to store the single value? And that instead it is treated as a packed `double` and is only 64 bits? –  Oct 09 '14 at 13:59
  • 2
    It is firmly on the endangered code species list. Everybody took the breaking change in their 64-bit code generators. The big 3 all switched. There are a few stragglers around, the 32-bit .NET jitter for example. High time we can call the 80-bit FPU disaster a distant bad memory. – Hans Passant Oct 09 '14 at 14:24
  • 5
    @HansPassant One could put forward the opposite argument, that is, since the historical FPU has been freed by the switch to SSE2 registers, it is now available again to be used for its original purpose of providing extended precision. There is no reason any more to pretend that a reduced significand width should be the normal state of the historical FPU. – Pascal Cuoq Oct 09 '14 at 14:53
  • @PascalCuoq - what would be better is to invest that idling logic into adding support for 128-bit scalars in the SIMD units. But that would require giving up the "legacy stuff" and the plague of creeping backward compatibility. I guess for storage purposes even 64 bits would be enough, but for intermediate stages of calculation 128 is simply > 80. –  Oct 09 '14 at 15:01
  • 2
    @HansPassant: 80-bit FPU...disaster? What exactly was disastrous about it? – tmyklebu Oct 09 '14 at 17:01
  • 2
    @tmyklebu Well the three of us know that the x87 is the principal reason why, for a long time, it was difficult to implement a programming language that offered simple, across-the-board double-precision floating-point arithmetic. Reproducibility of results across compilers and optimization levels is often more important than accuracy, and floating-point gained a reputation of unreliability that plagues it to this day. – Pascal Cuoq Oct 09 '14 at 17:13
  • 2
    @tmyklebu Java, as defined in the mid-90s, was difficult to implement for the x87's instructions, and it wasn't until C99 (to my knowledge) that a definition of a language that was deterministic and easy to implement with the 387's instructions was proposed. This proposal was implemented in GCC in 2008, and again to the best of my knowledge, GCC was the first C compiler to implement it (and may still be the only one to have implemented it). – Pascal Cuoq Oct 09 '14 at 17:14
  • @PascalCuoq: Doing `double` math with the x87 was always asking for trouble. Java failed to include a `long double` type. Neither of these disasters can really be blamed on the design of the x87, though; using `long double` math in C programs was easy enough (if not entirely portable), and Java didn't exist until well after the 387. The stigma still attached to floating-point math isn't great, I'll agree. – tmyklebu Oct 09 '14 at 17:49
  • @PascalCuoq: What should be difficult about it? Include a `long double` type and specify that all intermediate calculations will be performed as `long double`, *and any register "spillage" is required to save long double values*. I think that's what Turbo Pascal did in the 1980s and it worked just fine. Even if `long double` storage locations generally had to be padded to 12 or 16 bytes (a language/framework could offer structures that e.g. combined one or three values of type `long double` with a value of type `uint16`, for a total size of 12 or 32 bytes). – supercat Oct 13 '14 at 20:40
  • 1
    @supercat you should write compilers. You would do a better job at it than Microsoft's best experts (including those who worked on the C compiler and those that defined .NET). I shall not even mention Clang, which is sponsored by Apple and Google, by fear of offending people. – Pascal Cuoq Oct 13 '14 at 20:48
  • 1
    @PascalCuoq: I would expect people were loath to write a language spec which would dictate semantics for which x87 would be the only efficient implementation. A tragedy, IMHO, since there are many situations where the ability to use a higher-precision type for intermediate results can make many kinds of computation much easier. It's often *much* easier to ensure that a computation yields a result within a few LSBs of being correct than to ensure that it yields a result which is within 9/16LSB of being correct; indeed, in some cases refining that last LSB may take longer... – supercat Oct 13 '14 at 21:03
  • ...than all the other steps of the computation, combined. – supercat Oct 13 '14 at 21:04
  • 1
    @PascalCuoq: In any case, I'm pretty sure Turbo Pascal 5.0, which I used in 1988, implemented semantics equivalent to today's FLT_EVAL_METHOD==2; the fact that so many compilers between then and now have opted for the "I-don't-care" semantics that you've recognized elsewhere as loathsome doesn't mean that efficient and semantically-reasonable implementation should be difficult. – supercat Oct 13 '14 at 21:28
  • @PascalCuoq: When I first read your comment, I thought you were being sarcastic; seeing what you've written elsewhere, perhaps not. Do you have any idea why compilers went from consistently performing intermediate computations as 80-bit values to using a haphazard mix of precisions? Performing all calculations at the highest available precision would seem simpler than trying to perform `float` operations at `float` precision and `double` operations at `double` precision, and in most cases would be as fast or faster than trying to perform operations with multiple precisions. – supercat Oct 14 '14 at 19:18
  • @PascalCuoq: While evaluating `d1=d2+d3;` using `double` alone may be a very tiny bit more precise than using extended precision and rounding, using extended precision for `d1=d2+d3+d4;` is often much more accurate than using `double` without Kahan summation or other such algorithms, and on the x87 would be much faster than using Kahan summation to achieve accurate results using `double` alone. – supercat Oct 14 '14 at 19:22
  • @supercat I was being sarcastic, the sarcasm being directed at implementers of recent compilation platforms. I should however point out that having an undocumented discipline for extended-precision intermediate computations only provides some of the advantages. For maximal benefit, the rules should be documented. Was Turbo Pascal's handling of FP documented? And no, there is no “single obvious choice”. Even C11 needed to clarify something left ambiguous in C99 (`return fpexpr;`) – Pascal Cuoq Oct 14 '14 at 19:53
  • @PascalCuoq: I don't have the manuals handy, so I don't know exactly what they specified about the behavior. What's important is that the compilers never spilled floating-point registers at lower precision. I am bewildered as to why, when using an FPU where the time required for an operation is independent of precision, a compiler would neither deterministically round low-precision intermediates to lower precision, nor consistently keep them at full precision, but would instead try to spill intermediates only at the "required" precision. To me that's doing extra work, for inferior results. – supercat Oct 14 '14 at 21:40
  • 1
    @PascalCuoq: If I were designing a JIT framework, I would only have one set of floating-point ops, and would have floating-point types for single, double, and "best", where "best" could either be 64, 80, or 128 bits depending upon the implementation, but the sum, product, etc. of any floating-point values would always be evaluated as the "best" type whatever it happened to be. – supercat Oct 14 '14 at 21:45

1 Answer

16

Also, SIMD implementations tend to be at least 128 bits wide, so does that mean the (internal) precision of operations is higher than that of the x87 FP unit?

The width of a SIMD register is not the width of one individual component of the vector it represents. Widely available SIMD instruction sets offer at most the IEEE 754 binary64 format (64 bits wide). This is not nearly as good as the historical 80-bit extended format in either precision or range.

Many C compilers make the 80-bit format available as the long double type. I use it often. It is good for most intermediate computations: using it helps make the end result more accurate even if that result is destined to be returned as a binary64 double. One example is the function in this question, for which a mathematically intuitive property of the final result holds if intermediate computations are done with long double, but not if they are done with the same double type as the inputs and output.

Similarly, among the many constraints that had to be balanced in the choice of parameters for the 80-bit extended format, one consideration was that it should suffice to compute a binary64 pow() by composing the 80-bit expl() and logl(). The extra precision is necessary in order to obtain good accuracy in the end result.

I should note, however, that when the “intermediate” computations are a single basic operation, it is better not to go through extended precision. In other words, when x and y are of type double, the accuracy of (double)(x * (long double)y) is very slightly worse than the accuracy of x * y. The two expressions almost always produce the same results, and in the rare cases where they differ, x * y is very slightly more accurate. This phenomenon is called double-rounding.

Pascal Cuoq
  • So there is no 128bit floating point representation for use in SIMD units? –  Oct 09 '14 at 14:01
  • e.g. the "IEEE 754 quadruple-precision binary floating-point format" –  Oct 09 '14 at 14:02
  • 1
    @user3735658 In most 128-bit SIMD instruction sets, a “128-bit register” represents two `binary64` or four `binary32`, and can never be used for one single `binary128` value. – Pascal Cuoq Oct 09 '14 at 14:03
  • 1
    I assume because the hardware doesn't support such operations? 128bit floating point would be nice though... –  Oct 09 '14 at 14:05
  • 2
    @user3735658 It would (and I often use `extended double` as the available second-best solution). GCC also offers software emulation for quad-precision, available as `__float128`. – Pascal Cuoq Oct 09 '14 at 14:09
  • @user3735658: You can certainly put a quad precision value in a SIMD register, but none of MMX, SSE...SSE4.2 provide any instructions for doing math with such a type. All you can do is load and store. – Ben Voigt Oct 09 '14 at 15:00
  • Even though double rounding may slightly degrade precision, there are still situations where it may be semantically useful. For example, it could be useful to have a type system promise that if `t` is implicitly convertible to `U`, and `(U)t` to `V`, then `(V)t` will be equivalent to `(V)(U)t`. One could uphold such a guarantee while allowing implicit casts from integers to high-precision floating-point types, or from high-precision floating-point types to lower-precision ones, but only if casts from integers directly to lower-precision types are defined to be double-rounded. – supercat Oct 13 '14 at 20:59