Extended (80-bit) double floating point in x87, not SSE2 - we don't miss it?

Question

I was reading today about researchers discovering that NVIDIA's PhysX libraries use x87 FP rather than SSE2. Obviously this will be suboptimal for parallel datasets where speed trumps precision. However, the article's author goes on to say:

Intel started discouraging the use of x87 with the introduction of the P4 in late 2000. AMD deprecated x87 since the K8 in 2003, as x86-64 is defined with SSE2 support; VIA’s C7 has supported SSE2 since 2005. In 64-bit versions of Windows, x87 is deprecated for user-mode, and prohibited entirely in kernel-mode. Pretty much everyone in the industry has recommended SSE over x87 since 2005 and there are no reasons to use x87, unless software has to run on an embedded Pentium or 486.

I wondered about this. I know that x87 uses 80-bit extended doubles internally to compute values, and SSE2 doesn't. Does this not matter to anyone? It seems surprising to me. I know when I do computations on points, lines and polygons in a plane, values can be surprisingly wrong when doing subtractions, and areas can collapse and lines alias one another due to lack of precision. Using 80-bit values vs. 64-bit values could help, I would imagine.

Is this incorrect? If not, what can we use to perform extended double FP operations if x87 is phased out?

Not really an answer to your question, but personally I'm hoping for the 128-bit IEEE 754 binary format to become mainstream. — Mark Dickinson, Jul 08 '10 at 20:45
@Mark - seriously, just what is taking so long? AVX may be a standard before that gets out... — codekaizen, Jul 08 '10 at 21:05
[This](https://www.cs.uaf.edu/2012/fall/cs301/lecture/11_02_other_float.html) is a good answer on what was the reason to discourage x87. And yes, SSE calculations are less precise, it is clearly seen on modern JIT-compilers (compared to traditional x87-based compilers). — Egor Skriptunoff, Jun 08 '15 at 00:55

score 33 · Accepted Answer · edited Jul 10 '10 at 22:18

The biggest problem with x87 is basically that all register operations are done in 80 bits, whereas most of the time people only use 64-bit floats (i.e. double-precision floats). What happens is, you load a 64-bit float into the x87 stack, and it gets converted to 80 bits. You do some operations on it in 80 bits, then store it back into memory, converting it to 64 bits. You will get a different result than if you had done all the operations in just 64 bits, and with an optimizing compiler it can be very unpredictable how many conversions a value goes through, so it's hard to verify that you're getting the "correct" answer when doing regression tests.

The other problem, which only matters from the point of view of someone writing assembly (or indirectly writing assembly, in the case of someone writing a code generator for a compiler), is that the x87 uses a register stack, whereas SSE uses individually accessible registers. With x87 you have a bunch of extra instructions to manipulate the stack, and I imagine Intel and AMD would rather make their processors run fast with SSE code than trying to make those extra stack-manipulation x87 instructions run fast.

BTW if you are having problems with inaccuracy, you will want to take a look at David Goldberg's article "What Every Computer Scientist Should Know About Floating-Point Arithmetic", and then maybe use an arbitrary-precision math library (e.g. GMP) instead.

answered Jul 10 '10 at 15:43 by tsuyoshi · edited Jul 10 '10 at 22:18 by codekaizen

9

Optimizing compilers are bad enough, but try a JIT that has the ability to inline small methods (and therefore vary the number of in-memory temps). Sometimes I call this method and get one answer, sometimes I call the same method with the exact same arguments and get a different result, depending on whether the JITter inlined the call or not! That was a fun regression to track down. – Joe White Jul 10 '10 at 16:18
Yes, I see, that does get complicated with compilers making these kinds of choices, moreso when JIT compilers do it. As to precision, I currently scale the number to [0..1] and remove common bits to decrease the noise due to bits just cancelling, and just imagined that 80 bits would give me more room. While true, apparently, the side-effects are too high of a cost. I hope to test it on QP hardware... whenever that shows up. – codekaizen Jul 10 '10 at 22:17
@Joe White If you're using java and you NEED exactly the same results every time you do floating point math, investigate the use of the `strictfp` keyword. This forces math to be IEEE 754 and not whatever the native platform does (x87 on 32b intel for instance). http://en.wikipedia.org/wiki/Strictfp – KitsuneYMG Jan 10 '11 at 05:05
1

@KitsuneYMG, I'm actually using .NET. As far as I'm aware, there's no equivalent there. :( – Joe White Jan 10 '11 at 20:41
2

It's worth mentioning that the 80-bit precision never was intended for storage. It was deliberately designed to serve as a higher-precision intermediate representation that would be converted back to float or double when the results are being stored. – ArchaeaSoftware Dec 23 '12 at 02:21
3

Would anything prevent an x87 compiler from keeping all intermediate results as 80-bit values, whether they fit in registers or not, and specifying that it will do so? Would results from such a compiler not be entirely reproducible with any other compiler that did likewise? – supercat Sep 12 '13 at 18:43
@supercat If the x87 compiler complies with the CLI spec then it must truncate higher-precision values when there is an explicit conversion instruction. Even if we're not talking about the CLI, one must define "intermediate result." If a function returns a double, the return value is presumably not intermediate. But what if the function is inlined? Different compilers will presumably make different decisions about inlining. If the return value of an inlined function doesn't need to be truncated, then different compilers can give different results. – phoog Dec 09 '13 at 23:14
@phoog: Some machines/compilers used 80-bit math internally, but would arbitrarily convert values to 64-bit `double` any time they couldn't fit in registers, so if `someDouble=f1()*f2()+f3()*f4()` was evaluated in left-to-right sequence, it might round `f1()*f2()` to a `double` but not round `f3()*f4()` [since no more function calls would be needed between the time it was computed and the time `someDouble` was stored]. That sort of behavior is icky and nasty. But if the rules for when things were rounded were independent of what did or did not fit in registers, I wouldn't see a problem. – supercat Dec 09 '13 at 23:32
1

@phoog: Personally, what I'd like to see would be a language with distinct types for e.g. `ieee float`, `fast float`, and `short real`, where the product of two IEEE floats would *always* be rounded to `float` while `fast float` would be rounded or not as convenient. A `short real` would be a 32-bit floating-point value, but would be converted to the maximum-precision type when performing math with it if such conversion could improve the precision of the result [e.g. conversion would be required when computing `f1=f2+f3+f4;`, but not `f1=f2+f3;`]. – supercat Dec 09 '13 at 23:39
@phoog: Given that floating-point variables are used in a number of different ways, having different types for different usage patterns would allow language designers to provide useful warnings in cases where a programmer who wants strict IEEE single-precision semantics accidentally multiplies by 1.01 rather than 1.01f, while allowing a programmer who wants to multiply a single-precision float by 1.01 as accurately as possible to do so without ugly typecasts. – supercat Dec 09 '13 at 23:43
1

Note that the x87 FPU actually has a control word which allows you to reduce the internal precision to 64 bit or even 32 bit to get bitwise identical results, but nobody seems to use that. – fuz Oct 16 '17 at 14:26
1

@fuz: according to Bruce Dawson, MSVC used to reduce to 64-bit (53-bit significand) in its CRT startup. https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/ And DirectX apparently used to reduce it to `float` precision for your whole process! – Peter Cordes Mar 08 '19 at 12:19

score 7 · Answer 2 · edited Apr 04 '21 at 06:26

To make proper use of extended-precision math, it's necessary that a language support a type which can be used to store the result of intermediate computations, and can be substituted for the expressions yielding those results. Thus, given:

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  printf("%12.6f", (x2-x1)*(x2-x1)+(y2-y1)*(y2-y1));
}

there should be some type that could be used to capture and replace the common sub-expressions x2-x1 and y2-y1, allowing the code to be rewritten as:

void print_dist_squared(double x1, double y1, double x2, double y2)
{
  some_type dx = x2-x1;
  some_type dy = y2-y1;
  printf("%12.6f", dx*dx + dy*dy);
}

without altering the semantics of the program. Unfortunately, ANSI C failed to specify any type which could be used for some_type on platforms which perform extended-precision calculations, and it became far more common to blame Intel for the existence of extended-precision types than to blame ANSI's botched support.
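
For what it's worth, C99 later added `float_t` and `double_t` to `<math.h>`. They are at least as wide as `float` and `double`, and are meant to match the format the implementation uses to evaluate expressions, as reported by `FLT_EVAL_METHOD`. Below is a sketch of the rewrite above with `double_t` standing in for `some_type`. The function name is made up, and it returns the value instead of printing it so the result is easy to check, which differs from the original example. How consistently a given compiler keeps such temporaries in the wider format is a separate matter.

```c
#include <math.h>   /* double_t (C99) */

double dist_squared(double x1, double y1, double x2, double y2)
{
    /* double_t is long double when FLT_EVAL_METHOD == 2 (typical x87 builds)
       and plain double when FLT_EVAL_METHOD == 0 (typical SSE2 builds). */
    double_t dx = x2 - x1;
    double_t dy = y2 - y1;
    return (double)(dx * dx + dy * dy);
}
```

For example, `dist_squared(0.0, 0.0, 3.0, 4.0)` returns `25.0` on either kind of build, since every intermediate is exactly representable.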

In fact, extended-precision types have just as much value on platforms without floating-point units as they do on x87 processors, since on such processors a computation like x+y+z would entail the following steps:

1. Unpack the mantissa, exponent, and possibly sign of x into separate registers (exponent and sign can often "double-bunk").
2. Unpack y likewise.
3. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
4. In case x and y had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
5. Pack the exponent and mantissa back into double format.
6. Unpack that temporary result.
7. Unpack z.
8. Right-shift the mantissa of the value with the lower exponent, if any, and then add or subtract the values.
9. In case the earlier result and z had different signs, left-shift the mantissa until the leftmost bit is 1 and adjust the exponent appropriately.
10. Pack the exponent and mantissa back into double format.

Using an extended-precision type will allow steps 4, 5, and 6 to be eliminated. Since a 53-bit mantissa is too large to fit in fewer than four 16-bit registers or two 32-bit registers, performing an addition with a 64-bit mantissa isn't any slower than using a 53-bit mantissa, so extended-precision math offers faster computation with no downside in a language which supports a proper type to hold temporary results. There is no reason to fault Intel for providing an FPU which could perform floating-point math in the fashion that was also the most efficient method on non-FPU chips.
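
The software sequence listed above can be sketched roughly as follows. This is a simplified illustration, not a real soft-float library. It handles only positive, normal doubles, so the sign handling and the renormalization after cancellation never come up. Bits shifted out during alignment are truncated rather than rounded. All names (`unpacked`, `sum3_packed`, `sum3_extended`) are hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Unpacked positive normal value: mant * 2^(exp - 63), top bit of mant set. */
typedef struct { uint64_t mant; int exp; } unpacked;

static unpacked unpack(double x)
{
    uint64_t bits;
    unpacked u;
    memcpy(&bits, &x, sizeof bits);
    u.exp  = (int)((bits >> 52) & 0x7FF) - 1023;
    /* Restore the implicit leading 1 and left-justify into 64 bits. */
    u.mant = ((bits & ((1ULL << 52) - 1)) | (1ULL << 52)) << 11;
    return u;
}

static unpacked add_unpacked(unpacked a, unpacked b)
{
    if (a.exp < b.exp) { unpacked t = a; a = b; b = t; }
    int shift = a.exp - b.exp;
    uint64_t small = shift < 64 ? b.mant >> shift : 0;  /* align; truncates */
    uint64_t sum = a.mant + small;
    if (sum < a.mant) {                 /* carry out of bit 63 */
        sum = (sum >> 1) | (1ULL << 63);
        a.exp += 1;
    }
    a.mant = sum;
    return a;
}

static double pack(unpacked u)
{
    uint64_t frac = u.mant >> 11;       /* keep the top 53 bits */
    uint64_t rem  = u.mant & 0x7FF;     /* the 11 bits being dropped */
    if (rem > 0x400 || (rem == 0x400 && (frac & 1))) frac++;  /* ties to even */
    if (frac >> 53) { frac >>= 1; u.exp++; }
    uint64_t bits = ((uint64_t)(u.exp + 1023) << 52) | (frac & ((1ULL << 52) - 1));
    double x;
    memcpy(&x, &bits, sizeof x);
    return x;
}

/* All ten steps: the intermediate is packed to double and unpacked again. */
double sum3_packed(double x, double y, double z)
{
    double t = pack(add_unpacked(unpack(x), unpack(y)));
    return pack(add_unpacked(unpack(t), unpack(z)));
}

/* Pack/unpack of the intermediate skipped: it stays in 64-bit-mantissa form. */
double sum3_extended(double x, double y, double z)
{
    return pack(add_unpacked(add_unpacked(unpack(x), unpack(y)), unpack(z)));
}
```

With `x = 1e16`, `y = 1.0`, `z = 1.0`, `sum3_packed` returns `1e16` (each `+1` is rounded away when the intermediate is packed), while `sum3_extended` does less work and returns the exact answer `10000000000000002.0`.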

Right, but I think we *can* fault Intel for not providing a way to do standards-compliant correctly-rounded basic arithmetic operations (on 64-bit doubles) *at all*. Yes, you can change the FPU precision to 53 bits instead of 64 bits, but that's clunky, slow, risks interfering with library code that expects the 64-bit precision, and doesn't even solve the problem: while it eliminates double rounding in the normal domain, it doesn't change the exponent range, so still leaves the possibility of double rounding on underflow. SSE(2) is a big improvement in this respect. — Mark Dickinson, Sep 23 '15 at 05:43
@MarkDickinson: While there are specialized applications which require bit-consistent floating-point behavior with operations involving shorter types, for most applications it is better to have proper support for extended precision. I see SSE(2) and x87 as serving different purposes, and would have liked to have seen languages support them both eagerly-promoting and strict floating-point types; further, expressions involving strict types should IMHO only be convertible to larger types after "visibly" coercing them to their own type, so if f1 and f2 were strict float types, `d1=f1*f2`... — supercat, Sep 23 '15 at 16:24
...would need to be written as `d1=(float)(f1*f2);` [not `d1=(double)(f1*f2);`!]. I would guess that in cases where someone writes `d1=f1*f2;` there is a very high likelihood that (1) the code would either have been intended to say `d1=(double)f1*f2;`, (2) a programmer who sees the code thinks it means that, or (3) a programmer who sees the code thinks it was intended to mean that. Requiring the code to be written as `d1=(float)(f1*f2);` in cases where that behavior is intended would eliminate those dangers. — supercat, Sep 23 '15 at 16:28
@marcin: It is, and I'd suggest many people's dislike for it is a consequence of languages' poor treatment of it. The design intention of C was that unsuffixed literals be the highest-precision type, and variadic function arguments should promote to the highest-precision type, so code like `printf("%9.4f/%9.4f", x, y*Y_SCALE);` wouldn't have to worry about the type of `Y_SCALE`, even if the same value of `Y_SCALE` was sometimes used in `float` and `double` calculations. Having a `long double` type which isn't interchangeable in `printf` makes things awkward, as does... – supercat, Apr 22 '16 at 14:45
...having a declaration like `long double d=0.1;` set `d` to 0.10000000000000000555 rather than 0.10000000000000000000813151629364. — supercat, Apr 22 '16 at 14:49

score 4 · Answer 3 · answered Sep 21 '15 at 15:39

The other answer seems to suggest that using 80-bit precision is a bad idea, but it isn't. It performs a sometimes vital role in keeping imprecision at bay, see e.g. the writings of W. Kahan.

Always use 80-bit intermediate arithmetic if you can get away with it speed-wise. If that means you have to use x87 maths, well, do so. Support for it is ubiquitous and as long as people keep doing the right thing, it will remain ubiquitous.

answered Sep 21 '15 at 15:39 by Anonymous

4

Though, somewhat ironically, the intermediate 64-bit precision (*not* 80-bit precision) from use of the 80-bit x87 registers can lead to *less* accurate results for simple arithmetic operations on regular 53-bit doubles. Assuming the usual round-ties-to-even rounding mode, the operation `1e16 + 2.9999` on IEEE 754 binary64 values gives a correctly-rounded result of `10000000000000002.0` on a machine using SSE2, but an incorrectly-rounded result of `10000000000000004.0` when using x87 with FPU precision not altered from its default of 64-bit precision, thanks to double rounding. – Mark Dickinson Sep 21 '15 at 17:05
3

There are a few cases where using double-precision to compute x+y would yield a result with a round-off error of 1/2 ulp, while using extended-precision and converting to double would yield a round-off error of 2049/4096 ulp. On the other hand, there are a lot more cases where using extended-precision to compute x+y+z will yield an accurate result, while using "double" will yield a result that's *far* less accurate, or in some cases Just Plain Wrong. – supercat Sep 21 '15 at 23:04

user6801759 · Answer 4 · 2023-02-25T12:47:29.463

0

Double precision has 11 bits less than f80 (about 2.5 nibbles/digits); for many apps (mostly games) that wouldn't hurt. But you will need all the accuracy available for, say, a space program or a medical app.

It's a bit misleading when some say that f80 operates on a stack (and discourage it for that reason). FPU registers and operations resemble stack operations, which may be what confuses people. It is actually memory based (load/store), not a stack per se, compared to, for instance, calling conventions like cdecl or stdcall, which actually pass parameters via the stack. And there is nothing wrong with that.

The big advantage of SSE is actually parallelizing operations: 2, 4, or 8 values at once, with many variants of each operation. Yes, you can transfer directly to a register, but you will transfer those values to memory anyway at the end.

The big disadvantage of f80 is its odd 10-byte length, which disrupts alignment. You'd have to align values to 16 bytes for faster access, but that's not really practical for arrays.

You still have to use the FPU for trigonometric and other transcendental math operations. For asm, there are many f80 tricks that are really fun and useful.

For games and regular simple apps (nearly all of them), you can just use double without getting anyone killed. But for a few serious math or scientific apps, you just can't ditch f80.

EDIT: incorrect choice of word: "serial" which should have been "parallel"

answered Sep 09 '16 at 01:51 by user6801759 · edited Feb 25 '23 at 12:47

1

`serialize operation`. You mean "parallel operation". Or SIMD operation. – Peter Cordes Sep 09 '16 at 01:59
3

`You still have to use fpu for trigonometric and other trancedental math operations`. If you mean x87 FSIN, [FYL2X](http://www.felixcloutier.com/x86/FYL2X.html) (log2), etc. then no, that's incorrect. Math libraries implement those functions in software, with SSE math. – Peter Cordes Sep 09 '16 at 02:06
2

Even before x87 was obsolete, good math libraries didn't use FSIN, because the internal value of Pi used for range reduction isn't accurate enough; only 66 bits. Intel isn't able to change this, for backwards compat reasons, but [FSIN has large errors near +/- pi/2](https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/) – Peter Cordes Sep 09 '16 at 02:06
Yes, sorry, I meant parallel. Emulation is always much, much slower; in fact that's what we did before numeric processors existed. See Kahan's notes on the IEEE 754 design rationale, https://en.wikipedia.org/wiki/Floating_point#IEEE_754_design_rationale: "This Extended format is designed to be used, with negligible loss of speed,.." But for pragmatic reasons (faster machines, larger capacity in everything), I guess no one bothers about what used to be slow and bloated code anymore. – user6801759 Sep 09 '16 at 23:28
1

About pi, you might see http://www.jpl.nasa.gov/edu/news/2016/3/16/how-many-decimals-of-pi-do-we-really-need/ Multiprecision sure is nice, but it's for fun and exercise only. – user6801759 Sep 09 '16 at 23:28
2

Emulating `fsin` in software is not *much slower*. The internal implementation is [micro-coded with 71-100 uops (on Intel Haswell), with a total latency of 47-106 cycles](http://agner.org/optimize), and (in this case) doesn't do anything that can't be done with simple x86 instructions that each decode to only a single uop. And re: Pi precision, the article you linked doesn't say anything about catastrophic cancellation or floating point problems. Did you even read Bruce Dawson's article that I linked earlier? Have you heard of catastrophic cancellation? – Peter Cordes Sep 09 '16 at 23:41
BTW, welcome to Stack Overflow. You should [edit] your correction ("parallel") into the answer. – Peter Cordes Sep 10 '16 at 00:21
11 bits means a factor 2048: more than 3 digits...? – Michel de Ruiter Dec 20 '21 at 21:03
Update to my old comments: you're probably right that most software implementations don't use extended-precision Pi for range reduction either. So they might not be more precise than `fsin`, but using SSE2 scalar math they're not slower or much if any less precise. Math library implementations of `sin()` exist for all architectures, most of which don't have a hardware instruction for it. – Peter Cordes Feb 25 '23 at 18:39
x87 does use a stack of 8 registers. It's separate from **the** stack, pointed to by ESP / RSP, that's memory. But the st0..7 registers are a stack. (Or more accurately a circular buffer: http://www.ray.masmcode.com/tutorial/fpuchap1.htm.) You're correct about the alignment downside, but you're over-stating the importance of `long double` for scientific computing. A lot of scientific computing works just fine with 64-bit `double`, and the full 80-bit internal precision was only used for temporaries in code compiled for x87 unless it used `long double`. – Peter Cordes Feb 26 '23 at 02:01
Also, MSVC on Windows set the x87 precision-control field to round to `double` (53-bit mantissa) so there was no easy way to use the full 80-bit (64-bit mantissa) precision in a Windows program. See https://randomascii.wordpress.com/2012/03/21/intermediate-floating-point-precision/. And many other ISAs never provided extended-precision floats, yet number crunching code was able to work on them with good-enough results, with sufficient care in numerical algorithms. (Often yes 80-bit temporaries do reduce rounding error, but at the downside of enabling optimization changing the results.) – Peter Cordes Feb 26 '23 at 02:05
