
From IEEE754, I read

[...] every operation [...] shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result [...].

My understanding is that when dividing the double 1.0108552519184509e+76 (0x4FB6593CEBC97CC5) by 4.1777521369084075e+147 (0x5E94E917A9CC65DC), the infinitely precise intermediate significand is (in binary)

1.0001000110011011000100110000110101001010110111101110100000000000001...

and should be rounded (rounding mode "nearest") to

1.0001000110011011000100110000110101001010110111101111

resulting in a quotient of 2.41961518728705e-72 (0x311119B130D4ADEF).

One piece of software here yields 2.4196151872870495e-72 (0x311119B130D4ADEE), which seems to indicate that it computes the intermediate significand only up to a certain position, e.g.

1.000100011001101100010011000011010100101011011110111010000000000

and then rounds.
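For reference, this can be checked with exact rational arithmetic; here is a small Python sketch (the helper and variable names are only for the example) comparing the exact quotient against the two candidate doubles:

```python
import struct
from fractions import Fraction

def double_from_bits(bits):
    """Reinterpret a 64-bit integer as an IEEE 754 double."""
    return struct.unpack('>d', struct.pack('>Q', bits))[0]

x = double_from_bits(0x4FB6593CEBC97CC5)         # 1.0108552519184509e+76
y = double_from_bits(0x5E94E917A9CC65DC)         # 4.1777521369084075e+147
expected = double_from_bits(0x311119B130D4ADEF)  # 2.41961518728705e-72
observed = double_from_bits(0x311119B130D4ADEE)  # 2.4196151872870495e-72

exact = Fraction(x) / Fraction(y)                # infinitely precise quotient

# The exact quotient lies (barely) closer to ...DEF, so round-to-nearest
# must deliver that value.
print(abs(exact - Fraction(expected)) < abs(exact - Fraction(observed)))  # True
print(x / y == expected)  # True on a platform that rounds the division only once
```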

Is this compliant with IEEE754? Is it a common approach?

undur_gongor
  • IEEE 754 does not allow “double rounding” (the name for what is apparently happening here). However, some languages (C99, C++) can be specified in a way that invites double-rounded results, while other languages forbid them (C99 with other compiler-defined parameters, Java). You should tag your question with a programming language, because the programming language makes all the difference here. – Pascal Cuoq Jan 09 '15 at 10:46
  • @PascalCuoq: Thank you for your comment. I am interested in IEEE754 compliance, not in the specific language. It is Ruby 2.0.0, which claims to be using "the native architecture's double-precision floating point representation". – undur_gongor Jan 09 '15 at 10:53
  • 1
    In this case the answer is “no, this apparently double-rounded result is not IEEE 754-compliant”. Note that there is a difference between using IEEE 754 representation (which I'm sure your language does) and offering IEEE 754 operations on this representation. – Pascal Cuoq Jan 09 '15 at 11:00
  • "Double rounding" was the right keyword. It led me to this page http://www.network-theory.co.uk/docs/gccintro/gccintro_70.html which explains it all. If I configure the FPU as described there, the result is correct. – undur_gongor Jan 09 '15 at 11:34

2 Answers


After a request for clarification, the question is about IEEE 754, independently of a programming language. In this context, obtaining the result 2.4196151872870495e-72 for the division being considered, in “round-to-nearest”, is purely and simply incorrect. The correct result is 2.41961518728705e-72, according to the definition found in the question:

[...] every operation [...] shall be performed as if it first produced an intermediate result correct to infinite precision and with unbounded range, and then rounded that result [...].

What happened in practice is that most programming language implementations, and often the specifications themselves, do not put much emphasis on strict adherence to IEEE 754 semantics for floating-point operations. Even when the IEEE 754 double-precision representation is used for storage of floating-point values, operations can end up being implemented as:

  • if the arguments aren't already 80-bit floating-point values with 64-bit significands, conversion from double-precision to this format. This does not lose precision and would not be a problem in itself

  • computation of an 80-bit result from the 80-bit operands, because this is what comes naturally, without extra effort, when computing with the 8087 instruction set

  • just after that or later, conversion (in other words, rounding) of the 80-bit value with its 64-bit significand to a double-precision value with a 53-bit significand.

In some cases the last step does not take place immediately but at the whim of the compiler. This is particularly annoying because it makes the code non-deterministic: adding separate debugging code that should not affect the computations can change them, by changing the availability of 80-bit registers and causing some of them to be spilled and rounded to double precision.

Even when storage in double precision happens immediately for each intermediate result, there remains the issue that the result has been computed, and correctly rounded, for a significand of 64 bits, and is then rounded again to 53 bits. In some cases, the mathematical result is close to the midpoint between two double-precision values, and rounding it to 64 bits of significand drags it to the exact middle. If this result with its 64-bit significand is then rounded to 53 bits, the end result is a different value from the one a direct application of the IEEE 754 rule would have produced. This only happens when the mathematical result is very close to the midpoint between two double-precision numbers, so that the two candidates are almost equally accurate, but one of them is what the IEEE 754 standard prescribes and the other is not.
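This particular division can be replayed with exact rational arithmetic. The following Python sketch (an illustration, not what any of the implementations involved actually execute) rounds the exact quotient once to 53 bits, and alternatively first to 64 bits and then to 53 bits:

```python
from fractions import Fraction

def round_to_nearest_even(x, p):
    """Round a positive rational x to p significant bits, ties to even."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1                                   # now 2**e <= x < 2**(e+1)
    scaled = x * Fraction(2) ** (p - 1 - e)      # significand scaled to [2**(p-1), 2**p)
    n = scaled.numerator // scaled.denominator   # truncated p-bit significand
    rest = scaled - n                            # discarded part, in units of 1 ulp
    if rest > Fraction(1, 2) or (rest == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return n * Fraction(2) ** (e - p + 1)

# These decimal literals convert exactly to the doubles 0x4FB6593CEBC97CC5
# and 0x5E94E917A9CC65DC quoted in the question.
exact = Fraction(1.0108552519184509e+76) / Fraction(4.1777521369084075e+147)

once  = round_to_nearest_even(exact, 53)   # single rounding, as IEEE 754 requires
twice = round_to_nearest_even(round_to_nearest_even(exact, 64), 53)  # x87-style double rounding

print(float(once))    # 2.41961518728705e-72    (0x311119B130D4ADEF)
print(float(twice))   # 2.4196151872870495e-72  (0x311119B130D4ADEE)
```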

The article The pitfalls of verifying floating-point computations makes good further reading.

Notes:

As mentioned by Patricia in her answer, the reason IEEE 754 specifies that +, -, *, / and √ should be computed as if the mathematical result, sometimes one with infinitely many digits, had been obtained and then rounded, is that algorithms exist to obtain this result without computing the entire mathematical result. When no algorithm is known to obtain this “correctly rounded” result cheaply, for instance for trigonometric functions, the standard does not mandate it.

Since you found a solution on a page that explains how to configure the 387 FPU to round directly at 53 bits of significand, I should point out that double-rounding problems can remain even after this configuration, although they are much rarer. Indeed, while the significand of the FPU can be limited to 53 bits, there is no equivalent way to limit the exponent range. A double-precision operation that produces a subnormal result will tend to be double-rounded when computed on the 387, even in 53-bit-significand mode. This caused me to ask this question about how Java implementations perform multiplication on the 387.
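To illustrate the residual problem, the same kind of simulation works. The exact value below is made up purely for the example (it is not taken from this thread), chosen so that its double result is subnormal with only 15 significant bits:

```python
from fractions import Fraction

def round_to_nearest_even(x, p):
    """Round a positive rational x to p significant bits, ties to even."""
    e = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** e > x:
        e -= 1
    scaled = x * Fraction(2) ** (p - 1 - e)
    n = scaled.numerator // scaled.denominator
    rest = scaled - n
    if rest > Fraction(1, 2) or (rest == Fraction(1, 2) and n % 2 == 1):
        n += 1
    return n * Fraction(2) ** (e - p + 1)

# Made-up exact result near 2**-1060: stored as a double it is subnormal and
# keeps only 15 significant bits (the grid spacing down there is 2**-1074),
# while the 387 in 53-bit mode first rounds it to 53 bits thanks to its
# wider exponent range.
exact = Fraction(2) ** -1060 + Fraction(2) ** -1075 + Fraction(2) ** -1115

correct = round_to_nearest_even(exact, 15)   # one rounding: just above a midpoint, rounds up
via_387 = round_to_nearest_even(round_to_nearest_even(exact, 53), 15)  # 53-bit mode, then storage

print(correct == Fraction(2) ** -1060 + Fraction(2) ** -1074)  # True
print(via_387 == Fraction(2) ** -1060)                         # True: double rounding loses the step up
```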

Pascal Cuoq
  • I know I shall not post "thank you" comments but ... Thank you for your effort. The last paragraph already answers the next question I was going to ask. – undur_gongor Jan 09 '15 at 19:09
  • 1
    It may be worth noting that, in languages which properly support it, using extended-precision intermediate computations can sometimes cause two-operand computations to have just over 0.5ulp rounding error, but on the flip side it can greatly reduce rounding errors for three-operand computations (e.g. x+y+z). Further, on 16- or 32-bit processors without floating-point units (common in embedded systems) a compiler-supported extended-precision type may be *faster* than an IEEE double (working with a 64-bit mantissa is no slower than working with a 53-bit one, and avoids bit packing/unpacking). – supercat Jan 14 '15 at 00:27

Some languages permit extra precision, and that seems to be what is happening here. I did the division of the exact representations of the inputs to 1000 decimal places using Java's BigDecimal. The result began with "2.419615187287049816675514541262468407091280398183303735778952998096290304758722566", which lies just barely above the midpoint between the two candidate doubles and is therefore slightly closer to the larger one, 2.41961518728705e-72. That is exactly the situation in which rounding through an extended-precision intermediate can push the result the other way.
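The same computation can be reproduced with Python's decimal module instead of BigDecimal (a sketch; Decimal accepts a float and converts it exactly):

```python
from decimal import Decimal, getcontext

getcontext().prec = 1000   # 1000 significant digits, mirroring the BigDecimal setup

# Decimal(float) converts each operand exactly, so this is the quotient of the
# exact values of the two input doubles, correctly rounded to 1000 digits.
q = Decimal(1.0108552519184509e+76) / Decimal(4.1777521369084075e+147)
print(str(q)[:40])   # 2.41961518728704981667551454126246840709
```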

Whether extra precision is permitted in a given calculation is a matter for the language specification.

In general, floating-point arithmetic uses guard digits to get the same result as though the calculation had been done exactly and then rounded. To do the normal round-to-nearest, the system needs to know one bit beyond the bits that will be kept, as well as an indication of whether any lower-significance bit is one.
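As a toy illustration of that mechanism, here is a Python sketch (the helper is invented for the example) that rounds using only the kept bits, the first discarded bit (the "round" bit), and the OR of all later bits (the "sticky" bit), applied to the significand quoted in the question:

```python
def round_nearest_even(kept, round_bit, sticky_bit):
    """Round-to-nearest-even from the kept significand bits (an integer), the
    first discarded bit, and the OR of every later discarded bit."""
    if round_bit and (sticky_bit or kept & 1):
        kept += 1
    return kept

# Truncated 53-bit significand of the exact quotient: the implicit leading 1
# followed by the 52 fraction bits of 0x311119B130D4ADEE.
kept = (1 << 52) | 0x119B130D4ADEE

# The first discarded bit is 1 and some later bit is also 1: round up.
print(hex(round_nearest_even(kept, 1, 1)))   # 0x1119b130d4adef -> the ...ADEF result

# If the tail is summarized as an exact tie (round = 1, sticky = 0), as happens
# after a prior rounding to 64 bits, ties-to-even keeps the even value instead.
print(hex(round_nearest_even(kept, 1, 0)))   # 0x1119b130d4adee -> the ...ADEE result
```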

Patricia Shanahan
  • I am not sure I would have included the discussion of *how* the illusion of infinitely precise results is obtained in practice. It is not really in the question. Or in the answer: what is likely happening is that the OP's language first computed the result correctly rounded to, say, 64 bits of significand and then rounded this intermediate result to 53 bits of significand. I guess that you are including it *because* the error is not caused by some sort of physical impossibility to compute correctly rounded divisions, but perhaps introduce the paragraph with a higher-level sentence like this? – Pascal Cuoq Jan 09 '15 at 10:57
  • I'm quite sure the input values are ordinary 64 bit doubles with no extra precision (Ruby `Float`s). – undur_gongor Jan 09 '15 at 11:15
  • @PascalCuoq I thought the OP seemed to think that floating point units might have trouble rounding according to the IEEE spec. The fact that there are well-known algorithms requiring only two bits of extra data argues against that. – Patricia Shanahan Jan 09 '15 at 14:17