
Here are two implementations of an interpolation function. The argument u1 is always between 0.0 and 1.0.

#include <stdio.h>

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - u1) + u1 * u3;  
}

double interpol_80(double u1, double u2, double u3)
{ 
  return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;  
}

int main()
{
  double y64,y80,u1,u2,u3;
  u1 = 0.025;
  u2 = 0.195;
  u3 = 0.195;
  y64 = interpol_64(u1, u2, u3);
  y80 = interpol_80(u1, u2, u3);
  printf("u2: %a\ny64:%a\ny80:%a\n", u2, y64, y80);
}

On a strict IEEE 754 platform with 80-bit long doubles, all computations in interpol_64() are done according to IEEE 754 double precision, and in interpol_80() in 80-bit extended precision. The program prints:

u2: 0x1.8f5c28f5c28f6p-3
y64:0x1.8f5c28f5c28f5p-3
y80:0x1.8f5c28f5c28f6p-3

I am interested in the property “the result returned by the function is always in-between u2 and u3”. This property is false of interpol_64(), as shown by the values in the main() above.

Does the property have a chance to be true of interpol_80()? If it isn't, what is a counter-example? Does it help if we know that u2 != u3 or that there is a minimum distance between them? Is there a method to determine a significand width for intermediate computations at which the property would be guaranteed to be true?

EDIT: on all the random values I tried, the property held when intermediate computations were done in extended precision internally. If interpol_80() took long double arguments, it would be relatively easy to build a counter-example, but the question here is specifically about a function that takes double arguments. This makes it much harder to build a counter-example, if there is one.
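
A random search of this kind can be sketched as follows; the use of drand48(), the ranges for u2 and u3, and the iteration count are arbitrary choices, not necessarily what I used:

#include <stdio.h>
#include <stdlib.h>

/* Sketch of a random search for a counter-example to
   "interpol_80(u1, u2, u3) is in-between u2 and u3".
   drand48(), the value ranges and the loop bound are arbitrary choices. */
double interpol_80(double u1, double u2, double u3)
{
  return u2 * (1.0 - (long double)u1) + u1 * (long double)u3;
}

int main(void)
{
  srand48(1);
  for (long i = 0; i < 10000000; i++) {
    double u1 = drand48();               /* in [0, 1) as required */
    double u2 = drand48() - 0.5;
    double u3 = drand48() - 0.5;
    double lo = u2 < u3 ? u2 : u3;
    double hi = u2 < u3 ? u3 : u2;
    double y = interpol_80(u1, u2, u3);
    if (y < lo || y > hi)
      printf("counter-example: u1=%a u2=%a u3=%a y=%a\n", u1, u2, u3, y);
  }
  return 0;
}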


Note: a compiler generating x87 instructions may generate the same code for interpol_64() and interpol_80(), but this is tangential to my question.

Pascal Cuoq
  • Are you sure this program really uses 80 bits of precision? IIRC modern Intel / AMD machines have builtin 128 fp units coming with SSE and friends. – fuz Dec 05 '12 at 14:57
  • @FUZxxl “128-bit FP units” means vectors of two double-precision or 4 single-precision numbers. But to answer your question, yes, I am sure. The assembly is here: http://pastebin.com/GaM20WZS – Pascal Cuoq Dec 05 '12 at 15:00
  • +1 for both content and presentation – R.. GitHub STOP HELPING ICE Dec 05 '12 at 15:27

2 Answers


Yes, interpol_80() is safe; let's demonstrate it.

The problem states that the inputs are 64-bit floats:

rnd64(ui) = ui

The result is exactly (assuming * and + are mathematical operations)

r = u2*(1-u1)+(u1*u3)

The optimal return value rounded to a 64-bit float is

r64 = rnd64(r)

As we have these properties

u2 <= r <= u3

It is guaranteed that

rnd64(u2) <= rnd64(r) <= rnd64(u3)
u2 <= r64 <= u3

Conversions of u1, u2, u3 to 80 bits are exact too:

rnd80(ui)=ui

Now, let's assume 0 <= u2 <= u3. Performing the computation with inexact float operations then introduces at most 4 rounding errors:

rf = rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3))

Assuming round-to-nearest-even, this will be at most 2 ULP off the exact value. If rounding is performed with 64-bit or 80-bit floats:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)
r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

rf64 can be off by 2 ulp, so interpol_64() is unsafe, but what about rnd64( rf80 )?
We can tell that:

rnd64(r - 2 ulp80(r)) <= rnd64(rf80) <= rnd64(r + 2 ulp80(r))

Since 0 <= u2 <= u3, then

ulp80(u2) <= ulp80(r) <= ulp80(u3)
rnd64(u2 - 2 ulp80(u2)) <= rnd64(r - 2 ulp80(r)) <= rnd64(rf80)
rnd64(u3 + 2 ulp80(u3)) >= rnd64(r + 2 ulp80(r)) >= rnd64(rf80)

Fortunately, u2 - 2 ulp80(u2) lies within (u2 - ulp64(u2)/2, u2 + ulp64(u2)/2), and u3 + 2 ulp80(u3) lies within the corresponding range around u3, so like every number in those ranges they round back to u2 and u3:

rnd64(u2 - 2 ulp80(u2)) = u2
rnd64(u3 + 2 ulp80(u3)) = u3

since ulp80(x) = ulp64(x)/2^(64-53)

We thus obtain the desired property:

u2 <= rnd64(rf80) <= u3

For u2 <= u3 <= 0, the same proof applies symmetrically.

The last case to study is u2 <= 0 <= u3. When we subtract two big values, the result can be off by up to ulp(big)/2 rather than ulp(big - big)/2...
Thus the assertion we made above doesn't hold anymore:

r - 2 ulp64(r) <= rf64 <= r + 2 ulp64(r)

Fortunately, u2 <= u2*(1-u1) <= 0 <= u1*u3 <= u3, and this ordering is preserved after rounding:

u2 <= rnd(u2*rnd(1-u1)) <= 0 <= rnd(u1*u3) <= u3

Thus, since the added quantities have opposite signs:

u2 <= rnd(u2*rnd(1-u1)) + rnd(u1*u3) <= u3

The same goes after rounding, so we can once again guarantee

u2 <= rnd64( rf80 ) <= u3

QED

To be complete, we should take care of denormal inputs (gradual underflow), but I hope you won't be that vicious with your stress tests. I won't demonstrate what happens with those...
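
As a side note, the key step rnd64(u2 - 2 ulp80(u2)) = u2 is easy to spot-check on an x86 machine with 80-bit long doubles; here is a minimal sketch (the use of nextafterl() and the test value are my own choices):

#include <stdio.h>
#include <math.h>

/* Spot-check, assuming an x86 80-bit long double: u2 - 2*ulp80(u2) still
   rounds back to u2 in double precision, because ulp80(x) = ulp64(x)/2^11
   is far smaller than ulp64(u2)/2. The test value is arbitrary. */
int main(void)
{
  double u2 = 0.195;
  long double x = u2;
  x = nextafterl(x, 0.0L);            /* u2 - 1 ulp80(u2) */
  x = nextafterl(x, 0.0L);            /* u2 - 2 ulp80(u2) */
  printf("%d\n", (double)x == u2);    /* prints 1 on such a platform */
  return 0;
}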

EDIT:

Here is a follow-up, since the following assertion was a bit approximate and generated some comments in the case 0 <= u2 <= u3:

r - 2 ulp80(r) <= rf80 <= r + 2 ulp80(r)

We can write the following inequalities:

rnd(1-u1) <= 1
rnd(1-u1) <= 1-u1+ulp(1)/4
u2*rnd(1-u1) <= u2 <= r
u2*rnd(1-u1) <= u2*(1-u1)+u2*ulp(1)/4
u2*ulp(1) < 2*ulp(u2) <= 2*ulp(r)
u2*rnd(1-u1) < u2*(1-u1)+ulp(r)/2

For the next rounding operation, we use

ulp(u2*rnd(1-u1)) <= ulp(r)
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(u2*rnd(1-u1))/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)/2 + ulp(r)/2
rnd(u2*rnd(1-u1)) < u2*(1-u1)+ulp(r)

For the second part of the sum, we have:

u1*u3 <= r
rnd(u1*u3) <= u1*u3 + ulp(u1*u3)/2
rnd(u1*u3) <= u1*u3 + ulp(r)/2

rnd(u2*rnd(1-u1))+rnd(u1*u3) < u2*(1-u1)+u1*u3 + 3*ulp(r)/2
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 3*ulp(r)/2 + ulp(r+3*ulp(r)/2)/2
ulp(r+3*ulp(r)/2) <= 2*ulp(r)
rnd(rnd(u2*rnd(1-u1))+rnd(u1*u3)) < r + 5*ulp(r)/2

I didn't quite prove the original 2-ULP claim, but 5/2 ulp(r) is not that far off...
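
This bound does not depend on which precision rnd() rounds to, so it can be spot-checked at double precision, using a long double evaluation as a stand-in for the exact r. A rough sketch (the random ranges and the iteration count are arbitrary choices):

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <float.h>

/* Rough spot-check of  rnd(rnd(u2*rnd(1-u1)) + rnd(u1*u3)) < r + 5*ulp(r)/2
   with rnd = rounding to double and r approximated in long double. */
int main(void)
{
  srand48(2);
  for (long i = 0; i < 10000000; i++) {
    double u1 = drand48();
    double u2 = drand48();
    double u3 = drand48();
    if (u2 > u3) { double t = u2; u2 = u3; u3 = t; }  /* enforce 0 <= u2 <= u3 */
    double rf64 = u2 * (1.0 - u1) + u1 * u3;          /* every operation rounded to double */
    long double r = u2 * (1.0L - (long double)u1) + (long double)u1 * u3;
    long double ulp_r = nextafter((double)r, DBL_MAX) - (double)r;  /* ~ ulp64(r) */
    if ((long double)rf64 >= r + 2.5L * ulp_r)
      printf("bound exceeded: u1=%a u2=%a u3=%a\n", u1, u2, u3);
  }
  return 0;
}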

aka.nice
  • Your answer helps me think more clearly about my own question, but there is something I do not understand yet. When I try to compute myself a bound for the difference between mathematical and floating-point versions of the expression `u2*(1-u1)+(u1*u3)`, I get `ulp(u2) + ulp(u3) + ulp(u2 + u3)`, the first term being the error of `u2*(1-u1)`, the second the error of `(u1*u3)` and the third the error introduced by the product. Your result of 2 ulps seems better, but I am not sure how you derive it... – Pascal Cuoq Dec 05 '12 at 22:09
  • @PascalCuoq you are right, this was a bit fast... With assumption 0<=u2<=u3, all terms are positive and are inferior in magnitude to their sum r, so ulp(u2*(1-u1))+ulp(u3*u1)<=2*ulp(r), and rounding is bounded by ulp/2 after basic operations... You also have one rounding error when performing rnd(1-u1) – aka.nice Dec 05 '12 at 22:21
  • Oh, regarding denormals, there is no need to worry: when `u1`, `u2` and `u3` are double-precision numbers, then none of the sub-expressions of `u2 * (1.0 - (long double)u1) + u1 * (long double)u3` can be a `long double` denormal. – Pascal Cuoq Dec 05 '12 at 22:49
  • I do not see why the assertion that the error in rf is at most 2 ULP is true. The error in 1-u1 may be up to ULP(1)/4. Then it is multiplied by u2, so the error may be up to |u2|•ULP(1)/4. This error is carried into rf. If |u2| is huge and rf is small, the error is many ULP. – Eric Postpischil Dec 06 '12 at 15:01
  • However, the proof can be fixed in light of this. If |u2|>>|r|, then clearly u2 ≤ rnd(rf80). So we just have to consider whether rnd(rf80) ≤ u3. This is only a concern if r is near u3, so u1 is near 1. When u1 is near 1, 1-u1 is exact, so the error from that operation is zero. – Eric Postpischil Dec 06 '12 at 15:04
  • @EricPostpischil agree - this part was too approximative, I checked error bounds more accurately, it is quite fastidious, but since u2*rnd(1-u1)<=u2<=r and u2*ulp(1)<=2*ulp(u2) we can still bound rnd(u2*rnd(1-u1)) error to a few ulp(r) – aka.nice Dec 06 '12 at 17:59
  • @aka.nice: In double, let u2 = `-.1`, u3 = 103, u1 = -u2/(u3-u2). (That value for u1 is calculated exactly, then rounded to a double. u2 is 0x1.999999999999ap-4, u1 is 0x1.fc86155aa1659p-11.) Then r is about -4.342266507e-18, but calculating it in long double produces about -4.343584954e-18 (exactly -0x1.408p-58). The error is about 1,367,432,361,508 ULP. I think that is more than a few, although I do not have a reference handy to support that claim. – Eric Postpischil Dec 06 '12 at 19:09
  • @EricPostpischil true, but it falls in branch u2 <= 0 <= u3 for which I didn't give any error bound – aka.nice Dec 06 '12 at 21:29

The main source of loss-of-precision in interpol_64 is the multiplications. Multiplying two 53-bit mantissas yields a 105- or 106-bit (depending on whether the high bit carries) mantissa. This is too large to fit in an 80-bit extended precision value, so in general, you'll also have loss-of-precision in the 80-bit version. Quantifying exactly when it happens is very difficult; the most that's easy to say is that it happens when rounding errors accumulate. Note that there's also a small rounding step when adding the two terms.
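
To see the effect in isolation, consider squaring 1 + DBL_EPSILON: the exact product is 1 + 2^-51 + 2^-104, whose significand spans 105 bit positions, so the low term is dropped even by an 80-bit extended multiplication. A small illustration (assuming x86 80-bit long doubles; the value is chosen for convenience and is not from the question):

#include <stdio.h>
#include <float.h>

/* The exact square of 1 + DBL_EPSILON needs 105 significand bits, more than
   the 64 bits of an x86 long double, so the 2^-104 tail is rounded away. */
int main(void)
{
  double x = 1.0 + DBL_EPSILON;                              /* 1 + 2^-52 */
  long double p = (long double)x * x;                        /* rounded to 64-bit significand */
  long double head = 1.0L + 2.0L * (long double)DBL_EPSILON; /* 1 + 2^-51, exactly representable */
  printf("p - head   : %La\n", p - head);                    /* 0x0p+0: the tail was lost */
  printf("exact tail : %La\n", (long double)DBL_EPSILON * DBL_EPSILON); /* 0x1p-104 */
  return 0;
}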

Most people would probably just solve this problem with a function like:

double interpol_64(double u1, double u2, double u3)
{ 
  return u2 + u1 * (u3 - u2);
}

But it looks like you're looking for insight into the rounding issues, not a better implementation.

R.. GitHub STOP HELPING ICE
  • `u1` is 0.025, not 0.25, so it has more bits set, the mantissa is 1999999999999a. – Daniel Fischer Dec 05 '12 at 15:31
  • @R.: `u1` is .025, not .25; its significand (not mantissa) has more than one bit set. And the question is not how to alter the computation to produce results in range, the question is under what circumstances the computation may be out of range. – Eric Postpischil Dec 05 '12 at 15:32
  • I edited my answer to better match what OP seems to be looking for, but now it's not a very satisfying answer. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 15:38
  • The question is in the context of arguing that compilers with `FLT_EVAL_METHOD = 0` are better because more predictable. The property at hand makes it difficult to argue this since it's false there, and true when the compiler silently compiles `interpol_64()` as if it were `interpol_80()` (say, a `FLT_EVAL_METHOD = 2` compiler such as modern GCC targeting x87, or a I-don't-care compiler such as an old GCC targeting x87). You are right, changing the source code is not the question here. Besides, does `u2 + u1 * (u3 - u2)` not have the same issue for well-chosen `u2`, `u3` and `u1=1` ? – Pascal Cuoq Dec 05 '12 at 16:28
  • I think you're misinterpreting "more predictable". The word "predictable" in this context doesn't mean "giving you the answer you naively expect". It means the results are independent of whether the compiler needs/chooses to spill registers during a computation. `FLT_EVAL_METHOD==2` is "unpredictable" because you have no way of knowing or controlling whether the compiler might run out of temp space in floating point registers and spill to nominal-precision storage on the stack. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 16:42
  • As for whether `u2 + u1 * (u3-u2)` has the same issue, at least nothing can go wrong in the `u2==u3` case, which is the most pathological one to mess up. :-) Moreover, assuming `u2` and `u3` have the same sign and magnitude (exponent), `u3-u2` is exact, and multiplying by a `u1` in the specified range could never cause the final result to go out of range, regardless of rounding mode. The only case that requires further consideration is when `u3` and `u2` have different magnitude. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 16:46
  • My understanding of `FLT_EVAL_METHOD==2` is that it guarantees `long double` precision for all intermediate computations. If an intermediate value is spilled, it is spilled to a `long double` stack location. This is how modern GCC does it (I am pretty sure). What you described is what I called I-don't-care compiler. We agree that those are the worst. – Pascal Cuoq Dec 05 '12 at 16:48
  • Hmm, I wasn't aware that GCC had fixed this. If so, I think the behavior of a program using floating point is unique/deterministic for each value of `FLT_EVAL_METHOD` in {0, 1, 2} (only "unpredictable" for negative values like FreeBSD uses). This is a very welcome development. – R.. GitHub STOP HELPING ICE Dec 05 '12 at 16:52
  • I did not test it (I am not that interested in x87) but my source is http://gcc.gnu.org/ml/gcc-patches/2008-11/msg00105.html – Pascal Cuoq Dec 05 '12 at 16:53
  • Okay, so it does not look that there are any inputs that make your function falsify the property from the question: the property is only difficult to achieve when `u2` and `u3` are close, and when they are close, `u3 - u2` is exact. However, your function does not have the other property that `interpol(1.,u2,u3)` always return `u3` (counter-example for `u2=-DBL_EPSILON` and `u3=2`). The function in my question has this property. Perhaps that's the reason it was written the way it is (I didn't write it). – Pascal Cuoq Dec 05 '12 at 20:19
  • A robust implementation probably chooses the version based on whether `u2` and `u3` are close... – R.. GitHub STOP HELPING ICE Dec 05 '12 at 21:20