Rounding a double to avoid round off in subsequent summation

Question

How do I implement this?

// round to the nearest double so that x + ref doesn't cause round off error
double round(double x, double ref) { }

So that

double x = ....;
double y = ....;

double x_new = round(x, y);
return x_new + y; // NO ROUND OFF!

In other terms (y + x_new) - x_new is strictly equal to y

This is not generally possible in the way you want. The text “to the nearest double” suggests you want to return an `x_new` near `x`. However, if `x` is larger than `y`, then any addition to `y` of some number near `x` will force some of the low bits of `y` out of the sum. E.g., suppose `x` is 2**64 and `y` is 1. The nearest `double` (using IEEE-754 64-bit binary) that satisfies the requirement is 2**53-1, and that is not at all near 2**64. — Eric Postpischil, Jan 31 '14 at 16:14
@eric , that's correct. Let's assume that x normalized exponent (as returned by frexp) is smaller or equal to y/ref normalized exponent — paolo_losi, Jan 31 '14 at 16:19
I changed the title because “double rounding” already has a specific meaning, with “double” used as an adjective. — Pascal Cuoq, Jan 31 '14 at 18:41
Note that checking that `(y + x_new) - x_new == y` is not the same as checking that `x_new + y` is exact. The latter implies the former but is not necessary for it. Example: `y=1e300, x_new=1`. `(y + x_new) - x_new == 1e300 == y` but the addition `y + x_new` is not exact. — Pascal Cuoq, Jan 31 '14 at 19:07

score 2 · Accepted Answer · edited May 23 '17 at 11:49

2

Let us assume that x and y are both positive.

Let S be the double-precision sum x + y.

There are two cases:

if x ≤ y, then S - y is exact by Sterbenz's lemma. It follows that the addition (S - y) + y is exact (it produces exactly S, which is a double-precision number). Therefore, you can pick S - y for x_new. Not only y + x_new is exact, but it produces the same result S as y + x.
if x > y, then depending on the number of bits set in the significand of y, you may have a problem. If the last bit in the significand of y is set, for instance, then no number z in a binade after the binade of y can have the property that z + y is exact.

This answer is vaguely related to that answer.

edited May 23 '17 at 11:49

Community

1
1

answered Jan 31 '14 at 19:06

Pascal Cuoq

79,187
7
161
281

0b10000011000100100110111010010111100011010100111111100 <- 0.001 0b10000011000100100110111010010111100011010101000000000 <- 0.001 rounded ref 1.0 @pascal solution 0b10000011000100100110111010010111100011010100000000000 <- 0.001 rounded ref 1.0 my solution If I'm not mistake your solution is not correct (9 bits zeroed instead of 10 bits). I guess it depends on floating point rounding mode. What do you think? – paolo_losi Jan 31 '14 at 22:33
@paolo_losi I was too slowly arriving to the conclusion that my solution was slightly off (great that you found an example). To be precise, I was wondering whether changing my solution to `2*y - (2*y - x)` would make it correct (still assuming 0 ≤ `x` ≤ `y`). At least now I can try it on your example, which will be a clue if not a proof yet. – Pascal Cuoq Jan 31 '14 at 22:39
@paolo_losi I have revamped my answer quite a bit. It now contains a demonstrably correct solution for the same hypotheses my previous answer wanted (0 ≤ `x` ≤ `y`). In the case where `x` is larger than `y` it may be impossible to provide a satisfactory `x_new`, for instance if the significand of `y` ends in a set bit. – Pascal Cuoq Feb 01 '14 at 18:19

paolo_losi · Answer 2 · 2014-01-31T17:24:20.533

0

A possible solution in Python that can be straightforwardly translated to C

import math
from decimal import Decimal


def round2(x, ref):
    x_n, x_exp     = math.frexp(x)
    ref_n, ref_exp = math.frexp(ref)
    assert x_exp <= ref_exp
    diff_exp = ref_exp - x_exp

    factor = 2. ** (53 - diff_exp)

    x_new_as_int = int(round(x_n * factor))
    x_new_as_norm_float = float(x_new_as_int) / factor
    return math.ldexp(x_new_as_norm_float, x_exp)


x = 0.001
y = 1.0

assert (y + x) - x != y

x_new = round2(x, y)

assert (y + x_new) - x_new == y

print "x:", Decimal(x)
print "x_new:", Decimal(x_new)

print "relative difference:", (x_new/x - 1.)

edited Jan 31 '14 at 17:24

answered Jan 31 '14 at 17:00

paolo_losi

283
1
7

Looks like this solution relies on IEEE-754 64-bit binary. C does not specify `double` as [IEEE-754 64-bit](http://en.wikipedia.org/wiki/Double_precision_floating-point_format). Are you looking for a solution that works in C generically or not? – chux - Reinstate Monica Jan 31 '14 at 17:53
@chux it definitely should be C99 portable. I wasn't aware that ieee-754 64 bit != c99 double. – paolo_losi Jan 31 '14 at 18:09

Rounding a double to avoid round off in subsequent summation

2 Answers2