7

Assuming strict IEEE 754 (no excess precision) and round to nearest even mode, is 3*x+x always == 4*x (and thus exact in absence of overflow) and why?

I was not able to exhibit a counter-example, so I went into lengthy dicussion of every possible trailing bit pattern abc and rounding case, but I feel like I could have missed a case, and also missed a simpler demonstration...

I also have an intuition that this could be extended to (2^n-1) x + x == 2^n x and testing every combination of trailing bits in this case is not an option.

We should have (2^n - 1) x == 2^n x - x by property of IEEE 754 as long as n <= 54, but y-x+x == y is not generally true...

Stephen Canon
  • 103,815
  • 19
  • 183
  • 269
aka.nice
  • 9,100
  • 1
  • 28
  • 40
  • Speculation: does it work for cases when x is at the smallest number (+ and -) to 0? – Rob Feb 11 '14 at 00:18
  • I have my doubts about the computation being exact, for fractions that use all the bits. Keep in mind that 4*x will leave the fraction unchanged, but 3*x may lose a low-order bit. – Hot Licks Feb 11 '14 at 00:45
  • @Rob: it trivially works for those cases, because every operation involved is exact. It’s more interesting for “full width” floating-point numbers, where some of the intermediate steps incur rounding, but the final result is exact anyway. – Stephen Canon Feb 11 '14 at 01:50
  • My proof is here: https://news.ycombinator.com/item?id=5671002 (short story: case analysis on whether `3*x` is in the same binade as `4*x` or in the same binade as `2*x`). In the same thread Stephen alluded to a proof by case analysis on the last three bits of the significand of `x`. – Pascal Cuoq Feb 11 '14 at 10:09

1 Answers1

7

In the following, math shown in code format is computed with IEEE 754 in round-to-nearest mode, and math not in code format is exact.

Let p be the number of bits in the significand.

Let f be the factor 2n-1 for a positive integer n and be exactly representable (np).

Let U(x) be the ULP of x. For normal values, U(x) ≤ 21-px.

Let t be f*x. If f*x is subnormal, then it is exactly fx. If it is normal, then t = fx+e for some |e| ≤ &half;U(fx) ≤ 2-px. Note that if |e| is exactly half an ULP, then it must equal the lowest bit of x that is set (since otherwise e would have more than one bit set and could not be half of an ULP).

Substituting for f, t = (2n-1)x+e.

t+x = (2n-1)x+e+x = 2nx+e.

Consider t+x. By IEEE-754 requirements of round-to-nearest, this must be within &half; an ULP of t+x, which we know to be 2nx+e. Clearly 2nx is representable (barring overflow), and |e| ≤ &half;U(fx) ≤ &half;U(2nx). Therefore t+x must be 2nx unless |e| is exactly half an ULP and the low bit of x’s significand is odd (since an even low bit wins the tie and gives 2nx).

If n is 1, then f is 1, and e is 0. If 2 ≤ n, then |e| ≤ 1/4 U(2nx) < &half;U(2nx). So a case where |e| is half an ULP and x’s low bit is odd does not occur.

Therefore t+x must be 2nx. (Overflow and NaN left as an exercise for the reader.)

Additionally, I tested exhaustively for IEEE-754 32-bit binary floating-point.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
  • 1
    Interestingly, this value is also exact when computed as `x+x+x+x` (or `(x+x)+(x+x)` or `4*x`, but those are obvious and unsurprising). It follows that most reasonable means of computing the quantity 5*x are also exact, but this does not hold for 6*x. – Stephen Canon Feb 11 '14 at 01:46
  • @StephenCanon Yes, (2^n+1)*x=2^n*x+x I wanted to answer to http://stackoverflow.com/questions/21676955/floating-point-product-expansion-equivalence that there is more than 1 wayto sum x+x+x+... but there is surprisingly few, 8*x==6*x+2*x==7*x+x – aka.nice Feb 11 '14 at 17:11