3

Is there a sub-O(n) way¹ of replicating repeated summation of floating-point values in IEEE 754 arithmetic?

That is, is there a faster way of exactly² replicating the result of the following pseudocode:

f64 notQuiteFMA(f64 acc, f64 f, bigint n) {
    for (bigint i = 0; i < n; i++) {
        acc = acc + f;
    }
    return acc;
}

?³

In real arithmetic the answer is straightforward: return acc + f*n is sufficient. However, in finite precision the answer is far more complex. As one example of this, note that notQuiteFMA(0, 1, 1 << 256) is somewhere around⁴ 2^53, not the 2^256 that one would expect in infinite precision. (This example also shows why speeding this calculation up could be desirable - counting to 2^256 is rather impractical.)
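For concreteness, here is a minimal demonstration of that saturation (my own illustration, not part of the original pseudocode; it assumes the default round-to-nearest-even mode):

#include <stdio.h>

int main(void)
{
    double acc = 9007199254740990.0;   /* 2^53 - 2 */
    for (int i = 0; i < 5; i++)
    {
        printf("%.1f\n", acc);
        acc += 1.0;
    }
    /* Prints ...990.0, ...991.0, ...992.0 (= 2^53), then 2^53 twice more:
       2^53 + 1 rounds back down to 2^53, so the accumulator is stuck. */
    return 0;
}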

  1. I am aware that this entire thing could be technically done in O(1) time⁵ with a 2^128 entry lookup table with each entry having an array of 2^64 elements; kindly ignore such galactic algorithms. (Pigeonhole principle - repeated addition can not progress more than 2^64 steps before entering a loop - so 'just' store the entire prefix and the number of steps in the final loop for every initial value of acc and f.)
  2. Bit-exact given the current rounding mode (ignoring the can of worms that is non-signaling NaNs at least).
  3. Treat this as pseudocode - i.e. the sequence of double-precision floating-point additions described. I am aware there are cases in certain programming languages where floating-point calculations can be done in higher precision / reordered to make things more SIMD-friendly / etc / etc; I am not interested in these cases for this question.
  4. 64-bit floats have a 52-bit mantissa; x + 1 == x first occurs around 2^53 when the addition is insufficient to increment the mantissa and the result is rounded down. Exact value depends on rounding mode I believe.
  5. Assuming a computational model where modulo of a bigint by a constant-bounded value is O(1), at least.
TLW
  • I suspect that your question arises as a proposed solution to a different question. Are you aware of interval arithmetic? These libraries do normal computations, but every number retains an upper and lower bounds. All computations are done so that even if you don't know the exact answer, you do know that the exact answer lies between those two bounds. – Tim Roberts Aug 20 '23 at 23:34
  • Now that I look at it, I'm confused by your question. You say you want to exactly replicate the IEEE-754 result of that pseudocode, but that's not what you want. The result of that pseudocode is about `2^53`, as you say, because after that point, you are adding 0 each time. You seem to be after infinite precision floating point, which is contradictory. There would never be any rounding errors. Why not use big integers? – Tim Roberts Aug 20 '23 at 23:39
  • You seem to be confused. I do _not_ want the infinite-precision result. I wish to accurately replicate the "inaccurate" result without having to do the O(n) additions. – TLW Aug 21 '23 at 02:57
  • 'accurately replicate the "inaccurate" result' here meaning "coming up with the same bit-exact result as the naive O(n) summation described". – TLW Aug 21 '23 at 02:58
  • " You say you want to exactly replicate the IEEE-754 result of that pseudocode, but that's not what you want.". That _is_ precisely what I want. – TLW Aug 21 '23 at 02:59
  • FYI, the preferred term for the fraction portion of a floating-point representation is “significand.” “Mantissa” is an old term for the fraction portion of a logarithm. Significands are linear (adding to a significand adds to the value represented). Mantissas are logarithmic (adding to a mantissa multiplies the value represented). The significand of an IEEE-754 binary64 (“double precision”) has 53 bits, not 52. 52 are encoded via the primary significand field, and 1 is encoded via the exponent field. – Eric Postpischil Aug 21 '23 at 13:20
  • Re “Exact value depends on rounding mode I believe”: The change occurs at the same place for all rounding modes. When adding 1 repeatedly, all rounding modes produce the same result up to 2^53, since all results are representable. For 2^53+1, round toward 0, round toward −∞, and round to nearest with ties to even all produce 2^53, and thus further additions repeat this. Round toward +∞ and round to nearest with ties to away produce 2^53+2, and thus further additions step through all representable values. – Eric Postpischil Aug 21 '23 at 13:24
  • 'Mantissa' is also used by some authors to refer to the fractional portion of a floating-point representation. E.g. Wikipedia as of 1172319872 says "Logically, a floating-point number consists of: A signed (meaning positive or negative) digit string of a given length in a given base (or radix). This digit string is referred to as the significand, *mantissa*, or coefficient" (emp. added). Your comment that significand is technically more accurate is noted. – TLW Aug 26 '23 at 17:59

3 Answers

5

Conclusion

We can implement an O(log n) solution by stepping through binades. This is enabled by the fact that all additions within a binade, possibly except the first, change the accumulating sum by the same amount, and therefore we can easily calculate exactly what sum would be produced by repeated additions until the end of the binade.

Preliminaries

A binade is an interval [2^e, 2^(e+1)) or (−2^(e+1), −2^e] for some integer e. For purposes of this answer, the entire minimum exponent range (including the minimum-exponent normals and all the subnormals) will be considered a single binade.

For a floating-point number x (a number representable in the floating-point format), define E(x) to be the exponent used in the representation of x. Given the definition of binade used in this answer, E(x) denotes the binade x is in, except for its sign.
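As a small sketch of such an E for float (my addition, not from the original answer), one can take frexp's exponent and clamp it so that all subnormals fall into the lowest binade; note that the demonstration code later in this answer deliberately skips this clamping:

#include <float.h>
#include <math.h>

/*  Exponent of x per the binade definition above: frexpf's exponent,
    clamped so that the subnormals and the minimum-exponent normals
    share a single binade.  (frexpf normalizes the significand to
    [.5, 1), and FLT_MIN_EXP uses the same convention.)  */
static int E(float x)
{
    int e;
    frexpf(x, &e);
    return e < FLT_MIN_EXP ? FLT_MIN_EXP : e;
}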

Expressions in code format represent floating-point arithmetic: x+y is the real-number-arithmetic result of adding x and y, while `x+y` is the floating-point result of adding x and y. x and `x` are used for the same value as appropriate.

Floating-point addition is weakly monotonic for finite values: y < z implies `x+y` <= `x+z`, in any rounding mode, except when x is an infinity and y or z is an infinity of the opposite sign, resulting in a NaN.

Discussion

Let A0, A1, and A2 be the values of A before, between, and after two executions of A += f. Consider the case when E(A0) ≥ E(A1) = E(A2). Let g be A2 − A1. g is representable, either by Sterbenz’ Lemma if A1 is normal or, if A1 is subnormal, because A1, A2, and f are all subnormal (or zero) and there are no bits below the ULP of A2 to be lost in the addition. (Since g is representable, `A2-A1` calculates it with no rounding error.)

We will show that all subsequent additions of f that produce a value of A with E(A) = E(A1) increase A by the same amount, g.

Let u be the ULP of A1. Let r be the remainder of |f| modulo u. Let s be +1 if f is positive or −1 if f is negative. (The case when f is zero is trivial.) Using a directed rounding mode (toward −∞, toward zero, or toward +∞), the rounding error in any subsequent addition that does not change E(A) is either always −r or always u−r, according to the rounding method and the sign of f. Using round to nearest with ties to away, the rounding error is always −sr if r < ½u and always s(u−r) otherwise. With round to nearest with ties to even, if r ≠ ½u, the rounding error is always −sr or always s(u−r), as with ties to away.

That leaves round to nearest with ties to even with r = ½u. Let b be the bit of f in the u position. In this case, the low bit of A1 must be even, because it was the result of adding f, which has remainder modulo u of ½u, to A0 (which is necessarily 0 modulo u since E(A0) ≥ E(A1)) to produce A1. In that addition, the real-number result was exactly midway between two representable values, so it was rounded to an even low digit (a multiple of 2u).

Given the low bit of A1 is even, we have two cases:

  • b is 0, so the addition of f has a real-number-arithmetic result that is 0u + ½u modulo 2u ≡ ½u, and it is rounded to an even low digit, so the rounding error is −½u if the result is positive or +½u if it is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).

  • b is 1, so the addition of f has a real-number-arithmetic result that is 1u + ½u modulo 2u ≡ 1½u, and it is rounded to an even low digit, so the rounding error is +½u if the result is positive and −½u if the result is negative. This then repeats for subsequent additions producing an A with E(A) = E(A1).

(To see why we go to the second addition within a binade before asserting all subsequent additions add the same amount, observe the addition that produced A0 may have involved bits from a prior A with a lower ULP, in which case the real-number-arithmetic result might not have been exactly midway between representable values, so A0 could have an odd low digit even using round to nearest with ties to even, and A2 − A1 would not equal A1 − A0.)
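As a quick empirical illustration of this discussion (my own check, not part of the original answer; it uses arbitrary example values and the default round-to-nearest mode), the following verifies that once the running sum has stayed in the same binade for two consecutive additions, every further addition that stays in that binade changes the sum by exactly the same g:

#include <math.h>
#include <stdio.h>

static int E(float x) { int e; frexpf(x, &e); return e; }

int main(void)
{
    float A = 3.0f, f = 0.7f;           /* arbitrary example values */
    int ok = 1;
    for (int i = 0; i < 10000; )
    {
        int e0 = E(A);
        A += f; ++i;
        if (E(A) != e0) continue;       /* entered a new binade; start over */
        float A1 = A;
        A += f; ++i;
        if (E(A) != e0) continue;
        float g = A - A1;               /* exact, per the discussion above */
        while (i < 10000)
        {
            float next = A + f;
            if (E(next) != e0) break;   /* next addition would leave the binade */
            if (next - A != g) { printf("mismatch at step %d\n", i); ok = 0; }
            A = next; ++i;
        }
    }
    printf(ok ? "every in-binade step equaled g\n" : "found a mismatch\n");
    return 0;
}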

Implementation

Given the above, we can calculate how many additions of g, say t, can be performed until just before the next binade is reached. Then we can calculate the exact value of A just before exiting the current binade, A+tg. Then a single addition gives us the first A in that binade, and another addition either takes us into a new binade or gives us the information needed to perform the calculation for the current binade. Then we can repeat this for each binade until the desired number of additions is reached.

Below is code demonstrating the concept. This code needs some improvement, per comments. I expect to work on it when time permits, which may be more than a few weeks from now. The E function in the code takes the exponent from frexp, not clamping it to the minimum exponent as described in the discussion above. This may slow the code but should not make it incorrect. I expect the code handles subnormals. Infinities and NaNs could be added via ordinary tests.

Calculation of t, the number of additions that can be performed before exiting the current binade, could use more discussion about the precision required and how rounding might affect it.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>


static float Sum0(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    while (n--) A += f;
    return A;
}


//  Return the exponent of x, designating which binade it is in.
static int E(float x)
{
    int e;
    frexp(x, &e);
    return e;
}


static float Sum1(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }

    if (n == 0) return A;

    while (1)
    {
        //  Record initial exponent.
        int e0 = E(A);

        A += f;
        if (--n == 0) return A;

        //  If exponent changed, continue to do another step.
        if (E(A) != e0) continue;

        /*  Note it is possible that A crossed zero here and is in the "same"
            binade with a different sign.  With rounding modes toward zero or
            round to nearest with ties to away, the rounding caused by adding f
            may differ.  However, if we have crossed zero, the next addition
            will put A in a new binade, and the loop will exit this iteration
            and start a new step.
        */

        float A0 = A;
        A += f;
        if (--n == 0) return A;

        //  If exponent changed, continue to do another step.
        if (E(A) != e0) continue;

        //  Calculate exact change from one addition.
        float g = A - A0;

        //  If g is zero, no future additions will change the sum.
        if (g == 0) return A;

        /*  Calculate how many additions can be done until just before the
            next binade.

            If A and f have the same sign, A is increasing in magnitude, and
            the next binade is at the next higher exponent.  Otherwise, the
            next binade is at the bottom of the current binade.

            We use copysign to get the right sign for the target.  .5 is used
            with ldexpf because C's specifications for frexp and ldexpf
            normalize the significand to [.5, 1) instead of IEEE-754's [1, 2).
        */
        int e1 = e0 + (signbit(A) == signbit(f));

        double t = floor((ldexpf(copysignf(.5, A), e1) - A) / (double) g);

        /*  Temporary workaround to avoid incorrectly stepping to the edge of
            a lower binade:
        */
        if (t == 0) continue;
        --t;

        if (n <= t) return A + n*g;

        A += t*g;
        n -= t;
    }
}


static int Check(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    return A0 != A1;
}


static void Test(float A, float f, int n)
{
    float A0 = Sum0(A, f, n);
    float A1 = Sum1(A, f, n);
    if (A0 != A1)
    {
        printf("Adding %a = %.99g to %a = %.99g %d times:\n", f, f, A, A, n);
        printf("\tSum0 -> %a = %.99g.\n", A0, A0);
        printf("\tSum1 -> %a = %.99g.\n", A1, A1);
        exit(EXIT_FAILURE);
    }
}


int main(void)
{
    srand(1);

    for (int i = 0; i < 1000000; ++i)
    {
        if (i % 10000 == 0) printf("Tested %d cases.\n", i);
        Test(rand() - RAND_MAX/2, rand() - RAND_MAX/2, rand() % (1<<20));
    }
}
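As for the infinity and NaN tests mentioned above, here is one way they might look (my own sketch, a front-end wrapper rather than the in-loop guards njuffa describes in the comments below; overflow to infinity during the summation itself still needs an in-loop test):

#include <math.h>

static float Sum1(float A, float f, int n);     /* the function above */

static float Sum1WithSpecials(float A, float f, int n)
{
    if (n < 0) { f = -f, n = -n; }
    if (n == 0) return A;

    //  A NaN in either operand propagates through every addition.
    if (isnan(A) || isnan(f)) return A + f;

    //  With an infinite operand, the sum is fixed after the first addition
    //  (an infinity, or NaN for opposite infinities), so one addition suffices.
    if (isinf(A) || isinf(f)) return A + f;

    return Sum1(A, f, n);
}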
Eric Postpischil
  • @njuffa: I think a general algorithm is: Start of loop: A += f. If binade changed, go to start. A0 = A, A += f. If binade changed, go to start. Else set g = A-A0, calculate number of additions t until next binade, set A += t*g. (Insert tests at each step to stop when we reach the desired number of additions.) I think that will work for any starting A value and any combination of signs and magnitudes of A and f. The basic premise is that the second and subsequent additions in a binade have a fixed change to A, so we can calculate the entire binade. – Eric Postpischil Aug 22 '23 at 11:03
  • @njuffa: Minor note: the “calculate number of additions until next binade” needs to be sensitive to whether that binade is the next higher or next lower (signs of A and g are the same or differ). And handle special cases, such as g is zero, in which case we return the current A, because all future additions will not change it. – Eric Postpischil Aug 22 '23 at 11:29
  • @njuffa: I have code that covers the general case excluding infinities and NaNs—any start accumulator, any f, regardless of signs and relative magnitudes. It is running, but I need to test it further and write it up. Maybe later, got errands to do now. I added it to the answer as candidate code for evaluation. – Eric Postpischil Aug 22 '23 at 13:30
  • @njuffa: Thanks. The error occurs for a step landing at a binade edge; the direct sum is in the lower binade, whereas the computed sum is in the higher binade. Likely due to stepping down a binade (`A` and `f` have opposite signs, so `f` is bringing `A` down in magnitude) and that making a finer result available (lower ULP in lower binade), which was not considered in my initial analysis of proceeding through a binade to the next higher one. An easy workaround is to decrease `t` by 1 so we never step to the edge of a binade. – Eric Postpischil Aug 23 '23 at 23:38
  • I am interested in fleshing this out and writing it up, but it may take a while. I am traveling for several weeks soon. – Eric Postpischil Aug 23 '23 at 23:40
  • @njuffa: If you want to test the workaround, after the line `double t = floor((ldexpf(copysign(.5, A), e1) - A) / (double) g);`, insert `if (t == 0) continue; --t;`. – Eric Postpischil Aug 23 '23 at 23:40
  • Take your time. I did not mean to create any pressure, just signal interest. While I have no immediate need for this functionality, (1) it seems definitely useful and (2) after pondering it a bit I was skeptical that a general solution could be found (and am happy to learn that my assessment was probably incorrect :-) – njuffa Aug 23 '23 at 23:46
  • @njuffa: That hits +∞. Not sure why it loops; the code ought to single-step through n. In any case, we probably just need to test for infinities and handle them. – Eric Postpischil Aug 24 '23 at 02:47
  • @njuffa: Oh, it must loop because something goes wrong in the calculation of `t` when `A` is infinite, and that resets `n` to some large value, and that repeats. Stopping when an infinity is reached will cure that. – Eric Postpischil Aug 24 '23 at 02:54
  • That jibes with my take on the issue. Quick fix for the overflow problem: After `float A0 = A; A += f;` insert `if (isinf(A)) return A;` I am qualifying that with another test run right now. – njuffa Aug 24 '23 at 03:38
  • For handling NaNs, after `if (n == 0) return A;` I inserted `if (isnan (A) || isnan (f)) return A+f;`. And with that change in place I now see perfect bit-wise matches throughout, with the obvious exception of two NaN inputs, where either of the input NaNs could be returned, quietened. My code is at [Compiler Explorer](https://godbolt.org/z/Yr9cEx3j8). – njuffa Aug 24 '23 at 04:12
  • I'll have to take some time to grok this; this seems like exactly what I was looking for otherwise. One thing that stands out immediately - this appears to have no special handling for subnormals. Does this handle them properly? – TLW Aug 26 '23 at 17:55
  • @TLW: If it does not, it will when I am done. Subnormals will not be hard for this. – Eric Postpischil Aug 26 '23 at 18:41
  • @TLW Extensive testing is **not** proof, but for what it is worth, I have observed no test failures due to subnormals with the slightly modified code I linked in my earlier comment. – njuffa Aug 29 '23 at 01:15
0

This is a tough problem...

At least we know that up to n=5, the result is always n*f in round to nearest even mode.
See for example Is 3*x+x always exact?

But then we gradually lose precision... At what pace?

I think that we can predict the case of overflow, so I will not bother with that. We can thus ignore the exponent part and focus on the significand.

The case of gradual underflow is also not very interesting, because addition will be exact until the biased exponent reaches 1.

So we can focus on significands in the (1, 2) interval, 1 itself being uninteresting.

I tried the following Smalltalk snippet in order to get an idea of the error bounds:

sample := Array new: 60000.
x := 1.0.
00001 to: 10000 do: [:i | sample at: i put: (x := x successor)].
x := 2.0.
10001 to: 20000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
20001 to: 30000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.5.
30001 to: 40000 do: [:i | sample at: i put: (x := x successor)].
x := 1.125.
40001 to: 50000 do: [:i | sample at: i put: (x := x predecessor)].
x := 1.125.
50001 to: 60000 do: [:i | sample at: i put: (x := x successor)].
(3 to: 10) collect: [:i |
    | setOfErrors |
    setOfErrors := sample collect: [:f | ((1 to: 1<<i) inject: 0 into: [:acc :ignored | acc+f]) - (1<<i*f) / f ulp] as: Set.
    {i. setOfErrors size. setOfErrors max log2}].

1<<i is a left shift, so equivalent to 2^i.
So the error bounds for the case when n is a power of two, for the above significand sample, are:

 #(#(3 5 4.0) #(4 9 6.0) #(5 17 8.0) #(6 33 10.0) #(7 65 12.0) #(8 129 14.0) #(9 257 16.0) #(10 513 18.0)) 

That is a maximum error of 2^(2*(i-1))*ulp(f) when n=2^i, with all the 2^(i-1)+1 possible combinations of k*2^i*ulp(f) errors, k=-2^(i-2)...2^(i-2).

That's many different possibilities until we reach 2 raised to the number of significand bits... So I doubt that you can find any easy function of the bit pattern, even just for the default rounding mode.
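For readers more comfortable with C, here is a rough analogue of the experiment above (my sketch: it samples only significands just above 1.0 and assumes an IEEE binary32 float with round-to-nearest, so the exact numbers will differ a little from the Smalltalk sample):

#include <float.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int i = 3; i <= 10; i++)
    {
        long n = 1L << i;
        double worst = 0;
        for (int k = 0; k < 10000; k++)
        {
            float f = 1.0f + k * FLT_EPSILON;        /* exact for k < 2^23 */
            float acc = 0;
            for (long j = 0; j < n; j++) acc += f;   /* naive repeated sum */
            /* n*f is exact in double: a 24-bit significand times n <= 2^10. */
            double err = fabs((double)acc - n * (double)f) / FLT_EPSILON;
            if (err > worst) worst = err;
        }
        printf("n = 2^%d: worst observed error = %g ulp(f)\n", i, worst);
    }
    return 0;
}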

EDIT

With the ArbitraryPrecisionFloat package for Squeak Smalltalk, I tried to generalize the formulation for a Float with p significand bits.

For example, with 11 bit significand:

p := 11.
(3 to: p + 1) collect: [:power |
    | n setOfErrors |
    n := 1 << power.
    setOfErrors := (1 << p to: 2<<p - 1) collect: [:significand |
        | f acc |
        f := (significand asArbitraryPrecisionFloatNumBits: p) timesTwoPower: 0 - p.
        acc := 0. n timesRepeat: [acc := acc + f].
        (n*f-acc/(acc ulp)) asFraction] as: Set.
    {power. setOfErrors max log2 printShowingDecimalPlaces: 2. setOfErrors size}].

We get these results:

#(#(3 '1.00' 5) #(4 '2.00' 9) #(5 '3.00' 17) #(6 '3.91' 32) #(7 '4.91' 61) #(8 '5.88' 120) #(9 '6.77' 220) #(10 '7.41' 354) #(11 '7.41' 512) #(12 '10.00' 1024))  

Note that past 2^(p+1) operands, the sum would not change anymore.

Up to power = floor(p/2), there are 2^(power-1)+1 different results, as shown by the previous sample.

Then the number of different results increases more slowly, but at the end, it reaches 2^(p-1), that is 1 different result for every 2 different significands.

This indicates that notQuiteFMA is going to be rather complex.

Inquiring about the pattern of (n*f-acc/(acc ulp)), we see that it seems to depend on the ceil(log2(n)) last bits of f's significand, until the sum reaches a new binade...

So, maybe you could investigate something like:

  • what is the binade (exponent) of n*f versus that of f?
  • does the sequence of errors of n*(f+k*ulp(f)) follow a predictable pattern for the successors of f?

If you can answer positively, then there is some chance that you can devise an accelerated notQuiteFMA function...

aka.nice
  • Re “This is a tough problem”: [Naw, it is easy.](https://stackoverflow.com/a/76949252/298225) – Eric Postpischil Aug 22 '23 at 00:26
  • @EricPostpischil ah yes, once you see that the exact same quantity is added at each floating point addition, until the accumulator pass the next binade, this becomes simpler. Great answer. – aka.nice Aug 22 '23 at 08:57
-2

I can answer this question, although I don't think you will like it.

The value will stop incrementing as soon as the sum would need more than 53 significant bits. We can compute where that happens. Here is a Python script that shows the point at which adding 1 to a float no longer changes the float:

import struct

j = 1
for i in range(1,58):
    j = 2*j
    r = float(j)
    s = r+1
    if r == s:
        pat = struct.pack('>d',r)
        print(pat.hex())
        print(i,j,r,s)
        break

Output:

4340000000000000
53 9007199254740992 9007199254740992.0 9007199254740992.0

The first line is the hex representation of that value. So, it is the hidden 1 with an exponent of 53.

Tim Roberts
  • I don't think you understood OP's question. OP is aware that for any `f` at some point `acc` is big enough that `acc + f == acc` in floating point math. – chtz Aug 21 '23 at 08:35
  • I disagree. He asked if there was a way to exactly replicate the result of his initial pseudocode without requiring 2^256 loops. I have done exactly that. In every case, the loop stops being useful when the sum and the addend exponents differ by 52. – Tim Roberts Aug 21 '23 at 17:08
  • OP wanted to exclude a `2^256` LUT which would result in an `O(1)` algorithm. But sure, technically you are right, `O(min(2^53,n))` is technically the same as `O(1)`, but practically, it is still `O(n)`. – chtz Aug 22 '23 at 08:35
  • I guess I was trying to make a different point. I wasn't really proposing this algorithm as a method to solve his problem. I was trying to point out that there is an analytic method of determining the exponent mismatch point, where additional additions are discarded, without doing any computing at all. – Tim Roberts Aug 22 '23 at 18:29