
Here is pseudocode of my problem.

I have an array of IEEE 754 double precision positive numbers.

The array can come in a random order, but the numbers are always the same, just scrambled in their positions. These numbers can also span a very wide range within the valid IEEE double range.

Once I have the list, I initialize a variable:

double sum_result = 0.0;

And I accumulate the sum on sum_result, in a loop over the whole array. At each step I do:

sum_result += my_double_array[i];

Is it guaranteed that, whatever the order of the initial array of double, if the numbers are the same, the printed out sum result will be always the same?

turbopapero
  • Instead of _pseudo code_ give us a [MCVE] with sample inputs and expected vs current outputs please. – user0042 Nov 14 '17 at 22:41
  • I can pretty much guarantee the result will almost always be different. – Mooing Duck Nov 14 '17 at 22:42
  • No, it is not guaranteed. For example, if all of the large values are first, then the small values might drop off entirely. – stark Nov 14 '17 at 22:42
  • I can't share the code, but to me it is not needed, as this is a purely theoretical question, not a request of debug. – turbopapero Nov 14 '17 at 22:43
  • @user0042 - I think this is a perfectly legitimate use of pseudo code, and if you read the entire question you can see that it makes sense without a specific example. – bennji_of_the_overflow Nov 14 '17 at 22:45
  • These are the answers I was expecting but I was looking for a bit more details. Can anyone give details or a clear explanation of this effect in a complete answer to the problem? – turbopapero Nov 14 '17 at 22:47
  • @stark why do people come to Stack Overflow telling other people to stop asking and start reading something, somewhere on the internet? To me this is like being in a disco and asking them to stop the music. – turbopapero Nov 14 '17 at 22:54
  • @bennji_of_the_overflow Well, no. Some working experimental code showing evidence for doubts about stability would be essential as a research effort. It's quite easy to set up some test cases with varying orders of an array of double values and check whether their results are the same. – user0042 Nov 14 '17 at 23:00
  • @user0042 there's always a way to put a test case in place; the difficulty depends only on the writer. Here I'm not trying to demonstrate whether it's true or not; instead I'm looking for a detailed answer about this effect. By setting up a test case I'm just doing trial and error until something happens, but the test case doesn't tell me anything about the reason behind this effect. – turbopapero Nov 14 '17 at 23:05
  • " this is a purely theoretical question" - So you don't mind removing the language tag? And in that case, the answer would be "no problem", as there is no specific format or range, etc. of the values presented. The fact you output something with C++ does not contribute to the question. You can answer this yourself much better within your constraints if you learned about _practical_ issues of floats. That's vital knowledge anyway (and you need it to understand an answer). – too honest for this site Nov 14 '17 at 23:07
  • Are the numbers all positive (or all negative), or of mixed signs? I think mixed signs will give the worst and most surprising results. – aschepler Nov 14 '17 at 23:14
  • I will update the question with some extra details... – turbopapero Nov 14 '17 at 23:16
  • @aschepler: Why? As long they are within range, I don't see a difference. – too honest for this site Nov 14 '17 at 23:16
  • @aschepler: All else being equal, mixed signs could produce better results, as they would tend to lower the magnitude of the running sum, which makes fine precision available for subsequent additions. Although low values could themselves be lost if an incoming value were huge. I suspect that if the numbers in the array have some sort of “even” distribution, not varying hugely in magnitude, mixed signs would often give better results than homogeneous signs. – Eric Postpischil Nov 15 '17 at 00:17
  • @auserdude There are a number of comments [1](https://stackoverflow.com/questions/47296419/result-of-the-sum-of-random-ordered-ieee-754-double-precision-floats/47299781#comment81544390_47296419) [2](https://stackoverflow.com/questions/47296419/result-of-the-sum-of-random-ordered-ieee-754-double-precision-floats/47299781#comment81544826_47296419) here that would be best included in the question itself if they are truly important to an answer that is sought. – chux - Reinstate Monica Nov 15 '17 at 17:35

3 Answers


No.

As a simple example, adding 1 to 0x1p53 yields 0x1p53. (This uses hexadecimal floating-point notation. The part before the “p” is the significand, expressed in hexadecimal the same as a C hexadecimal integer constant, except that it may have a “.” in it to mark the start of a fractional part. The number following the “p” represents a power of two by which the significand is multiplied.) This is because the mathematically exact sum, 0x1.00000000000008p+53, cannot be represented in IEEE-754 64-bit binary floating-point, so it is rounded to the nearest value with an even low bit in its significand, which is 0x1p53.

Thus, 0x1p53+1 yields 0x1p53. So 0x1p53+1+1, evaluated left to right, also yields 0x1p53. However, 1+1 is 2, and 2+0x1p53 is exactly representable, as 0x1.0000000000001p+53, so 1+1+0x1p53 is 0x1.0000000000001p+53.

To show a more easily visualizable example in decimal, suppose we have only two decimal digits. Then 100+1 yields 100, so 100+1+1+1+1+1+1 yields 100. But 1+1+1+1+1+1+100 accumulates to 6+100 which then yields 110 (due to rounding to two significant digits).

Eric Postpischil
  • So, it seems that the only way to have the same result at each occurrence is to sort the array first. – turbopapero Nov 15 '17 at 00:24
  • @auserdude: The potential bits in the `double` value span magnitudes from 0x1p-1074 (the lowest bit of a subnormal value) to 0x1p1023 (the highest bit of the largest finite value). Therefore with a data structure containing 1023 - -1074+1 = 2098 bits, you can construct an adder capable of performing the necessary arithmetic exactly, provided there is no overflow. – Eric Postpischil Nov 15 '17 at 00:26
  • Yes, I was aware of the extension of the representation but if I can't create such a structure it seems the only way is to sort the array, in case a deterministic result is needed, right? – turbopapero Nov 15 '17 at 00:30
  • @auserdude: You might also see https://github.com/python/cpython/blob/6545256df93ba54f811206107274cfa5a6d76b86/Modules/mathmodule.c#L1299-L1416 for a correctly-rounded (and hence order-invariant) summation algorithm – Mark Dickinson Nov 15 '17 at 13:09

Is it guaranteed that, whatever the order of the initial array of double, if the numbers are the same, the printed out sum result will be always the same?

No, FP addition is not associative. Remember it is called floating point: the precision "floats" relative to the magnitude of the value. Any given operation, such as addition (+), is subject to round-off error.

Yet if the sum is done and the inexact flag is clear, then yes, the order was not relevant.**

A simple counterexample:

#include <fenv.h>
#include <float.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  double a[3] = { DBL_MAX, -DBL_MAX, 1.0 };

  feclearexcept(FE_ALL_EXCEPT);
  printf("%e\n", (a[0] + a[1]) + a[2]);
  printf("Inexact %d\n", !!fetestexcept(FE_INEXACT));

  feclearexcept(FE_ALL_EXCEPT);
  printf("%e\n", a[0] + (a[1] + a[2]));
  printf("Inexact %d\n", !!fetestexcept(FE_INEXACT));

  printf("%d\n", FLT_EVAL_METHOD);
  return EXIT_SUCCESS;
}

Output

1.000000e+00  // Sum is exact
Inexact 0

0.000000e+00  // Sum is inexact
Inexact 1

0    // evaluate all operations ... just to the range and precision of the type;

Depending on FLT_EVAL_METHOD, FP math may use wider precision and range, yet the above extreme example sums will still differ.

** aside from maybe a result of 0.0 vs -0.0


To see why, try a base-10 example with 4 digits of precision. The same principle applies to double with its usual 53 binary digits of precision.

a[3] = +1.000e99, -1.000e99, 1.000
sum = a[0] + a[1]   // sum now exactly 0.0 
sum += a[2]         // sum now exactly 1.0 
// vs.
sum = a[1] + a[2]   // sum now inexactly -1.000e99
sum += a[0]         // sum now inexactly 0.0

Re: "printed out sum result will be always the same": Unless code prints with "%a" or "%.*e" with high enough precision, the text printed may lack significance, and two different sums may look the same. See Printf width specifier to maintain precision of floating-point value

chux - Reinstate Monica

Let's just take an example: I'm transposing the floating-point problem to a model in base 10 with only 2 significant digits to keep it simple, with operation results rounded to nearest.

Say we must sum the 3 numbers 9.9 + 8.4 + 1.4.
The exact result is 19.7, but we have only two digits, so it should be rounded to 20.

If we first sum 9.9 + 8.4, we get 18.3, which is then rounded to 18.
We then sum 18. + 1.4 and get 19.4, rounded to 19.

If we first sum the last two terms, 8.4 + 1.4, we get 9.8; no rounding is required yet.
Then 9.9 + 9.8 gives 19.7, rounded to 20.: a different result.

(9.9 + 8.4) + 1.4 differs from 9.9 + (8.4 + 1.4): the sum operation is not associative, and this is due to intermediate rounding. We could exhibit similar examples with other rounding modes too...

The problem is exactly the same in base 2 with a 53-bit significand: intermediate rounding causes the non-associativity, whatever the base or significand length.

To eliminate the problem, you could either sort the numbers so that the order is always the same, or eliminate the intermediate rounding and keep only the final one, for example with a super accumulator like this https://arxiv.org/pdf/1505.05571.pdf
...Or just accept living with an approximate result (it's up to you to analyze the average or worst-case error and decide whether it is acceptable...).

aka.nice