The precision of a large floating point sum.2

Question

I've read that when adding up large quantities of floating points, the best way to do it is to sum from lowest to highest. I wrote a code that did exactly that but the sum from highest to lowest was more precise and I didn't understand why, I asked it here. I accepted an answer that said it was because local arrays are created on the stack, where space is limited. This code from the answer didn't use any array and therefore it was more precise:

#include <stdio.h>

int main() {

double pi2over6 = 1.644934066848226;
double sum = 0.0, sum2 = 0.0;
double y;
int i, n;

printf("Enter number of iterations:\n");
scanf("%d", &n);

y = 1.0;

for (i = 0; i < n; i++) {
    sum += 1.0 / (y * y);
    y += 1.0;
}

for (i = 0; i < n; i++) {
    y -= 1.0;
    sum2 += 1.0 / (y * y);
}
printf("sum from biggest to smallest is %.16f\n", sum);
printf("and its error %.16f\n", pi2over6 - sum);
printf("sum from smallest to biggest is %.16f\n", sum2);
printf("and its error %.16f\n", pi2over6 - sum2);
return 0;

And yes, when the input is 1000000000 it is significantly more precise, if the input is, for example 1000 I still have the same problem. Why?

Please explain how you know the "correct" answer and hence the precision of the computed answers. And the second loop is relying on the final `y` from the first loop, which as `y` is `double` does not seem guaranteed to to be correct. How confidant are you that `1.644934066848226` is an exact representation? — Weather Vane, Mar 10 '18 at 22:29
The sum 1/k^2 from k=1 to k=n as n-> infinity converges to Pi/6, that is 1.644934966848226... I don't understand the second part of your comment, could you elaborate? — codingnight, Mar 10 '18 at 22:34
Look at it this way: if you add a million very small floating point numbers to a very large one, none of them might change it. But if you sum the million small values first, then adding that to the large value might be significant. — Weather Vane, Mar 10 '18 at 22:45
The “code from the answer” is not more precise for not using an array. It avoids crashing for nor using an array. What is your question? What do you observe and what did you expect? We have no idea what “the same problem” is. — Pascal Cuoq, Mar 10 '18 at 22:47
@PascalCuoq My question is: Why when the input is 1000000000 the sum from lowest to highest is more precise than the sum from highest to lowest, but when the input is 1000, it is not? The sum from lowest to highest should always be more precise, right? — codingnight, Mar 10 '18 at 22:49
In this question and in your other question, you make the same mistake of not saying what you see. You claim that “the sum that starts on the highest number is more precise”. We don't believe you. What result do you get? — Pascal Cuoq, Mar 10 '18 at 22:50
please state the expected behavior and the behavior you observe. "More precise" tells us **nothing**. What is more precise than what? — bolov, Mar 10 '18 at 22:54
I expect that the sum from lowest to highest to be closer to the value 1.644934066848226 than the sum from highest to lowest, however, when the input is [1000](https://wandbox.org/permlink/VgpH87yrsI8fsOb8) the sum from highest to lowest is closer to 1.644934066848226. You can see this when the program prints the two sums and its errors. I know for certain that both sums converge to 1.644934066848226. — codingnight, Mar 10 '18 at 23:02
When the input is 1000, the difference between the sum and the ideal value is about 1e-3. And the difference between the two sums is about 2e-15 (which is 8 LSBs). Which means that you're dealing with random rounding errors. You are nowhere near the point that the order matters. — user3386109, Mar 10 '18 at 23:53
Because if `k` is 1000, then `1/k^2` is 1e-6. And you're adding that value to a sum that's about 1.6 using a double that has precision down to about 1e-15. So the value that you're adding is not small compared to the precision. Order doesn't matter until the values are small compared to the precision. — user3386109, Mar 11 '18 at 01:01

Eric Postpischil · Accepted Answer · 2018-03-11T03:10:29.717

The sum computed from lowest to highest is more accurate; it differs from the exact result by about 1 ULP, while the sum computed from highest to lowest differs by about 7 ULP.

It is erroneous to compare the sums for 1000 terms to the infinite sum. The difference between the exact mathematical sum of the first 1000 terms and the infinite sum is much larger than the error that occurs in computing the former sum. The three sums of 1000 terms (exact mathematical, from lowest to highest, from highest to lowest) are within 8 ULP of each other, while the sum of the infinite series is 4.5 trillion ULP higher.

(Additionally, the estimate for π²/6 is inaccurate. Your program has 1.644934066848226. In IEEE 754 basic 64-bit binary floating-point (which I will use for this answer), that rounds to exactly 1.6449340668482259619764818125986494123935699462890625. However, 1.6449340668482264060656916626612655818462371826171875 is closer to π²/6.)

The exact mathematical sum of the first 1000 terms is near 1.6439345666815598031390580238222155896521034464937. This was calculated with extended precision using Maple. The closest representable value to that is 1.6439345666815599056320706949918530881404876708984375. I will call this the best sum, as it is the closest possible any computed sum could be to the exact mathematical value, since no representable value is closer.

The sum computed from highest to lowest is 1.643934566681561459944305170211009681224822998046875. This differs from the best sum by 1.5543122344752191565930843353271484375e-15, which is 7 ULP. (An ULP is the smallest amount by which a representable value may be adjusted upward; it is the step size of the representable values. The ULP is a function of the magnitude of a number; it is larger for larger numbers.) A difference of 7 ULP means the number is 7 steps away from the best sum.

The sum computed from lowest to highest is 1.643934566681559683587465769960545003414154052734375. This differs from the best sum by -2.220446049250313080847263336181640625e-16, which is 1 ULP in the other direction (it is below the best sum).

Thus, the sum computed from lowest to highest is more accurate.

The infinite sum is 4.5 trillion ULP higher.

So, relative to the best sum, the numbers are in this order:

The sum computed from lowest to highest is at −1 ULP.
The best sum is the reference point, at 0 ULP.
The sum computed from highest to lowest is at +7 ULP.
π²/6 is at 4.5 trillion ULP.

Now it is clear that the sum computed from lowest to highest is farther from π²/6 because it is on the other side of the best sum. But it is closer to the best sum.

In any case, there is no absolute rule that computing a sum from lowest term to highest term produces the best answer. This is a general guideline, based on the notion that errors are smaller while numbers are smaller.

Every time you add two numbers in floating-point, there may be a small error, as the exact mathematical result must be rounded to a representable value. The errors are always at most ½ ULP of the result (because, if a representable value is more than ½ ULP from the exact mathematical result, then the next representable value in the correct direction is closer than ½ ULP).

However, the errors can vary between 0 and ½ ULP, and they can also be positive or negative. Suppose you add a series of random numbers. It is possible that, while adding numbers from highest to lowest, we will encounter a mix of positive and negative errors that just happen to largely cancel out, producing a final result near the exact mathematical result by chance. At the same time, adding numbers from lowest to highest might happen to encounter a lot of positive errors, thus accumulating into a large error.

Adding numbers of the same sign from lowest magnitude to highest magnitude tends to produce a better result, but it is not an absolute rule.

Also, if the numbers are of mixed signs, it may be advantageous to choose which numbers to add to keep the running sum near zero, rather than always choosing the next number of the lowest magnitude.

The precision of a large floating point sum.2

1 Answers1