6

I am trying to sum a sorted array of positive, decreasing floating-point numbers. I have read that the best way to sum them is to start adding from the lowest value to the highest. I wrote this code as an example of that; however, the sum that starts with the highest number turns out to be more precise. Why? (Of course, the sum of 1/k² should be f = 1.644934066848226, i.e. π²/6.)

#include <stdio.h>
#include <math.h>

int main() {

    double sum = 0;
    int n;
    int e = 0;
    double r = 0;
    double f = 1.644934066848226;
    double x, y, c, b;
    double sum2 = 0;

    printf("introduce n\n");
    scanf("%d", &n);

    double terms[n];

    y = 1;

    /* forward pass: add terms from the biggest (k = 1) down to the smallest (k = n) */
    while (e < n) {
        x = 1 / ((y) * (y));
        terms[e] = x;
        sum = sum + x;
        y++;
        e++;
    }

    /* step back to the last (smallest) term */
    y = y - 1;
    e = e - 1;

    /* backward pass: add terms from the smallest (k = n) up to the biggest (k = 1) */
    while (e != -1) {
        b = 1 / ((y) * (y));
        sum2 = sum2 + b;
        e--;
        y--;
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", f - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", f - sum2);
    return 0;
}
ks1322
codingnight
  • In a [test with Wandbox and n=50](https://wandbox.org/permlink/vH05qEPIFz9qz8fW), the sum from smallest to biggest had higher precision than the sum from biggest to smallest. Could you give us an example that demonstrates "the sum that starts on the highest number is more precise"? – MikeCAT Mar 03 '18 at 23:31
  • If you add small floating-point numbers to a very large sum, that might make no difference. But if you sum the small numbers first, significance is retained until the large numbers are reached. – Weather Vane Mar 03 '18 at 23:35
  • It's just a bug in your code. Even with `n` equal to 10,000,000 you don't get anywhere close to the point where the order matters. The difference between the two sums is 5 orders of magnitude less than the difference between the sums and the target value. – user3386109 Mar 03 '18 at 23:42
  • @MikeCAT for example, use n=5002 – codingnight Mar 04 '18 at 12:18
  • Off-topic: why do you write `1 / ((y) * (y))` and not `1 / (y * y)`? – i486 Mar 04 '18 at 12:48
  • Regardless of the reason for the difference in particular, have a look at https://en.wikipedia.org/wiki/Kahan_summation_algorithm (just a pointer, not sure how much of an advantage it brings here; a sketch appears right after these comments). – Marco13 Mar 04 '18 at 13:51
  • For very large collections, say 10^n elements, the latest entries will suffer a loss of precision of about n digits relative to the accumulator, unless the data follow a geometric progression... That's why I prefer a divide-and-conquer approach like in https://stackoverflow.com/questions/13417670/precise-sum-of-floating-point-numbers/13436556#13436556 (don't look at the implementation, which does not scale well; just keep the idea). – aka.nice Mar 05 '18 at 17:49
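
As promised above, here is a minimal sketch of Kahan (compensated) summation applied to this series. The variable names and the hard-coded `n` are illustrative assumptions, not code from the question:

#include <stdio.h>

int main(void) {
    double sum = 0.0;  /* running sum */
    double c = 0.0;    /* running compensation for lost low-order bits */
    int n = 1000000;   /* illustrative term count */
    int k;

    for (k = 1; k <= n; k++) {
        double term = 1.0 / ((double)k * (double)k);
        double y = term - c;  /* apply the stored compensation */
        double t = sum + y;   /* low-order bits of y may be lost here */
        c = (t - sum) - y;    /* recover what was just lost */
        sum = t;
    }
    printf("Kahan sum, biggest term first: %.16f\n", sum);
    return 0;
}

Note that the compensation only survives if the compiler keeps the floating-point operations as written, so avoid flags like -ffast-math, which may optimize `(t - sum) - y` away.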

2 Answers

5

Your code creates an array `double terms[n];` on the stack, which puts a hard limit on the number of iterations that can be performed before your program crashes.

But you don't even fetch anything from this array, so there's no reason to have it there at all. I altered your code to get rid of `terms[]`:

#include <stdio.h>

int main() {

    double pi2over6 = 1.644934066848226;
    double sum = 0.0, sum2 = 0.0;
    double y;
    int i, n;

    printf("Enter number of iterations:\n");
    scanf("%d", &n);

    y = 1.0;

    /* biggest-first pass: k = 1 .. n */
    for (i = 0; i < n; i++) {
        sum += 1.0 / (y * y);
        y += 1.0;
    }

    /* smallest-first pass: y steps back down from n to 1 */
    for (i = 0; i < n; i++) {
        y -= 1.0;
        sum2 += 1.0 / (y * y);
    }
    printf("sum from biggest to smallest is %.16f\n", sum);
    printf("and its error %.16f\n", pi2over6 - sum);
    printf("sum from smallest to biggest is %.16f\n", sum2);
    printf("and its error %.16f\n", pi2over6 - sum2);
    return 0;
}

When this is run with, say, a billion iterations, the smallest-first approach is considerably more accurate:

Enter number of iterations:
1000000000
sum from biggest to smallest is 1.6449340578345750
and its error 0.0000000090136509
sum from smallest to biggest is 1.6449340658482263
and its error 0.0000000009999996
r3mainer
  • I understand this code is better since it lets you calculate more iterations. However, why does using the array make the program not work for large values of n? – codingnight Mar 04 '18 at 12:25
  • @codingnight That's because local arrays like `terms[]` are created on the stack, where space is limited. The maximum size of a stack frame differs from one system to the next, but if your array needs more than a few KB (bearing in mind that one `double` occupies 8 bytes), then the stack is probably not what you need. Instead, if you need a large array, it's better to use routines like `malloc()` or `calloc()` to reserve space on the heap, where you will have several GB to play with (a heap-based sketch follows these comments). And of course, if you can get by without using an array at all, that would be better still :-) – r3mainer Mar 04 '18 at 13:10
  • I am sorry to comment this late, but testing your program with 1000 iterations still gives the same error I had; the sum from biggest to smallest is more precise: [test](https://wandbox.org/permlink/VgpH87yrsI8fsOb8) – codingnight Mar 10 '18 at 15:17
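
To illustrate the stack-versus-heap point in the comment above, here is a minimal sketch that keeps `terms[]` but allocates it with `malloc()` on the heap. The size `n` and the error handling are illustrative assumptions, not part of the original answer:

#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int n = 10000000;  /* 10 million doubles = 80 MB: far too big for a stack VLA */
    int k;

    /* reserve space on the heap instead of the stack */
    double *terms = malloc((size_t)n * sizeof *terms);
    if (terms == NULL) {
        fprintf(stderr, "allocation failed\n");
        return 1;
    }

    double sum = 0.0;
    for (k = 0; k < n; k++) {
        terms[k] = 1.0 / (((double)k + 1.0) * ((double)k + 1.0));
        sum += terms[k];
    }
    printf("sum = %.16f\n", sum);

    free(terms);  /* give the memory back when done */
    return 0;
}
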
2

When you add two floating-point numbers with different orders of magnitude, the lower-order bits of the smaller number are lost.
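
For instance (a tiny made-up illustration, not from the answer itself): a double has a 53-bit significand, so adding 1.0 to 1e16 changes nothing at all, because the smaller addend lies entirely below the larger one's last representable bit:

#include <stdio.h>

int main(void) {
    double big = 1e16;  /* its integer value needs more than 53 bits: the ULP here is 2.0 */
    double small = 1.0;

    /* small falls below big's precision window, so the sum rounds back to big */
    printf("%d\n", big + small == big);  /* prints 1 */
    return 0;
}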

When you sum from smallest to largest, the partial sums grow like Σ 1/k² for k running from N down to n, which is approximately 1/n − 1/N (in blue), to be compared with the next term, 1/n².

When you sum from largest to smallest, the partial sums grow like Σ 1/k² for k running from 1 up to n, which is about π²/6 − 1/n (in green), again compared with the next term, 1/n².

It is clear that the second case loses many more low-order bits.
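
As a rough worked example (the numbers are added for illustration): take N = 10⁶ terms and look at the step n = 1000. Summing smallest-first, the partial sum is about 1/n = 10⁻³ while the incoming term is 1/n² = 10⁻⁶, a ratio of roughly 2¹⁰, so about 10 low-order bits of the term are lost. Summing largest-first, the partial sum is already about π²/6 ≈ 1.645, roughly 2²¹ times the term, so about 21 of its 53 significand bits are lost at that step.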

(Figure: the two partial-sum growth curves, 1/n − 1/N in blue and π²/6 − 1/n in green, plotted against the term size 1/n².)