14

Assume that we have an array of small (about 10^(-15)) double numbers in C++. If we calculate the sum of the numbers in this array sequentially, for example

double sum = 0;
for (int i = 0; i < n; i++) sum+=array[i];

we get some value x.

But if we divide the array into several parts, calculate the sum of each part, and then add all the partial sums together, we get some value x2, which is close to x but not exactly x. So I have lost accuracy in calculating the sum.

Does someone know how to calculate the sum of small double numbers by partitioning them into parts without losing accuracy?
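
For concreteness, here is a minimal sketch of the two summation orders being compared: one sequential pass versus per-part partial sums that are then combined. The data values are made up for illustration; with real data the two results typically differ only in the last digits, as described above.

#include <cstddef>
#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical data: many values around 1e-15, not all identical.
    std::vector<double> array;
    for (int i = 0; i < 1000000; ++i)
        array.push_back(1.0e-15 * (1.0 + 1.0e-7 * i));

    // Sequential sum (the "x" above).
    double x = 0.0;
    for (double v : array) x += v;

    // Partitioned sum (the "x2"): sum each part, then add the partial sums.
    const std::size_t parts = 4;
    const std::size_t chunk = array.size() / parts;
    double x2 = 0.0;
    for (std::size_t p = 0; p < parts; ++p)
    {
        double partial = 0.0;
        for (std::size_t i = p * chunk; i < (p + 1) * chunk; ++i)
            partial += array[i];
        x2 += partial;
    }

    std::printf("x  = %.17g\nx2 = %.17g\n", x, x2);
    return 0;
}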

MPelletier
  • 16,256
  • 15
  • 86
  • 137
Nurlan
  • 2,860
  • 19
  • 47
  • 64
  • Yes you can, but it requires you to turn them into the form `a/b`. Otherwise, in general, you can't, because that is not built into the C++ language. In fact, you really can't in every case, because you can't turn something like 0.66666666 back into a rational exactly. Maybe `Boost` will help. When you say `without losing accuracy`, you presumably mean without losing too much (you need some degree of expectation). – CppLearner Apr 26 '12 at 08:51
  • How do you know you have lost more accuracy with one method over the other? – juanchopanza Apr 26 '12 at 08:51
  • by writing the results to a file, for example: 1.9999920040000227000000000 vs 1.9999920040000230000000000, or 0.0000000118158038744022090 vs 0.0000000118158038365459910 – Nurlan Apr 26 '12 at 08:54
  • 2
    If they are all about equally small, then their smallness makes no difference to the accuracy of the result. If you multiply them all by 2^50, add them up, and then divide the result by 2^50, you will get exactly the same result. – TonyK Apr 26 '12 at 08:56
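
A quick check of the scaling argument in the comment above, with made-up data: multiplying by a power of two only changes the exponent, so the scaled sum, divided back, comes out the same as the direct sum (as long as nothing overflows or underflows).

#include <cmath>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<double> values(1000, 1.0e-15);  // hypothetical data, all tiny

    double direct = 0.0;
    for (double v : values) direct += v;

    // Scale by 2^50 (exact: only the exponent changes), sum, scale back.
    const double scale = std::ldexp(1.0, 50);
    double scaled = 0.0;
    for (double v : values) scaled += v * scale;
    scaled /= scale;

    std::printf("direct: %.17g\nscaled: %.17g\n", direct, scaled);  // expected to match
    return 0;
}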

8 Answers

22

Using Kahan Summation:

#include <numeric>
#include <iostream>
#include <vector>

struct KahanAccumulation
{
    double sum;
    double correction;
};

KahanAccumulation KahanSum(KahanAccumulation accumulation, double value)
{
    KahanAccumulation result;
    double y = value - accumulation.correction;     // apply the correction carried from the previous step
    double t = accumulation.sum + y;                // add to the running sum; low-order bits of y may be lost
    result.correction = (t - accumulation.sum) - y; // recover what was lost, to be subtracted from the next value
    result.sum = t;
    return result;
}

int main()
{
    std::vector<double> numbers = {0.01, 0.001, 0.0001, 0.000001, 0.00000000001};
    KahanAccumulation init = {0};
    KahanAccumulation result =
        std::accumulate(numbers.begin(), numbers.end(), init, KahanSum);

    std::cout << "Kahan Sum: " << result.sum << std::endl;
    return 0;
}

Output:

Kahan Sum: 0.011101


Peter Wood
  • 23,859
  • 5
  • 60
  • 99
  • Just to say, the example doesn't really prove anything as you'd get the same result with normal summation. I haven't worked in C++ since 2014 and don't have time to invest in improving it, but suggestions/edits are welcome. – Peter Wood Mar 10 '22 at 09:11
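
As the comment notes, the values in the answer's example are too benign to show a difference. A small made-up case where plain accumulation and the Kahan update above visibly disagree: start from 1.0 and add ten million terms of 1e-16, each of which is individually too small to register against 1.0.

#include <cstdio>

int main()
{
    // Plain accumulation: 1.0 + 1e-16 rounds back to 1.0 every time.
    double naive = 1.0;
    for (int i = 0; i < 10000000; ++i) naive += 1.0e-16;

    // Kahan accumulation (the same update as KahanSum above, written inline).
    double sum = 1.0, correction = 0.0;
    for (int i = 0; i < 10000000; ++i)
    {
        double y = 1.0e-16 - correction;
        double t = sum + y;
        correction = (t - sum) - y;
        sum = t;
    }

    std::printf("naive: %.17g\n", naive);  // stays at 1
    std::printf("kahan: %.17g\n", sum);    // close to 1.000000001
    return 0;
}
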
4

The absolute size of the numbers is not the issue.

If you want a more accurate summation, have you considered a compensated sum? http://en.wikipedia.org/wiki/Kahan_summation_algorithm

However, if you really mean without losing any accuracy, your result will not necessarily fit in a double. If this is really what you want, you could look at Algorithm 908 at http://dl.acm.org/citation.cfm?id=1824815 or similar.
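
For reference, here is a sketch of one refinement of the compensated sum described on the Wikipedia page linked above (Neumaier's variant of Kahan–Babuška), which also copes with the case where a term is larger in magnitude than the running sum; the function name is made up.

#include <cmath>
#include <vector>

double neumaier_sum(const std::vector<double>& values)
{
    double sum = 0.0;
    double c = 0.0;  // running compensation for lost low-order bits
    for (double v : values)
    {
        double t = sum + v;
        if (std::fabs(sum) >= std::fabs(v))
            c += (sum - t) + v;   // low-order bits of v were lost
        else
            c += (v - t) + sum;   // low-order bits of sum were lost
        sum = t;
    }
    return sum + c;  // apply the compensation once at the end
}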

George
  • 51
  • 4
3

The trick in those cases is to first sort the array from smallest to largest, and then sum the values in the loop you have written. That way, the accuracy is best.

You can also check the Kahan summation algorithm.
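
A minimal sketch of the sort-then-sum idea above (the function name is made up): sorting by magnitude lets the small terms combine before they meet a much larger running total.

#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

double sorted_sum(std::vector<double> values)  // takes a copy so the caller's order is untouched
{
    // Smallest magnitude first, so small terms accumulate before
    // they are added to a much larger running total.
    std::sort(values.begin(), values.end(),
              [](double a, double b) { return std::fabs(a) < std::fabs(b); });
    return std::accumulate(values.begin(), values.end(), 0.0);
}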

Jorge Leitao
  • 19,085
  • 19
  • 85
  • 121
  • The Kahan algorithm is the most general, but if the values are all more or less of the same magnitude, recursively dividing the array in 2 and summing each half will provide significantly better precision. – James Kanze Apr 26 '12 at 09:24
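
A minimal sketch of the recursive halving James Kanze describes (pairwise summation); the function name and base-case size are arbitrary.

#include <cstddef>

// Split the range in two, sum each half recursively, then add the halves.
// The intermediate sums stay comparable in magnitude, which limits rounding error.
double pairwise_sum(const double* data, std::size_t n)
{
    if (n <= 8)  // small base case: plain loop
    {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i) s += data[i];
        return s;
    }
    const std::size_t half = n / 2;
    return pairwise_sum(data, half) + pairwise_sum(data + half, n - half);
}
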
2

Consider applying the Kahan summation algorithm to your entire set or to each of your subsets.

There are other questions referencing this algorithm that can help you.
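
A sketch of that idea with made-up helper names: run a compensated (Kahan) sum over each subset, then run it again over the per-subset totals. The result is not guaranteed to be bitwise identical to a single global pass, but the error stays small in both stages.

#include <vector>

// Compensated (Kahan) sum of one subset.
double kahan_sum(const std::vector<double>& part)
{
    double sum = 0.0, c = 0.0;
    for (double v : part)
    {
        double y = v - c;
        double t = sum + y;
        c = (t - sum) - y;
        sum = t;
    }
    return sum;
}

// Sum each subset with Kahan, then combine the per-subset totals with Kahan again.
double kahan_sum_of_parts(const std::vector<std::vector<double>>& parts)
{
    std::vector<double> totals;
    for (const auto& part : parts) totals.push_back(kahan_sum(part));
    return kahan_sum(totals);
}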

Community
  • 1
  • 1
Zeugma
  • 31,231
  • 9
  • 69
  • 81
1

Double numbers are stored in the computer in the binary numeral system. That is why, when you see a double value in decimal notation, you in fact see the double value with some rounding (for instance, 0.1 is an infinite fraction in binary). You can do the same experiment with double values that are powers of 2 (for instance 2^(-30)) and then you will see that the values match.

The reason you observe a difference when you sum the double values in a different order is that after each calculation the result is rounded in the binary numeral system, so a small difference from the exact value appears.
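
A small demonstration of both points, with values picked purely for illustration: 0.1 has no finite binary representation, so repeated addition drifts, while 2^-30 is an exact double and the same experiment matches exactly.

#include <cmath>
#include <cstdio>

int main()
{
    // 0.1 is an infinite fraction in binary, so ten additions of it
    // do not give exactly 1.0.
    double tenths = 0.0;
    for (int i = 0; i < 10; ++i) tenths += 0.1;
    std::printf("ten times 0.1 -> %.17g\n", tenths);

    // 2^-30 is exactly representable, so these additions are all exact.
    const double p = std::ldexp(1.0, -30);  // 2^-30
    double sum = 0.0;
    for (int i = 0; i < 1024; ++i) sum += p;
    std::printf("1024 times 2^-30 equals 2^-20: %s\n",
                sum == std::ldexp(1.0, -20) ? "yes" : "no");
    return 0;
}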

Ivaylo Strandjev
  • 69,226
  • 18
  • 123
  • 176
1

Binary floating point numbers used to represent decimal numbers have more precision than accuracy. You have found one way of surfacing the difference.

Martin James
  • 24,453
  • 3
  • 36
  • 60
1

It could be that your individual summations are being optimised and performed in a register at 80 bits but then transferred back to 64-bit doubles (read about compiler switches). Naturally this would lose precision. If this is the case, then breaking up the array and adding the individual 64-bit sums would give a different answer from adding them all at 80 bits and converting the grand total back.

This may not be the reason, but it might be worth researching further. Look at the chosen answer to this question.
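
One way to probe this, with made-up data: accumulate the same values in double and in long double, converting back only at the end, and compare. Whether the two differ depends on the platform and compiler flags; long double is 80-bit extended on many x86 toolchains but is the same as double on others (MSVC, for example).

#include <cstdio>
#include <vector>

int main()
{
    // Hypothetical data around 1e-15.
    std::vector<double> values;
    for (int i = 0; i < 1000000; ++i)
        values.push_back(1.0e-15 * (1.0 + 1.0e-7 * i));

    // Accumulate in a 64-bit double ...
    double d = 0.0;
    for (double v : values) d += v;

    // ... and in long double, converting back to double only at the end.
    long double ld = 0.0L;
    for (double v : values) ld += v;

    std::printf("double accumulator:      %.17g\n", d);
    std::printf("long double accumulator: %.17g\n", static_cast<double>(ld));
    return 0;
}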

Community
  • 1
  • 1
Component 10
  • 10,247
  • 7
  • 47
  • 64
  • 1
    I don't think that this is really the problem. The problem is that as he advances in the table, the magnitude of the sum becomes large with respect to the magnitude of the individual values, and adding values of different magnitudes loses a lot of precision. – James Kanze Apr 26 '12 at 09:23
  • Well I said it *could* be the problem. I agree that adding numbers of different magnitudes can lose precision, but isn't he suggesting that they are *all* small '(about 10^(-15) )' and his problem is not that adding them all together loses precision but that adding them all together in one go gives a more precise answer than summing them in batches and then summing the totals of the batches? – Component 10 Apr 26 '12 at 10:34
  • The values in his array all have a similar magnitude. After he's added the first couple of hundred, however, the sum he's adding to has a noticeably greater magnitude. And the solution summing the batches is probably more precise, since the differences in magnitude will be less. – James Kanze Apr 26 '12 at 13:13
0

Loss of precision when adding numbers is no different for very small numbers than for normal-sized numbers. What may be relevant is: (a) are the RELATIVE differences in size between the numbers large? (b) do the numbers have different SIGNS?

The latter issue is usually what matters for addition precision. What you should do (maybe not completely optimal, but a fair shot, and easy to implement) is the following; a code sketch appears after the steps:

a) split them in subsets of positives and negatives respectively

b) sort each subset

Then

c) take the largest (in absolute size) from the two sets combined, and initialize your sum with that number, and remove it from its list

d) iteratively: whenever the current sum is positive, take the largest remaining negative, add it to the sum, and remove it from its list; whenever the current sum is negative, do likewise with the largest remaining positive.

In this way you have a fair chance that you have (almost) minimized the loss of precision to what is inherently unavoidable (given the representation of the numbers).
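
Here is a sketch of steps (a) through (d), with a made-up function name. It keeps the positives and negatives in sorted lists and always pulls the value whose sign opposes the current running sum, falling back to the other list when one side is exhausted.

#include <algorithm>
#include <cmath>
#include <functional>
#include <vector>

double sign_balanced_sum(const std::vector<double>& values)
{
    // a) split into positives and negatives
    std::vector<double> pos, neg;
    for (double v : values) (v >= 0.0 ? pos : neg).push_back(v);

    // b) sort each subset so the largest magnitude sits at the back
    std::sort(pos.begin(), pos.end());                          // ascending: back() is largest
    std::sort(neg.begin(), neg.end(), std::greater<double>());  // back() is most negative

    double sum = 0.0;
    while (!pos.empty() || !neg.empty())
    {
        bool take_neg;
        if (pos.empty())       take_neg = true;
        else if (neg.empty())  take_neg = false;
        else if (sum > 0.0)    take_neg = true;   // d) positive sum: add a negative
        else if (sum < 0.0)    take_neg = false;  // d) negative sum: add a positive
        else  // c) sum is zero (including the first step): take the larger magnitude
            take_neg = std::fabs(neg.back()) > pos.back();

        if (take_neg) { sum += neg.back(); neg.pop_back(); }
        else          { sum += pos.back(); pos.pop_back(); }
    }
    return sum;
}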

Bert te Velde
  • 823
  • 5
  • 8