My Java textbook states that adding floating-point numbers from biggest to smallest is less accurate than adding them from smallest to biggest, but it doesn't go on to clearly explain why this is the case.
-
Please read [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html) – Jim Garrison Jan 06 '18 at 05:57
-
@JimGarrison Which part? If you're going to post a link to an article, it would be useful to say how it answers the question. – Simon Byrne Jan 06 '18 at 06:39
-
All of it. The question indicates a lack of understanding of floating point. The article is recognized as a canonical exposition of what floating point is and how it works. If you read and understand the article, then the answer to this question [***and many others***](https://stackoverflow.com/q/588004/18157) will be obvious. The article's title really says it all. You cannot work effectively with floating point without understanding the article's contents. – Jim Garrison Jan 06 '18 at 06:42
-
You certainly can. That article is honestly a terrible way to learn about floating point, and is prescribed way too often by people on SO: you don't need to know the minutiae of how sticky and guard bits work to get a good grasp of floating point. – Simon Byrne Jan 06 '18 at 06:53
2 Answers
Floating point has a limited number of digits of precision (6 for `float`, 15 for `double`). The calculation `1.0e20d + 1` gives the result `1.0e20` because there is not enough precision to represent the number `100,000,000,000,000,000,001`.
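This loss is easy to verify directly; a minimal sketch in Java (class name is mine, just for illustration):

```java
public class PrecisionDemo {
    public static void main(String[] args) {
        double big = 1.0e20d;
        // 1 is far smaller than the gap between adjacent doubles near 1.0e20,
        // so the addition rounds back to the same value.
        System.out.println(big + 1 == big);   // prints "true"
        // The gap between adjacent doubles at this magnitude:
        System.out.println(Math.ulp(big));    // prints "16384.0"
    }
}
```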
If you start with the largest number, then any numbers more than `n` orders of magnitude smaller (where `n` is 6 or 15 depending on the type) will not contribute to the sum at all. Start with the smallest and you might sum several smaller numbers into one that is large enough to affect the final total.
Where it would make a difference is, for example:

`1.0e20 + 1.0e4 + 6.0e4 + 3.0e4`
Assuming it's exactly 15 decimal digits precision (it's not, see the linked article below, but 15 is good enough for the example), if you start with the larger number, none of the others will make a difference because they're too small. If you start with the smaller ones, they add up to 1.0e5, which IS large enough to affect the final total.
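The same effect can be reproduced with real `double` arithmetic; a sketch using values chosen by me for illustration (1.0 plus terms just below half an ulp of 1.0), not taken from the answer above:

```java
public class OrderDemo {
    // Large value first: every small term is below half an ulp of 1.0,
    // so each individual addition is rounded away.
    static double sumLargeFirst() {
        double s = 1.0;
        for (int i = 0; i < 10; i++) {
            s += 1.0e-16;
        }
        return s;
    }

    // Small values first: the tiny terms accumulate to about 1.0e-15,
    // which is large enough to survive the final addition to 1.0.
    static double sumSmallFirst() {
        double s = 0.0;
        for (int i = 0; i < 10; i++) {
            s += 1.0e-16;
        }
        return s + 1.0;
    }

    public static void main(String[] args) {
        System.out.println(sumLargeFirst());  // prints "1.0": every small term was lost
        System.out.println(sumSmallFirst());  // slightly above 1.0
    }
}
```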
Please read [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html)

@ayejaysee It's not, but let's say precision is 6 digits, in which case `1234560 + 4 + 4 + 4 = 1234560 + 4 + 4 = 1234560 + 4 = 1234560`, where each addition discards the 7th digit. But `4 + 4 + 4 + 1234560 = 8 + 4 + 1234560 = 12 + 1234560 = 1234570`, because the 7th digit is not discarded until the last addition operation, and therefore the carryover is retained. – Andreas Jan 06 '18 at 06:27
An excellent explanation is available in section 4.2 of "Accuracy and Stability of Numerical Algorithms" by Nick Higham. Below is my casual interpretation of this:
The key property of floating point is that when the result of an individual operation cannot be exactly represented, it is rounded to the nearest value. This has many consequences, namely that addition (and multiplication) is no longer associative.
The other main thing to note is that the error (the difference between the true value and the rounded value) is relative. If we use square brackets (`[]`) to denote this rounding operation, then we have the following property for any number `x`:
|[x] - x| <= ϵ |[x]| / 2
where ϵ is the machine epsilon.
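This bound can be checked numerically; a sketch (my own, not from Higham's book) using `Math.ulp(1.0)` for ϵ and `BigDecimal` to obtain the exact, unrounded sum:

```java
import java.math.BigDecimal;

public class RoundingBound {
    public static void main(String[] args) {
        double x1 = 0.1, x2 = 0.2;
        double rounded = x1 + x2;  // [x1 + x2], the rounded result
        // new BigDecimal(double) is exact, so this is the true real-number sum
        BigDecimal exact = new BigDecimal(x1).add(new BigDecimal(x2));
        double err = exact.subtract(new BigDecimal(rounded)).abs().doubleValue();
        double eps = Math.ulp(1.0);  // machine epsilon for double, 2^-52
        // check |[x] - x| <= eps/2 * |[x]|
        System.out.println(err <= eps / 2 * Math.abs(rounded));  // prints "true"
    }
}
```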
So suppose that we want to sum up `[x1, x2, x3, x4]`. The obvious way to do it is via:
s2 = x1 + x2
s3 = s2 + x3 = x1 + x2 + x3
s4 = s3 + x4 = x1 + x2 + x3 + x4
As noted above, we can't do this exactly, so we're actually doing:
t2 = [x1 + x2]
t3 = [t2 + x3] = [[x1 + x2] + x3]
t4 = [t3 + x4] = [[[x1 + x2] + x3] + x4]
So how big is the resulting error `|t4 - s4|`? Well, we know that
|t2 - s2| = |[x1+x2] - (x1+x2)| <= ϵ/2 |t2|
Now by the triangle inequality we can write
|t3 - s3| = |[t2+x3] - (t2+x3) + (t2+x3) - (s2+x3)|
<= |[t2+x3] - (t2+x3)| + |t2 - s2|
<= ϵ/2 (|t3| + |t2|)
And again:
|t4 - s4| = |[t3+x4] - (t3+x4) + (t3+x4) - (s3+x4)|
<= |[t3+x4] - (t3+x4)| + |t3 - s3|
<= ϵ/2 (|t4| + |t3| + |t2|)
This leads to Higham's general advice:
In designing or choosing a summation method to achieve high accuracy, the aim should be to minimize the absolute values of the intermediate sums `ti`.
So if you're doing sequential summation (like we did above), then you want to start with the smallest elements, as that will give you the smallest intermediate sums.
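A sketch of that strategy (my own illustration, assuming nonnegative inputs so that plain ascending order is also ascending in magnitude):

```java
import java.util.Arrays;

public class SortedSum {
    // Sequential sum after sorting ascending; for nonnegative inputs this
    // keeps the intermediate sums as small as possible.
    static double sumAscending(double[] xs) {
        double[] sorted = xs.clone();
        Arrays.sort(sorted);
        double s = 0.0;
        for (double x : sorted) s += x;
        return s;
    }

    public static void main(String[] args) {
        double[] xs = {1.0, 1.0e-16, 1.0e-16, 1.0e-16, 1.0e-16};
        double forward = 0.0;
        for (double x : xs) forward += x;           // large element first: small terms lost
        System.out.println(forward == 1.0);         // prints "true"
        System.out.println(sumAscending(xs) > 1.0); // prints "true": small terms accumulated first
    }
}
```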
But that is not the only option! There is also pairwise summation, where you add up pairs in a tree form (e.g. `[[x1 + x2] + [x3 + x4]]`), though this requires allocating a work array. You can also utilise SIMD vectorisation, by storing the intermediate sums in a vector, which can give both speed and accuracy improvements.
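A minimal recursive sketch of pairwise summation (my own illustration; an iterative version with a work array is also possible, as noted above):

```java
public class PairwiseSum {
    // Recursively sum xs[lo, hi) by splitting in half, so each element
    // participates in only O(log n) additions rather than O(n).
    static double pairwise(double[] xs, int lo, int hi) {
        if (hi - lo == 1) return xs[lo];
        if (hi - lo == 2) return xs[lo] + xs[lo + 1];
        int mid = lo + (hi - lo) / 2;
        return pairwise(xs, lo, mid) + pairwise(xs, mid, hi);
    }

    public static void main(String[] args) {
        double[] xs = new double[1 << 20];
        java.util.Arrays.fill(xs, 0.1);
        double naive = 0.0;
        for (double x : xs) naive += x;
        double paired = pairwise(xs, 0, xs.length);
        double truth = 0.1 * xs.length;       // exact: scaling by a power of two
        // With identical elements and a power-of-two length, every pairwise
        // addition is an exact doubling, so the pairwise result has no error.
        System.out.println(paired - truth);   // prints "0.0"
        System.out.println(naive - truth);    // accumulated rounding error of the sequential sum
    }
}
```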
