This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.
In this answer, I assume these conditions:
- Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
- We have a sequence of numbers s_0, s_1, s_2, … s_(n−1) in that format, each of which is in (0, 1] and whose exact mathematical sum is 1.
(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)
Definition: ULP is the difference between 1 and the next greater representable value, the Unit of Least Precision. (Note that ULP, unqualified, is the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)
Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½ULP. In a directed rounding mode, such as toward-infinity or toward-zero, u is ULP. (Currently, I am not otherwise considering directed rounding modes.)
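For concreteness, these quantities can be inspected in Python, assuming its float is IEEE-754 binary64 (math.ulp and math.nextafter require Python 3.9 or later):

```
import math

ULP = math.ulp(1.0)   # gap between 1 and the next greater representable value
u = ULP / 2           # unit round-off under round-to-nearest

print(ULP == 2.0**-52)                         # True for binary64
print(math.nextafter(1.0, 2.0) - 1.0 == ULP)   # True
```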
Ordering by least bit is exact.
Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2^−5, its trailing one bit is 0.001•2^−5 = 2^−8.
Suppose we use this algorithm to sum the numbers (a code sketch follows the proof below):
- Let S be a set containing the numbers in the sequence.
- Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
- Replace a and b in S with the computed sum of a and b.
- Repeat until S contains one element.
The number remaining in S is exactly 1, with no error.
Proof:
- If some element a in S has a trailing one bit that is less than 1 and not greater than the trailing one bit of any other element in S, there must be some other element b in S with the same trailing one bit. This is necessary because the exact sum of all the elements is 1, which has a 0 in that bit position and in every lower position, so the number of elements with a 1 in that position must be even; an odd number of 1 bits would leave a 1 in that position of the sum.
- The sum of two numbers is at most twice the greater number, so the leading bit of the exact sum of a and b is at most one position higher than the higher of their leading bits. And the 1 bits in their common trailing position cancel, carrying upward, so the trailing one bit of the sum is at least one position higher than theirs. Therefore the width of the exact sum, from leading bit to trailing one bit, is at most the width of the wider operand, so the sum is exactly representable, and the computed sum is exact.
- So all arithmetic operations in the algorithm are exact, so the final sum is exact.
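Here is a sketch of the procedure in Python, assuming binary64 floats; the names trailing_one_bit and sum_by_least_bit are mine, and the min-heap is merely one convenient way to select the two elements with the least trailing one bits:

```
import heapq
import math

def trailing_one_bit(x):
    # Position value of the least significant 1 bit of x, e.g.
    # trailing_one_bit(1.375 * 2.0**-5) == 2.0**-8 (1.011 in binary, scaled).
    m, e = math.frexp(x)        # x == m * 2**e with 0.5 <= m < 1
    sig = int(m * 2.0**53)      # integer significand (binary64 has 53 bits)
    e -= 53
    while sig % 2 == 0:         # strip trailing zero bits
        sig //= 2
        e += 1
    return math.ldexp(1.0, e)   # 2**e

def sum_by_least_bit(values):
    # Repeatedly replace the two elements whose trailing one bits are
    # least by their sum; a min-heap keyed on the trailing one bit
    # selects them efficiently.
    heap = [(trailing_one_bit(v), v) for v in values]
    heapq.heapify(heap)
    while len(heap) > 1:
        _, a = heapq.heappop(heap)
        _, b = heapq.heappop(heap)
        s = a + b
        heapq.heappush(heap, (trailing_one_bit(s), s))
    return heap[0][1]

print(sum_by_least_bit([0.5, 0.25, 0.125, 0.0625, 0.03125, 0.03125]))  # 1.0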
Sorting by magnitude
Suppose the numbers are sorted in ascending order, and we add them sequentially: S_0 = s_0, S_1 = S_0 + s_1, S_2 = S_1 + s_2, … S_(n−1) = S_(n−2) + s_(n−1). Here, the S_i represent exact mathematical sums. Let T_i be the values we get by performing the same algorithm using floating-point arithmetic. This is a well-known technique for reducing the rounding error when computing the sum of a set of numbers.
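In code, this is just an ascending sort followed by left-to-right accumulation; a minimal Python sketch, with sum_ascending being my name for it:

```
def sum_ascending(values):
    # Computes the T_i of the text: floating-point partial sums taken
    # in ascending order; only the last one is returned.
    total = 0.0
    for v in sorted(values):
        total += v
    return total
```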
A bound on the error in the addition that produces each T_i is uT_i. For this initial analysis, I will assume uS_i adequately approximates uT_i. Then a bound on the total error is uS_i summed over i, excluding i = 0 (since there is no error in setting S_0 to s_0; the first addition occurs with S_1), which I will write sum(uS_i).
This equals u•sum(S_i). And sum(S_i) = S_1 + S_2 + … + S_(n−1) = (s_0 + s_1) + (s_0 + s_1 + s_2) + … = (n−1)•s_0 + (n−1)•s_1 + (n−2)•s_2 + … + 1•s_(n−1). Permitting any real values in [0, 1], this sum is maximized when s_0 is 0 and the remaining s_i are all 1/(n−1). This cannot be achieved given our requirement that the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of T_i with S_i).
Then sum(S_i) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½ • (1/(n−1) + 1) • (n−1) = ½n, and our bound on the error is u • ½n = ½un.
Thus, in adding a million numbers, we can expect a maximum error around 250,000 ULP, since ½un = ¼n ULP. (Again, we have approximated by assuming the S_i stand in for the T_i.) Typical error would of course be much smaller.
Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2^149 when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2^147 ULP. However, the constraints of the problem necessitate that each s_i is exactly 2^−149, in which case T_(2^24−1) is 2^−125 (no error has yet occurred), but then each subsequent T_i is also 2^−125 due to rounding (adding 2^−149 to 2^−125 yields 2^−125 at this precision), so the final result is 2^−125, meaning the error is 1 − 2^−125, about 2^23 ULP, which is much less than 2^147 ULP.
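The stagnation step is easy to verify directly, assuming NumPy is available to provide binary32 arithmetic:

```
import numpy as np

tiny = np.float32(2.0**-149)   # smallest positive binary32 value
big = np.float32(2.0**-125)    # the partial sum reached after 2^24 terms

# tiny is half an ULP of big, so the sum is a tie, and
# round-to-nearest-ties-to-even keeps the even candidate: big itself.
print(big + tiny == big)       # True
```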
Arbitrary ordering
Suppose we have no control over the order and must add the numbers in the order given. Each of the first n−2 additions produces a sum less than 1, and the maximum error in an addition whose result is less than 1 is ½u, so the total error in the first n−2 additions is at most ½u(n−2). The final addition might produce 1, so its rounding error is bounded by u, not necessarily ½u, so the total error is bounded by ½u(n−2) + u = ½un.
Discussion
The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted-addition analysis uses the round-off error u algebraically, in uS_i, while the arbitrary-ordering analysis takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u: it uses the bottom of the interval, whereas the sorted-addition analysis scales the error in proportion to the partial sum. Thus, the sorted-addition derivation could be improved. Additionally, its worst case has all the numbers equal, in which case sorting provides no benefit. In general, sorting will reduce the errors.
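As a rough experiment in support of that expectation, we can construct sequences of dyadic rationals whose exact sum is 1 and compare the two orderings; random_unit_partition is my construction, and results vary from run to run:

```
import random

def random_unit_partition(n, bits=30):
    # n positive multiples of 2**-bits whose exact mathematical sum is 1:
    # cut the integer range [0, 2**bits] at n-1 distinct points and take
    # the gap widths. Each gap scaled by 2**-bits is exact in binary64.
    total = 1 << bits
    cuts = sorted(random.sample(range(1, total), n - 1))
    gaps = [b - a for a, b in zip([0] + cuts, cuts + [total])]
    return [g * 2.0**-bits for g in gaps]

values = random_unit_partition(10**6)
print(abs(sum(values) - 1.0))          # error, given order
print(abs(sum(sorted(values)) - 1.0))  # error, ascending order
```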
The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.