What is a suitable tolerance for expecting n floats to sum to 1

Question

I have n floats which should sum to 1, each of which can take any value between 0 and 1. I want to check that they sum to 1 but given inaccuracies with FP numbers, checking that the sum == 1 will not always work.

Therefore I want to check that Math.abs(sum - 1) < epsilon, but I am not sure what a good value of epsilon should be. Is it possible to determine the maximum inaccuracy?

UPDATE: I understand why floats are imprecise and I know I can handle probabilities in a more precise way. I just want to understand a little more about floats in particular.

UPDATE 2: The floats DO sum to 1 if I consider them mathematically (eg 0.1 + ... + 0.1)

Possible duplicate of [Manipulating and comparing floating points in java](https://stackoverflow.com/questions/2896013/manipulating-and-comparing-floating-points-in-java) — mkjh, Jan 02 '19 at 08:13
What is the purpose of your application, float and double are for scientific calculations, which is, they count significant digits and add/sum with respect to those significant digits. If you want precision use int log or for bigger numbers use BigDecimal — Murat Güvenç, Jan 02 '19 at 08:53
The other question asks how to solve a precision problem, I want to understand how imprecise the result can be which is why I think it's a different question. — PolyglotPiglet, Jan 02 '19 at 09:39
@talex: The problem can be solved using the precision as a parameter (e.g., expressed as some multiple of the “epsilon” of the format), so it is not necessary to provide a numerical value for it. — Eric Postpischil, Jan 02 '19 at 12:32
@mkjh: This is not a duplicate of [that problem](https://stackoverflow.com/questions/2896013/manipulating-and-comparing-floating-points-in-java). That question, depending on how it is interpreted, asks either for the error in a single multiplication (which is ½ ULP) or the error in general computations (which is not answerable as a general question; error may range from zero to infinity and NaN). This question poses a constrained answerable question. — Eric Postpischil, Jan 02 '19 at 12:35
@PolyglotPiglet: (a) By “n floats which should sum to 1,” do you mean you have n floating-point objects whose current values sum to exactly 1 when computed with exact mathematics (and thus the only problem is to add them correctly), or do you mean you have n floating-point objects that are the result of prior computations with some errors, and they would sum to 1 if those prior computations had no errors? The former question is answerable; the latter question is not without further information. (b) Do you have any choice about the order in which the numbers are added. E.g., may we sort them? — Eric Postpischil, Jan 02 '19 at 12:37
This is an interesting question (and I am working on an answer). Let be the ULP of 1. If n ≤ 2/+1, a bound on the error is (n-1)/4, and this is achievable when n is 1 modulo 4 with s[0] (the first element of the sequence to be added) = 1-(n-1)/4 and s[1] through s[n-1] = /4. For n ≠ 1 modulo 4, the best bound might be slightly smaller, but I am not sure yet. For n > 2/+1, a smaller bound might be provable. Also, these speak to the greatest magnitude negative errors. The problem is asymmetric, and the greatest positive errors may be smaller (but not larger). — Eric Postpischil, Jan 02 '19 at 15:24
@EricPostpischil There are n floating point numbers which sum to 1 mathematically. Eg 10 0.1s Thank you for your feedback. It's much appreciated. — PolyglotPiglet, Jan 02 '19 at 15:44
@EricPostpischil I'm making this question up so you may sort them if you like :) — PolyglotPiglet, Jan 02 '19 at 15:48
If the numbers are sorted so that s[0], s[1],… s[n-1] are in ascending order, and we add them, so the mathematical sums are S[1] = s[0]+s[1], S[2] = S[1] + s[2], or the actual computed sums are T[1], T[2], and so on, then the error in each computed sum is bounded by u•T[i], where u is the unit round-off (½ULP(1) for round-to-nearest), and the total error is bounded by u times the sum of the T[i]. Since s[0] is a component of n-1 S[i], s[1] is a component of n-2 of them, and so on, the sum of the S[i] is (n-1)s[0] + (n-2)s[1] +…+ 1s[n-1]. This is maximized when s[0] is large as possible. Since… — Eric Postpischil, Jan 02 '19 at 18:59
they are in increasing order, the maximum occurs when all the s[i] are equal. Then s[i] = 1/n, so S[i] = (i+1)/n, and sum(S[i]) = (2+n)•(n-1)/(2n). Assuming the T[i] are nearly the S[i], the error bound is nearly u•(2+n)•(n-1)/(2n), or about un/2. That is a general bound covering all situations; it can be improved for various cases. — Eric Postpischil, Jan 02 '19 at 19:04
For 32-bit float, a worse case occurs when n = 2^149 and all the numbers are 2^−149 (which is subnormal, the minimum positive normal). After 2^24 numbers have been added, computed sum is 2^−125, after which adding more instances of 2^−149 has no effect due to the precision, so the final sum after 2^149−1 additions would be 2^−125. Of course, nobody would expect to complete 2^149−1 additions within the lifetime of the universe, so this is not a practical situation. The problem ought to be partitioned into reasonable cases. — Eric Postpischil, Jan 02 '19 at 19:09

Hulk · Answer 1 · 2019-01-02T13:59:41.733

1

Yes, it is possible to determine the maximum possible "error" made by using fixed precision floating point numbers for addition.

However, the result of this analysis would not be a useful value for an epsilon.

Consider the following example:

float a = 1.0E-37f;
float b = Float.MIN_VALUE;

System.out.println((a+b) == a);
System.out.println(a);
System.out.println(1.0f - a);

This prints:

true
1.0E-37
1.0

So, if you perform an arbitrary number of additions of Float.MIN_VALUE to 1.0E-37f, the difference of the sum to 1.0f will still be 1.0f.

This shows that the maximum error introduced by finite precision floats for a sum that would add up to 1.0 when using infinite precision is actually 1.0f - obviously, that is not useful as an epsilon. Determining a useful epsilon is only possible when the requirements for what is "good enough" are known - this depends on what you want to use the result for, and cannot be answered in general.

Of course, the example above is a bit contrived in that it highlights a worst-case scenario that may not be possible in your case, depending on the actual values of n.

As @EricPostpischil mentions in the comments,

the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½).

As you can see from this formula, it takes a large value of n to come to an error of 1.0. (You may have noticed that I chose relatively small values in my example - it takes a lot of them to add up to 1).

Putting in some numbers:

int n = 33554433;
System.out.println((n-1)*0.5*Math.ulp(0.5f));

Yields

1.0

If your n is much lower, or you know about further constraints of your input numbers you did not yet tell us about, it may be possible to get a much more useful upper boundary for the error.

My point still stands, however - while knowing about this upper bound may be useful, it cannot be used as an epsilon for verifying a value is "good enough" in general.

edited Jan 02 '19 at 13:59

answered Jan 02 '19 at 08:45

Hulk

6,399
1
30
52

This is not correct; the maximum error is not 1, and an arbitrary number of additions is not possible because the problem states there are n numbers to be added. An easily proved bound comes from the fact the maximum error of adding any two numbers in [0, 1] is ½ ULP(½), and there are n−1 additions, so the total error is at most (n−1)•½•ULP(½). (ULP(½) is half the “epsilon” for the format.) However, it may be possible to improve this bound by showing that the constraint that the n numbers sum to 1 prevents their values from always obtaining the maximum error of a single addition. – Eric Postpischil Jan 02 '19 at 12:43
First, the problem is parameterized by n; the question asks for a bound given n, not for a bound for the entire set of all possible n. Thus, for n that are not large, an error bound is (n−1)•½•ULP(½), and 1 is not obtainable. Second, for large n, the bound is different, but still less than 1. Given n floating-point values in [0, 1] whose sum is 1, at least one of them must be positive, and therefore the sum computed with floating-point must be positive, and therefore the difference between 1 and the computed sum must be less than 1. So the error can never be 1. This answer is incorrect. – Eric Postpischil Jan 02 '19 at 13:58
Yes, given some positive numbers `a` and `b`, the computed sum of `a+b` might be `a`, and so the error is 100% of `b`. But that is not what the question asks. – Eric Postpischil Jan 02 '19 at 13:59
@EricPostpischil I edited the answer. As I read it, the question has two parts. First, is it possible to determine the maximum error (answer: yes), and second, is that a useful epsilon for accepting or rejecting a result, and my answer to that one is "No, not in general". – Hulk Jan 02 '19 at 14:03
The proof given above that the error is never 1 is just a start, a proof of principle. Even for very large n, I suspect a significant bound can be proven. There is no error when adding zero, so we can consider only the number of non-zero values. If this number is very large, many of the numbers are very small, and the error of adding them is very small. There is likely a significant bound on the total error possible. The question asked by OP is interesting and should not be given short shrift. – Eric Postpischil Jan 02 '19 at 14:03
In fact, the bound (n-1)•½•ULP(½) is likely useful for practical cases, as it is somewhat tight for reasonable n, and somebody using 32-bit floating-point likely has a small n (with a large n, they would be operating with some many numbers the errors of 32-bit floating-point would already be a problem for them), while somebody using 64-bit floating-point n does not have an n large enough for this bound to be a problem because that would be too many numbers to add in reasonable time. – Eric Postpischil Jan 02 '19 at 14:10
@EricPostpischil it is true that many of the values need to be small in order to trigger the worst case error, and much of the error can be avoided by sorting the values in ascending order first (or several other algorithmic tricks for improving numeric stability), but I feel that this is beyond the scope of this question. Feel free to add an answer of your own, however, as this *is* an interesting topic. – Hulk Jan 02 '19 at 14:13

score 1 · Answer 2 · answered Jan 03 '19 at 21:53

This question raises a number of cases that are interesting to explore. I have written some material below, but it is only a starting point.

In this answer, I use the conditions:

Numbers are represented in and arithmetic is performed in some IEEE-754 binary floating-point format using round-to-nearest-ties-to-even.
We have a sequence of numbers s₀, s₁, s₂,… s_n−1, in that format each of which is in (0, 1] and whose exact mathematical sum is 1.

(Note that zero is excluded from the interval. The presence of zeros will not affect any eventual sum, as adding zero will not change a value, so any sequence containing zeros may be reduced to a sequence without them by removing the zeros.)

Definition: is the difference between 1 and the next greater representable value, also called the Unit of Least Precision (ULP). (Note that ULP is often the unit of least precision for 1, while ULP(x) is the unit of least precision for a number at the magnitude of x, that is, scaled by the floating-point exponent.)

Definition: u is the unit round-off, the greatest error that may occur due to rounding for numbers in [1, 2). In a round-to-nearest mode, u is ½. In a directed rounding mode, such as toward-infinity or toward-zero, u is . (Currently, I am not otherwise considering directed rounding modes.)

Ordering by least bit is exact.

Definition: The trailing one bit of a number is the position value of the least significant 1 in its representation. For example, if a binary floating-point number is 1.011•2⁻⁵, its trailing one bit is 0.001•2⁻⁵ = 2⁻⁸.

Suppose we use this algorithm to sum the numbers:

Let S be a set containing the numbers in the sequence.
Select any two elements a and b in S whose trailing one bits are not greater than those of any other elements.
Replace a and b in S with the computed sum of a and b.
Repeat until S contains one element.

The number remaining in S is exactly 1, with no error.

Proof:

If some element a in S has a trailing bit that is less than 1 and not greater than the trailing bit of any other element in S, there must be some other element b in S with the same trailing bit. This is necessary because the sum of all the elements is 1, so the sum of the least non-zero bits must be zero in that position—there cannot be an odd number of 1 bits.
The sum of two numbers is at most twice the greater number, so the leading bit of the sum of a and b is at most one higher in position than the leading bit of either. But their trailing one bits sum to zero, so the trailing bit of the sum is at least one position higher, so the width of the exact sum is at most the width of the wider operand, so it is exactly representable, and the computed sum is exact.
So all arithmetic operations in the algorithm are exact, so the final sum is exact.

Sorting by magnitude

Suppose the numbers are sorted in ascending order, and we add them sequentially: S₀ = s₀, S₁ = S₀ + s₁, S₂ = S₁ + s₂,… S_n−1 = S_n−2 + s_n−1. Here, the S_i represent exact mathematical sums. Let T_i be the values we get by performing the same algorithm using floating-point arithmetic. This is well-known technique for reducing the rounding error when computing the sum of a set of numbers.

A bound on the error in the addition that produces each T_i is uT_i. For this initial, analysis, I will assume uS_i adequately approximates uS_i. Then a bound on the total error is uS_i summed over i, excluding i=0 (since there is no error in setting S₀ to s₀; the first addition occurs with S₁), which I will write sum(uS_i).

This equals u sum(S_i). And sum(S_i) = S₁ + S₂ + … S_n−1 = (s₀) + (s₀ + s₁) + (s₀ + s₁ + s₂) + ... = (n−2)•s₀ + (n−2)•s₁ + (n−3)•s₂ + 1•s_n−1. Permitting any real values in [0, 1], this sum is maximized when s₀ and the remaining s_i are all 1/(n−1). This cannot be achieved given our requirement the values are in (0, 1], and the constraints of the floating-point format may also prevent it, but it gives a greater sum than the floating-point values could provide, so it remains a valid bound (neglecting the earlier identification of T_i with S_i).

Then sum(S_i) is the sum of an arithmetic sequence of n−1 numbers from 1/(n−1) to 1, so their sum is ½(1/(n−1) + 1) • (n−1), and our bound in the error is u • ½(1/(n−1) + 1) • (n−1) = ½un.

Thus, in adding a million numbers, we can expect a maximum error around 250,000 . (Again, we have approximated by assuming the S_i stand in for the T_i.) Typical error would of course be much smaller.

Note that this is only a derivation of one bound. Other considerations not examined above may limit the error further. For example, if n is 2¹⁴⁹ when using IEEE-754 basic 32-bit binary floating-point, the bound above would be 2¹⁴⁷. However, the constraints of the problem necessitate that each s_i is exactly 2⁻¹⁴⁹, in which case T_2²⁴−1 is 2⁻¹²⁵ (no error has yet occurred) but then each subsequent T_i is also 2⁻¹²⁵ due to rounding (adding 2⁻¹⁴⁹ to 2⁻¹²⁵ yields 2⁻¹²⁵ due to the precision), so the final result is 2⁻¹²⁵, meaning the error is 1−2⁻¹²⁵, which is much less than 2¹⁴⁷.

Arbitrary ordering

Suppose we have no control over the order and must add the numbers in the order given. The maximum error in adding any two numbers in (0, 1) is ½u, so the total error in the first n−2 additions is ½u(n−2). The final addition might produce 1, so its rounding error is bound by u, not necessarily ½u, so the total error is bound by ½u(n−2) + u = ½un.

Discussion

The same bound was obtained for the sorted addition as for the arbitrary ordering. One cause is that the sorted addition uses the round-off error u algebraically, in uS_i, while the arbitrary ordering takes advantage of the fact that all numbers in [½, 1) have a round-off error of at most ½u—it uses the bottom of the interval, whereas the sorted addition passage uses a proportion. Thus, the sorted addition derivation could be improved. Additionally, its worst case has all numbers equal, meaning the sorting has no benefit. In general, sorting will improve the errors.

The behaviors of negative errors and positive errors will be asymmetric, so they should be considered separately.

Note that , the distance between 1 and its successor, i.e. ulp(1), is called the *machine epsilon*. ULP means *unit in the last place*; it is always relative to some number , thus one should write "ulp()", but when the number is implicit (or the binade is fixed), one can just write "ulp". — vinc17, Sep 24 '19 at 14:54

What is a suitable tolerance for expecting n floats to sum to 1

2 Answers2

Ordering by least bit is exact.

Sorting by magnitude

Arbitrary ordering

Discussion