How is addition of double is done here and a valid way to detect that the result could not be represented

Question

I am trying to understand double a bit better. In the following code snippet min and max are double:

double min = 3.472727272727276;
double max = 3.4727272727272767;
System.out.println(max - min);  
System.out.println((max - min)/2);
double mid = min + ((max - min)/2);
if(min == mid) {
    System.out.println("equal");
}
System.out.println(mid);

The first 2 print statements print:

4.440892098500626E-16
2.220446049250313E-16

Which basically is:
0.0000000000000004440892098500626 and 0.0000000000000002220446049250313

Then the conditional check is true i.e. prints equal and the last print is: 3.472727272727276

So from my understanding the (max - min)/2 gave a value that could be represented by a double.
What is not clear to me is what is happening during the addition.

Is the addition creating a number that could not be represented by a double and leaves the original min as is by dropping of digits or is the number effectively considered as 0 before the addition actually happens or how exactly is this done?
Is the min == mid a valid check to detect such issues with doubles? I.e. with integer we can detect overflow/underflow by checking if the result is less that we started with. Is the equality check after doing an addition a sane/reasonable check to detect the equivalent problem with double i.e. that the number added was not really enhanced due to rounding error (or what exactly is the actual term for this)?

Does this answer your question? [What's wrong with using == to compare floats in Java?](https://stackoverflow.com/questions/1088216/whats-wrong-with-using-to-compare-floats-in-java) — WJS, Aug 03 '20 at 14:13
@WJS: No because I know why we are not supposed to use `==`. The part about the `==` is only about this particular case which I am trying to understand if it is the direct equivalent of overflow/underflow. — Jim, Aug 03 '20 at 14:16
Have you searched the site for the limitations of floating point math? This topic has been broached many times. — WJS, Aug 03 '20 at 14:30
This site, Stack Overflow. And I meant to say related topics. — WJS, Aug 03 '20 at 15:24
Re “Is the equality check after doing an addition a sane/reasonable check to detect the equivalent problem with `double` i.e. that the number added was not really enhanced due to rounding error (or what exactly is the actual term for this)?”: What are you trying to detect? There was no overflow or underflow here. There was a rounding error. For the most part, rounding errors are considered normal in floating-point code. Generally, you cannot detect them with a simple comparison. You could in this case because you were adding two adjacent representable numbers. But that is an unusual case. — Eric Postpischil, Aug 03 '20 at 15:54
Floating-point hardware generally does offer a way to detect rounding errors. An exception can be raised when a result is not exact. I do not believe Java provides access to this information. Exceptions for rounding errors are rarely used because the overwhelmingly prevalent use of floating-point arithmetic is to approximate real arithmetic, so rounding errors are expected and accepted. Only rarely is floating-point arithmetic used to perform exact calculations, and it requires some amount of expertise. — Eric Postpischil, Aug 03 '20 at 15:56
@EricPostpischil: `What are you trying to detect? ` e.g in case of binary search with floats I was wondering if that kind of detection for overflow is valid or not. — Jim, Aug 03 '20 at 20:32
@Jim: There is no overflow here. Overflow is when the result of an operation exceeds in magnitude the largest representable finite number (after allowing for the normal rounding that occurs). Overflow happens with large numbers. There is no overflow with numbers near 3.47. Nor is there underflow. Underflow is when a result is less than the smallest representable normal value (possibly allowing for normal rounding), so that the normal full precision of the format is not available. You may be thinking of something like inexactness. — Eric Postpischil, Aug 03 '20 at 20:42
@Jim: Or possibly you are thinking of a specific case of inexactness where one operand is such that the result is the other operand, unchanged by the operation, such as adding a tiny number to a large number, so the result is the large number, unchanged. There is no specific term for that as far as I know. To explain what you are trying to detect, explain how you would make use of the detection result. What would your code do with it? How would it affect your program? — Eric Postpischil, Aug 03 '20 at 20:44
In this particular case, since there is a range min/max and mid, it effectively is a binary search (loop) for double. For binary search usually we stop as soon as min (or left) reaches max (or right). That code posted can only stop by doing that equality check i.e. that `mid` has not increased over `min` after the addition. But I wasn't sure if that is a valid check in general and wanted to understand this specific situation and behavior better (I do know in general not to compare doubles for equality) — Jim, Aug 03 '20 at 21:02

score 2 · Answer 1 · answered Aug 03 '20 at 15:37

2

For this example, it is easy to see what is happening by viewing the numbers in a hexadecimal floating-point format. The result of converting the source text 3.472727272727276 to double is 3.47272727272727621539161191321909427642822265625, which, using hexadecimal, is:

    1.BC8253C8253D0₁₆•2¹

Observe there are exactly 53 bits in the significand—one before the “.” and 52 in 13 hexadecimal digits after it. The double format has one bit for the sign, 11 for the exponent, and 53 for the significand. (52 are stored explicitly; one is encoded via the exponent.)

Converting the source text 3.4727272727272767 to double yields 3.472727272727276659480821763281710445880889892578125, which is:

    1.BC8253C8253D1₁₆•2¹

Now we can easily see what happens with arithmetic on them. Their difference is:

    0.0000000000001₁₆•2¹

When we normalize that, it is 1.₁₆•2¹⁻⁵² = 1.₁₆•2⁻⁵¹ ≈ 4.44•10⁻¹⁶, and the double format can easily represent half of that simply by adjusting the exponent. Then we have 1.₁₆•2⁻⁵² ≈ 2.22•10⁻¹⁶.

However, when we try to add that halved difference to the first number, the result with real-number arithmetic is:

    1.BC8253C8253D08₁₆•2¹

Observe this has 54 bits—one before the “.”, then 52 in 13 hexadecimal digits, and a final one in the high bit of that 14^th digit, the 8. The double format does not have 54 bits in its significand, so addition in double format cannot produce this result. Instead, the sum is rounded to the nearest representable value or, in case of a tie, to the nearest representable value with an even low bit. So the result is 1.BC8253C8253D08₁₆•2¹, which is the same as min.

answered Aug 03 '20 at 15:37

Eric Postpischil

195,579
13
168
312

What do you mean by `converting the source text 3.472727272727276 to double`? What is "source text" and how did you do the conversion? – Jim Aug 03 '20 at 20:31
@Jim: Source text is the characters in your program source code. The conversion is done by the compiler. “3.472727272727276” is a numeral for the number 3.472727272727276, but that number is not representable in the `double` format. The nearest representable value is 3.47272727272727621539161191321909427642822265625, so that is what the compiler produces when it compiles `3.472727272727276` in the source code. – Eric Postpischil Aug 03 '20 at 20:37
I see. I guess this goes the other way round because I got that number via the debugger output having that code snippet run in a loop. The debugger printed the source text version of the number you wrote right? – Jim Aug 03 '20 at 20:59
@Jim: Debuggers and all sorts of common printing defaults often show just an approximation of the internal value. Some just use a fixed precision, such as 15 digits (but removing trailing zeros after rounding to 15 digits). Java’s default is to print just enough digits to uniquely distinguish the number from the neighboring representable values. – Eric Postpischil Aug 03 '20 at 21:21
How did you figure from that text source what is the actual binary representation i.e. the longer digit? – Jim Aug 04 '20 at 11:51
Also what do you mean by `When we normalize that, it is`? – Jim Aug 04 '20 at 11:53
@Jim: In this context, normalizing means to adjust the significand (the *x* in *x*•2^*e* so that it is at least 1 but less than 2 by multiplying or dividing it by a power of two and to adjust the exponent (*e*) accordingly. So with 0.0000000000001•2^1, we multiply the significand by 2^52 to change 0.0000000000001 into 1.0000000000000, and we add −52 to the exponent 1 to make it −51, giving 1.•2^−51. The number is identical mathematically, but the normal form corresponds to how it is represented in a floating-point format. – Eric Postpischil Aug 04 '20 at 12:26
@Jim: To get the actual `double` value from the source text, I used a compiler known to convert numbers correctly (Apple’s distribution of Clang). It can be done mathematically by hand, but that would be tedious. Basically, one would take 3.472727272727276, recognize that it is between 2^1 and 2^2, so, in the `double` format, its high bit has position value 2^1, so its low bit must have position value 2^(1-52) (because there are 53 bits in the significand of `double`). So then we multiply 3.472727272727276 by 2^(1-52). At that scale, the significand of a `double` must be a 53-bit integer… – Eric Postpischil Aug 04 '20 at 12:29
… So we round to the nearest integer. Then we renormalize by dividing by 2^(1-52), and that is the `double` value nearest 3.472727272727276. – Eric Postpischil Aug 04 '20 at 12:29

Joni · Answer 2 · 2020-08-03T17:45:44.057

Is the addition creating a number that could not be represented by a double

The algorithm for adding two floating point numbers as a first step brings the two numbers to the same exponent. Effectively this is done by shifting the bits of the smaller number to the right, and the bits that underflow are lost (become zero).

If the calculation is done with 64-bit precision,

3.472727272727276 + 2.220446049250313E-16     or in hex:
0x1.bc8253c8253dp1 + 0x1.0p-52

in effect becomes the calculation

3.472727272727276 + 0.0     or in hex:
0x1.bc8253c8253dp1 + 0x0.0p1

and this happens in hardware, so the intermediate 0.0 value is not stored anywhere or visible as a separate step.

But: it's possible the calculation is done with higher precision than 64 bits. For example if 80-bit precision floating point CPU instructions are available, the JVM is allowed to use them. In that case the intermediate results will be different, but the end result is still going to be the same because the result has to be stored as a 64-bit double.

Is the min == mid a valid check to detect such issues with doubles?

Depends on what you need to do. The == operator checks if the two numbers are exactly equal, for better or for worse. In many places people don't want exact equality because it's difficult or impossible to achieve: for example Math.sin(Math.PI) is not going to be exactly 0 but you may prefer to pretend it's "close enough" to 0.

@downvoter I welcome any suggestions to improve in this answer — Joni, Aug 03 '20 at 19:40

aheger24 · Answer 3 · 2020-08-03T14:36:12.503

0

The following code may demonstrate the issue:

double num = 1;
while (!Double.isInfinite(num)) {
    num *= 2;
    System.out.println(num);
}
System.out.println("-----------------------");
System.out.println("-- now the opposite----");
System.out.println("-----------------------");
num = 1;
while (num > 0) {
    num /= 2;
    System.out.println(num);
}

The space in the memory is limited by the number of bits. Thus it is inevitable that at some point a very small number will be exactly zero.

In your calculation, the operators act on doubles creating temporary doubles in the CPU - which also fall under the precision limit and thus in your case become zero

And of course, the == operator must be used with diligence on doubles, but that was not the problem here.

To answer your second question, you need to use BigDecimal instead of double to be on the safe side.

The problem with checking is, that the values that any double can assume are not evenly distributed. Between 0 and 1, there is a similar number of values a double can assume than between 1 and Infinity.

EDIT: yes, the result of mid == min is of course a proof that the double precision limit has been reached. But the inverse mid != min does not prove that the limit may have been reached in another step.

In a general program that operates on arbitrary input doubles you would need to do that sort of check with every intermediate calculation result. I think it is not worth the effort compared to using BigDecimal and also you run the risk of forgetting some checks.

edited Aug 03 '20 at 14:36

answered Aug 03 '20 at 14:29

aheger24

79
7

So the `0.0000000000000002220446049250313` is stored in the CPU as is, and then after the addition the output affects digits after the last 6 of `3.472727272727276`? I mean you note `at some point a very small number will be exactly zero.` which I get, but `(max - min)/2 ` hasn't reached 0 yet since it is printed as `2.220446049250313E-16`. – Jim Aug 03 '20 at 14:39
1

So is the number considered as 0 when copied to the register or is the addition resulting in dropping of digits? – Jim Aug 03 '20 at 15:07
for (int i = 0; i < 1000000; i++) { min += 1E-16; System.out.println(min); } It always stays the same. – aheger24 Aug 03 '20 at 15:07
But to me it still leaves the question in my previous comment open. I still am not sure at which point moving to the register or after the addition or moving back to memory the number is reduces to 0 – Jim Aug 03 '20 at 15:10
1

In the register, I think it can be stored, since you can print the nonzero value. But in the computation, the CPU only has the same bit width as the register, so it needs to map. If the number to be mapped is to small, it gets lost there. The rules of the mapping are very important, so that double operations yield the same result no matter what CPU they are executed on. – aheger24 Aug 03 '20 at 15:12
The reasion for the mapping to fail is, because your number is already fairly large. 3,4 is above 1, there the allowed double values thin out. Ok, they thin out much more the closer you get to Infinity. Now consider the number 6E-15, which is not even a small number as exponents go down to the E-324 range. To 6E-15, you can correctly add 1E16, as I will do again in a separate line: – aheger24 Aug 03 '20 at 15:22
x = 0.000000000000006; for (int i = 0; i < 100; i++) { x += 1E-16; System.out.println(x); } – aheger24 Aug 03 '20 at 15:23

How is addition of double is done here and a valid way to detect that the result could not be represented

3 Answers3