What is the minimum number I need to add to get Infinity for 1 byte floating point

Question

I'm trying to understand what minimum number I need to add to get Infinity because of overflow. I've read this answer already. So let me just clarify my understanding here. To simplify, I'll be working with 1 byte floating point with 4 bits for exponent and 3 bits for mantissa:

0 0000 000

The maximum positive number I can store in it is this:

0 1110 111

which is when converted to scientific notation:

   1.111 x 2^{7} = 11110000

Is my understanding correct that the minimum number I should add to get Infinity is 00010000:

       11110000
+      00010000
        --------
     1 00000000

As I understand anything less than 00010000 will not cause overflow and the result will be rounded to 11110000. But the 00010000 is 0 0000 001 in floating point format, and it's the number 1. So is adding just 1 enough to cause overflow?

score 2 · Accepted Answer · edited May 23 '17 at 12:03

2

The answer is given in the other answer to the question you link to. The smallest value which will round to infinity is:

c = 2⁷ × ( 2 − ½ × 2^1-4 ) = 1.9375 × 2⁷ = 1.1111₂ × 2⁷

So the smallest value that you can add to get infinity is

c − f_max = 1.1111₂ × 2⁷ − 1.111₂ × 2⁷ = 0.0001₂ × 2⁷ = 2³

which, if I understand correctly, would have bit pattern 0 1010 000 in your proposed format.

UPDATE: so why is it this particular cutoff?

Suppose that there was another binade above this one, then the next floating point number would be

x = 1.000₂ × 2⁸

Note that c is the value that is exactly halfway between x and f_max. In other words, the values which would round up to x are instead rounded to infinity, but the values which would round down to f_max still round to the same value.

edited May 23 '17 at 12:03

Community

1
1

answered Oct 18 '16 at 20:08

Simon Byrne

7,694
1
26
50

thanks, but the number you calculated `2^3=1000` if added to `1111 0000` will produce `1111 1000`, which doesn't seem to cause overflow, simply the fifth bit is set to 1 and the number will be rounded to max_value. I would expect that overflow to turn entire mantissa to zeros with `1` going over the available bits. Just like for example if you add two 1 byte binary numbers `1111 1111` and `0000 0001` and you get an overflow `1 0000 0000` (9 bits can't be put, so the first 1 is discarded). Is it not so with overflow in floating point? – Max Koretskyi Oct 20 '16 at 05:11
what is max_value? – Simon Byrne Oct 20 '16 at 08:34
in my 1 byte floating point it's `0 1110 111`. I'm assuming that when floating point adds two numbers, it converts it from scientific from into plain representation, like `0.001 x 2^2 + 0.011 x 2^1 = 0.1+0.11` and adds them like this, then it converts back to scientific form and to floating point representation. Maybe this is where my confusion mentioned in the previous comment comes from – Max Koretskyi Oct 20 '16 at 09:52
I think you're trying to think of it in terms of integer operations: floating point overflow is different than integer overflow. If you want to learn how floating point addition is performed, look at something like this http://pages.cs.wisc.edu/~markhill/cs354/Fall2008/notes/flpt.apprec.html – Simon Byrne Oct 20 '16 at 12:06
yeah, mostly I'm trying to understand the rationale behind the formula specified in the answer your linked and the phrase 'you need to increase MAX_VALUE enough to actually affect the mantissa' in the answer I linked to. I was thinking that affecting is mantissa is actually causing a integer like overflow. What's your take on `affecting the mantissa`? – Max Koretskyi Oct 20 '16 at 13:11
I think that answer is wrong (or, at the very least, unhelpful). Another way to think about it is how it would round if there was another binade above this one. – Simon Byrne Oct 20 '16 at 13:28
So you would suggest just remember the formula and not bother why it's such? – Max Koretskyi Oct 20 '16 at 13:46
I've updated the answer, hopefully that explains it further. – Simon Byrne Oct 20 '16 at 14:04
thanks a lot, I'll go though your explanation and I'll probably get back to you with question :) – Max Koretskyi Oct 20 '16 at 14:48
_Suppose that there was another binade above this one_ - can you elaborate a bit? there's not much on _binade_ term on the web – Max Koretskyi Oct 27 '16 at 07:28
A binade is a set of all floating point numbers with same exponent, see https://en.wikipedia.org/wiki/Binade – Simon Byrne Oct 27 '16 at 20:32
yeah, I've seen that one sentence definition, didn't much help. _another binade above this one_ - what are these two numbers? – Max Koretskyi Oct 28 '16 at 05:48
In this case, it's the set of floating point numbers with exponent of 2^8 – Simon Byrne Oct 28 '16 at 12:25

What is the minimum number I need to add to get Infinity for 1 byte floating point

1 Answers1