How to calculate the range, maximum, minimum value of a floating data type?

Question

I know how to calculate the range, maximum, and minimum value for integer types such as short int, int, long int, and char. For example, if char is 1 byte, then for signed char the minimum value is -2^(1byte*8-1) and the maximum is +2^(1byte*8-1) - 1. The range is then maximum - minimum + 1. But with these formulas I cannot work out the maximum and minimum values for a float type. In C++ the minimum and maximum values for a float type are 3.4*10^(-38) and 3.4*10^(+38). Could someone please explain how to calculate the maximum and minimum values for floating types in a simple way? This question may have been answered in the past, but I didn't understand those definitions, so please describe it in a way that I can understand easily.

It's a little more complicated for floating point numbers, due to the way they are stored. https://stackoverflow.com/questions/21895756/why-are-floating-point-numbers-inaccurate has a pretty comprehensive explanation — Matt, Jul 20 '19 at 05:52

john · Answer 1 · 2019-07-20T06:22:12.850

1

For a double the mantissa (aka the significand) is 53 bits and the exponent is 11 bits. Assuming we calculate the value of the floating point number with the formula m*2^e, where m is a 53-bit integer, the exponent range is [-1074,971]. These values follow from the IEEE 754 standard.

So the maximum value is

(2^53-1)*2^971

and the smallest strictly positive value is

2^-1074

where ^ means to the power of.

I am assuming that the compiler uses the IEEE 754 standard, which isn't required by C++, but in practice is almost always the case.

edited Jul 20 '19 at 06:22

answered Jul 20 '19 at 06:00

john


How did you get 1023 and -1022 – Jul 20 '19 at 06:09
Say a double is 8 bytes, which is 64 bits. You say 53 bits are the significand, so 11 bits remain. From there, how did you derive -1022 and 1023? – Jul 20 '19 at 06:10
@TusharAhmed 1023 and -1022 are the values specified by the IEEE 754 standard. – john Jul 20 '19 at 06:11
@TusharAhmed In a 64 bit double, there's 1 sign bit, 11 exponent bits, and 52 mantissa bits. But the mantissa is *normalised* so in effect there are 53 mantissa bits. – john Jul 20 '19 at 06:13
@BenVoigt I've clarified that. – john Jul 20 '19 at 06:14
@BenVoigt Ooops, hopefully correct now, but I'm going to double check. – john Jul 20 '19 at 06:23

score 1 · Answer 2 · edited Jun 20 '20 at 09:12

This answer discusses only IEEE-754 binary interchange formats.

First, we must understand the format in which floating-point numbers are encoded. IEEE-754 specifies that a binary floating-point number is represented with:

a 1-bit sign S,
a w-bit biased exponent e, and
a p-bit significand f, which is primarily encoded with t = p−1 bits.

The exponent e is encoded by adding a bias, so the actual value E stored in the w bits is E = e + bias. The bias is specified to be 2^(k−p−1) − 1, where k is the width of the format (such as 32 for a 32-bit float).¹ The precision p is specified to be 11 and 24 for the 16-bit and 32-bit widths and k − round(4•log2(k)) + 13 for other widths. Note that k−p = k−(p−1)−1 equals the width of the exponent field, w, as taking the entire encoding (k bits) and removing the significand encoding (p−1 bits) and the sign bit (1 bit) leaves just the exponent encoding, so the bias 2^(k−p−1) − 1 equals 2^(w−1) − 1.

The value of the exponent field that has all ones in its binary representation, 2^w − 1, is reserved for special purposes (NaN and ∞). So the maximum value the field can have for normal numbers is E = 2^w − 2. Then the maximum value the represented exponent can have is e = E − bias = (2^w − 2) − (2^(w−1) − 1) = 2^(w−1) − 1. (The maximum normal exponent value equals bias.) Also, the exponent field of zero is special, and e is specified to be 1 − bias in this case.

The significand f is stored by putting its trailing p−1 bits in the significand field. The leading bit is inferred from the exponent field. If the exponent is not zero and is not the reserved value with all ones, then the significand f is specified to be 1 + T•2^(1−p), where T is the binary number stored by the t bits in the significand field. Note that the largest value of the significand field, when all its bits are set, is 2^(p−1) − 1.

If the exponent is zero, the significand f is specified to be 0 + T•2^(1−p).

When the exponent field does not have the special all-ones value or the zero value, the value represented by this encoding is (−1)^S • 2^e • f. When the exponent field is zero, the value represented is (−1)^S • 2^(1−bias) • f.

Now we can figure out the minimum and maximum values. Of course, the minimum and maximum values representable in this format are −∞ and +∞, and the minimum magnitude is 0. But we are also interested in the minimum non-zero magnitude and the maximum finite number. (The minimum finite number is the negation of the maximum finite number.)

The maximum finite value occurs when the sign bit is zero, the exponent has its largest non-special value and the significand field has all its bits set. Then e = 2^(w−1) − 1, and T = 2^(p−1) − 1. So f = 1 + (2^(p−1) − 1)•2^(1−p) = 2 − 2^(1−p), and the number represented is (−1)^0 • 2^(2^(w−1)−1) • (2 − 2^(1−p)).

For the 32-bit width, w = 8 and p = 24, so the maximum value is 2^(2^(8−1)−1) • (2 − 2^(1−24)) = 2^127 • (2 − 2^(−23)) = 2^128 − 2^104.

The minimum non-zero magnitude occurs when the exponent encoding E has its minimum value, zero, and the significand encoding T has its minimum non-zero value, one. Then the exponent e = 1 − bias, and the significand f = 0 + T•2^(1−p) = 1•2^(1−p) = 2^(1−p). The number represented is (−1)^S • 2^(1−bias) • 2^(1−p).

For the 32-bit format, bias = 127 and p = 24, so the minimum non-zero magnitude is 2^(1−127) • 2^(1−24) = 2^(−149).

Footnote

¹ Only formats of widths 16, 32, 64, and multiples of 32 that are at least 128 are specified.
