How is a double-precision floating-point number stored and calculated?

Question

I'm really curious about how a double-precision floating-point number is stored.

This is what I have figured out so far.

They require 64 bits in memory.
They consist of three parts:
- Sign bit (1 bit long)
- Exponent (11 bits long)
- Fraction (53 bits; the first bit is assumed to be 1, so only 52 are stored. The exception is when all 11 exponent bits are 0: then the leading bit is assumed to be 0)

However, I do not understand what the exponent and the exponent bias are, or all those formulas on the Wikipedia page.

Can anyone explain what all those things are, how they work, and how they are eventually converted to the real number, step by step?

Possible duplicate of http://stackoverflow.com/questions/6535343/c-how-is-double-number-e-g-123-45-stored-in-a-float-variable-or-double-vari Look at the best answer. It is the same thing but with greater number of bits for mantissa and exponent. — Ram, Feb 04 '12 at 15:22
possible duplicate of [Why IEEE754 single-precision float has only 7 digit precision?](http://stackoverflow.com/questions/19130396/why-ieee754-single-precision-float-has-only-7-digit-precision) Answer and first comment explains everything concisely — Ron, Dec 13 '13 at 23:47
How IEEE754 works and how the "real number" (of precision) is calculated are 2 different questions. I would suggest you start with learning how IEEE754 floating point works, then have a look at the above linked question — Ron, Dec 13 '13 at 23:54

score 2 · Accepted Answer · answered Feb 04 '12 at 15:37

Check out the formula a little further down the page:

Except for the above exceptions, the entire double-precision number is described by:

(-1)^sign * 2^(exponent - bias) * 1.mantissa

The formula means that for numbers that are not NaN, infinity, zero or denormal (all of which I'll ignore), you take the bits in the mantissa and add an implicit 1 bit at the top. This makes the mantissa 53 bits long, in the range 1.0 ... 1.111111...11 (binary). To get the actual value, you multiply the mantissa by 2 to the power of the exponent minus the bias (1023), and either negate the result or not depending on the sign bit. The number 1.0 has an unbiased exponent of zero (i.e. 1.0 = 1.0 * 2^0), so its biased exponent is 1023 (the bias is simply added to the exponent). So 1.0 would be sign = 0, exponent = 1023, mantissa = 0 (remember the hidden mantissa bit).

Putting it all together, in hexadecimal the value would be 0x3FF0000000000000 == 1.0.
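To make the formula concrete, here is a minimal C++ sketch that splits a double into its three fields and then applies the formula to get the value back. The names `DoubleFields`, `split` and `rebuild` are made up for this illustration, and the code assumes `double` is a 64-bit IEEE 754 type. It only handles normal numbers, like the explanation above.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper struct: the three raw fields of an IEEE 754 double.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, still biased
    std::uint64_t mantissa;  // 52 stored bits, without the hidden 1
};

// Split a double into its raw bit fields.
DoubleFields split(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &value, sizeof bits);  // well-defined way to read the bits
    return {bits >> 63, (bits >> 52) & 0x7FF, bits & 0xFFFFFFFFFFFFFULL};
}

// Apply (-1)^sign * 2^(exponent - bias) * 1.mantissa to the fields.
// Only valid for normal numbers (exponent field neither 0 nor 2047).
double rebuild(const DoubleFields& f) {
    assert(f.exponent != 0 && f.exponent != 0x7FF);
    const int bias = 1023;
    // The 52 stored bits are a binary fraction, so scale them by 2^-52
    // and add the hidden leading 1.
    double significand = 1.0 + std::ldexp(static_cast<double>(f.mantissa), -52);
    double magnitude = std::ldexp(significand, static_cast<int>(f.exponent) - bias);
    return f.sign ? -magnitude : magnitude;
}
```

With this, `split(1.0)` gives sign 0, exponent 1023 and mantissa 0, and `rebuild(split(x))` returns `x` for any normal `x`. The mantissa bits are treated as an integer that is scaled by 2^-52, which is how the "string of bits" becomes a number.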

Sorry, I don't understand how you can "[...] multiply mantissa by the 2 to the power of the exponent [...]". Mantissa is a "string" of 1s and 0s, and so is the exponent. Should this be preceded by something like "Convert the binary representation of mantissa into a decimal by ... "? — Confounded, Jun 15 '18 at 00:00

score 1 · Answer 2 · answered Feb 04 '12 at 15:49

Sign: 1 if negative, 0 if positive.
Fraction: the significand of the number written in normalized scientific notation in binary (1.xxx... * 2^e).
Exponent: the exponent e such that fraction * 2^e is equal to the number that I want to represent.
Bias: the number that must be subtracted from the stored exponent to get the actual exponent. It is 1023 in double precision and 127 in single precision.

An example (in single precision, because it is more comfortable for me to write =)): to represent -0.75, the binary representation is -0.11 = -11 * 2^-2 = -1.1 * 2^-1, so:

sign = 1
fraction = 1 + .1000....
biased exponent: -1 + 127 = 126 -> 01111110

So we have -0.75 = 1 01111110 10000000000000000000000
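If you want to check an encoding like this yourself, a small C++ sketch such as the following prints the three fields of a float. The function name `float_bits` is invented for this example, and it assumes `float` is a 32-bit IEEE 754 single-precision type.

```cpp
#include <bitset>
#include <cstdint>
#include <cstring>
#include <string>

// Return the 32 bits of a float as "sign exponent fraction",
// separated by spaces. Assumes float is IEEE 754 single precision.
std::string float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof bits);
    std::string s = std::bitset<32>(bits).to_string();  // most significant bit first
    // 1 sign bit, 8 exponent bits, 23 stored fraction bits
    return s.substr(0, 1) + " " + s.substr(1, 8) + " " + s.substr(9, 23);
}
```

Calling `float_bits(-0.75f)` returns the same sign, exponent and fraction pattern as the hand-worked example.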

For addition, you have to align the exponents, and then you can add the fractional parts.
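As a rough illustration of the alignment step, here is a simplified C++ sketch that adds two positive single-precision numbers by hand. The function name `add_positive` is invented for this example. It assumes 32-bit IEEE 754 floats, handles only positive normal inputs, and truncates the bits that are shifted out, whereas real hardware rounds them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <utility>

// Simplified sketch: add two positive, normal floats by hand.
// Bits shifted out are truncated (real hardware rounds), and overflow
// of the exponent is not handled.
float add_positive(float a, float b) {
    assert(std::isnormal(a) && std::isnormal(b) && a > 0 && b > 0);
    if (a < b) std::swap(a, b);  // make a the operand with the larger exponent

    std::uint32_t ba, bb;
    std::memcpy(&ba, &a, sizeof ba);
    std::memcpy(&bb, &b, sizeof bb);

    std::uint32_t ea = ba >> 23, eb = bb >> 23;     // sign is 0, so this is the biased exponent
    std::uint32_t sa = (ba & 0x7FFFFF) | 0x800000;  // 24-bit significand with hidden 1
    std::uint32_t sb = (bb & 0x7FFFFF) | 0x800000;

    // Step 1: align. Shift the smaller operand right until the exponents match.
    std::uint32_t shift = ea - eb;
    sb = shift < 24 ? sb >> shift : 0;

    // Step 2: add the aligned significands.
    std::uint32_t sum = sa + sb;

    // Step 3: renormalize if the sum reached 2.0 or more (a carry into bit 24).
    if (sum & 0x1000000) { sum >>= 1; ++ea; }

    std::uint32_t result = (ea << 23) | (sum & 0x7FFFFF);
    float out;
    std::memcpy(&out, &result, sizeof out);
    return out;
}
```

For example, `add_positive(1.5f, 0.75f)` aligns 1.1 * 2^0 and 1.1 * 2^-1 to 1.1 * 2^0 and 0.11 * 2^0, adds them to get 10.01 * 2^0, and renormalizes to 1.001 * 2^1, which is 2.25.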

For multiplication, you have to:

- add the exponents and subtract the bias
- multiply the fractional parts
- round the result
- set the sign (if both operands have the same sign, sign = 0, else sign = 1)
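The multiplication steps can be sketched in C++ roughly like this. The function name `multiply` is invented for this example, and it is a simplification: it assumes 32-bit IEEE 754 floats and normal inputs, ignores exponent overflow and underflow, and truncates the low product bits instead of rounding them.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Simplified sketch: multiply two normal floats by hand.
// Low product bits are truncated (real hardware rounds), and exponent
// overflow/underflow is not handled.
float multiply(float a, float b) {
    assert(std::isnormal(a) && std::isnormal(b));
    std::uint32_t ba, bb;
    std::memcpy(&ba, &a, sizeof ba);
    std::memcpy(&bb, &b, sizeof bb);

    // Sign: 0 if both signs match, 1 otherwise (an XOR).
    std::uint32_t sign = (ba >> 31) ^ (bb >> 31);

    // Exponent: both fields carry the bias, so subtract it once.
    std::int32_t exponent = static_cast<std::int32_t>((ba >> 23) & 0xFF)
                          + static_cast<std::int32_t>((bb >> 23) & 0xFF) - 127;

    // Significands: 24 bits each with the hidden 1, so the product has up to 48 bits.
    std::uint64_t sa = (ba & 0x7FFFFF) | 0x800000;
    std::uint64_t sb = (bb & 0x7FFFFF) | 0x800000;
    std::uint64_t product = sa * sb;  // value in [1.0, 4.0), scaled by 2^46

    // Normalize: if the product is 2.0 or more, shift one extra bit and bump the exponent.
    if (product & (1ULL << 47)) { product >>= 24; ++exponent; }
    else                        { product >>= 23; }

    std::uint32_t result = (sign << 31)
                         | (static_cast<std::uint32_t>(exponent) << 23)
                         | (static_cast<std::uint32_t>(product) & 0x7FFFFF);
    float out;
    std::memcpy(&out, &result, sizeof out);
    return out;
}
```

For inputs whose product is exactly representable, such as `multiply(1.5f, -0.75f)`, this gives the same result as the built-in `*` operator (here -1.125). For other inputs the last bit can differ because of the missing rounding step.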

score 0 · Answer 3 · answered Sep 30 '15 at 09:44

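    #include <cmath>
    #include <cstdint>
    #include <cstring>
    #include <iostream>
    using namespace std;

    int main()
    {
        double num = 5643.0662;

        // Copy the 64 raw bits of the double into an integer.
        uint64_t bits;
        memcpy(&bits, &num, sizeof bits);

        int sign = bits >> 63;                          // 1 bit
        int exponent = (bits >> 52) & 0x7FF;            // 11 bits, 1035 for this value
        int exponent_bias = 1023;
        uint64_t fraction = bits & 0xFFFFFFFFFFFFFULL;  // 52 stored bits
        // The stored bits are a binary fraction, so divide by 2^52.
        double mantissa = fraction / pow(2, 52);

        double x = pow(-1, sign) * pow(2, exponent - exponent_bias) * (1 + mantissa);
        double y = num - x;  // difference between the original and the rebuilt value

        cout << "\nValue of x is : " << x << endl;
        cout << "\nValue of y is : " << y << endl;

        return 0;
    }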

If there is a mistake in this code and you want to correct it, please feel free to do so. – s.zen Sep 30 '15 at 09:46
