38

I am reading a C book that discusses the ranges of floating point types, and the author gives this table:

Type     Smallest Positive Value  Largest value      Precision
====     =======================  =============      =========
float    1.17549 x 10^-38         3.40282 x 10^38    6 digits
double   2.22507 x 10^-308        1.79769 x 10^308   15 digits

I don't know where the numbers in the columns Smallest Positive Value and Largest Value come from.

Ciro Santilli OurBigBook.com
ipkiss
  • 4
    They come from the range of the floating-point type. – Kendall Frey Apr 11 '12 at 14:34
  • 14
    The correct but useless answer would be "IEEE 754". – Jerry Coffin Apr 11 '12 at 14:35
  • 2
    Do you mean why are the limits those values? – SirGuy Apr 11 '12 at 14:35
  • Is IEEE 754 required by the C standard? – foo Apr 11 '12 at 14:59
  • @foo: C does not require IEEE-754 floating-point. – Stephen Canon Apr 11 '12 at 15:06
  • As I expected, so no assumption about representation can be drawn from IEEE 754. – foo Apr 11 '12 at 15:08
  • @foo, no. But the C standard provides optional features related to IEEE 754 (search for IEC 559 and IEC 60559 which are the IEC version of IEEE 754). – AProgrammer Apr 11 '12 at 15:12
  • 5
    @foo, IEEE 754 formats are common, and some books tend not to distinguish between implementation characteristics and what the language requires. Some readers, in turn, tend not to notice when a book makes clear that it is showing characteristics of one or several common implementations that are not required by the language. – AProgrammer Apr 11 '12 at 15:14
  • @AProgrammer Yet everyone here makes that assumption without mentioning that it's optional, and not necessarily the one that is used. – foo Apr 11 '12 at 15:15

6 Answers

31

A 32-bit floating point number has 23 + 1 bits of mantissa and an 8-bit exponent (of which only -126 to 127 is used for normal numbers), so the largest number you can represent is:

(1 + 1 / 2 + ... + 1 / (2 ^ 23)) * (2 ^ 127) = 
(2 ^ 23 + 2 ^ 22 + ... + 1) * (2 ^ (127 - 23)) = 
(2 ^ 24 - 1) * (2 ^ 104) ~= 3.4e38
Andreas Brinck
  • 1
    Here I explain why the weird `-126` to `127` range which contains only 253 numbers and not the expected 255 (one is for infinity and the other subnormals): https://stackoverflow.com/a/53204544/895245 – Ciro Santilli OurBigBook.com Nov 13 '18 at 13:20
22

These numbers come from the IEEE-754 standard, which defines the standard representation of floating point numbers. The Wikipedia article on IEEE 754 explains how to arrive at these ranges knowing the number of bits used for the sign, the mantissa, and the exponent.

Sergey Kalinichenko
8

The values for the float data type come from having 32 bits in total to represent the number which are allocated like this:

1 bit: sign bit

8 bits: exponent p

23 bits: mantissa

The exponent is stored as p + BIAS, where the BIAS is 127. The mantissa has 23 stored bits plus a 24th hidden bit that is assumed to be 1. This hidden bit is the most significant bit (MSB) of the mantissa, and the exponent must be chosen so that it is 1.

This means that the smallest positive normal number you can represent is 00000000100000000000000000000000 (exponent field 1, mantissa bits all 0), which is 1 x 2^-126 = 1.17549435E-38.

The largest finite value is 01111111011111111111111111111111: the exponent field is 254 (255 is reserved for infinity and NaN), so the exponent is 127, and the mantissa is 2 * (1 - 1/16777216), which gives (1 - 1 / 16777216) * 2 ^ 128 = 3.40282347E38.

The same principles apply to double precision except the bits are:

1 bit: sign bit

11 bits: exponent bits

52 bits: mantissa bits

BIAS: 1023

So technically the limits come from the IEEE-754 standard for representing floating point numbers, and the above is how those limits come about.

SirGuy
4

Infinity, NaN and subnormals

These are important caveats that no other answer has mentioned so far.

First read this introduction to IEEE 754 and subnormal numbers: What is a subnormal floating point number?

Then, for single precision floats (32-bit):

  • IEEE 754 says that if the exponent is all ones (0xFF == 255), then it represents either NaN or Infinity.

    This is why the largest non-infinite number has exponent 0xFE == 254 and not 0xFF.

    Then with the bias, it becomes:

    254 - 127 == 127
    
  • FLT_MIN is the smallest normal number, but there are smaller subnormal ones. Those use the all-zero exponent field, the slot that would otherwise encode an exponent of -127.

All asserts of the following program pass on Ubuntu 18.04 amd64:

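#include <assert.h>
#include <float.h>
#include <inttypes.h>
#include <math.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>

/* Build a float from its three IEEE 754 binary32 bit fields. */
float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float result;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    /* memcpy avoids the strict-aliasing violation of *(float*)&bytes. */
    memcpy(&result, &bytes, sizeof(result));
    return result;
}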

int main(void) {
    /* All 1 exponent and non-0 fraction means NaN.
     * There are of course many possible representations,
     * and some have special semantics such as signalling vs not.
     */
    assert(isnan(float_from_bytes(0, 0xFF, 1)));
    assert(isnan(NAN));
    printf("nan                  = %e\n", NAN);

    /* All 1 exponent and 0 fraction means infinity. */
    assert(INFINITY == float_from_bytes(0, 0xFF, 0));
    assert(isinf(INFINITY));
    printf("infinity             = %e\n", INFINITY);

    /* ANSI C defines FLT_MAX as the largest non-infinite number. */
    assert(FLT_MAX == 0x1.FFFFFEp127f);
    /* Not 0xFF because that is infinite. */
    assert(FLT_MAX == float_from_bytes(0, 0xFE, 0x7FFFFF));
    assert(!isinf(FLT_MAX));
    assert(FLT_MAX < INFINITY);
    printf("largest non infinite = %e\n", FLT_MAX);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(FLT_MIN == float_from_bytes(0, 1, 0));
    assert(isnormal(FLT_MIN));
    printf("smallest normal      = %e\n", FLT_MIN);

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));
    printf("smallest subnormal   = %e\n", smallest_subnormal);

    return EXIT_SUCCESS;
}

GitHub upstream.

Compile and run with:

gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out

Output:

nan                  = nan
infinity             = inf
largest non infinite = 3.402823e+38
smallest normal      = 1.175494e-38
smallest subnormal   = 1.401298e-45
Ciro Santilli OurBigBook.com
2

As dasblinkenlight already answered, the numbers come from the way that floating point numbers are represented in IEEE-754, and Andreas has a nice breakdown of the maths.

However, be careful: the precision of floating point numbers isn't exactly 6 or 15 significant decimal digits as the table suggests, since the precision of IEEE-754 numbers depends on the number of significant binary digits.

  • float has 24 significant binary digits, which, depending on the number represented, translates to 6-8 decimal digits of precision.

  • double has 53 significant binary digits, which is approximately 15 decimal digits.

Another answer of mine has further explanation if you're interested.

Timothy Jones
1

It's a consequence of the size of the exponent part of the type, as in IEEE 754 for example. You can examine the limits with FLT_MAX, FLT_MIN, DBL_MAX, DBL_MIN in float.h.

foo