IEEE 754 basics
First let's review the basics of how IEEE 754 numbers are organized.
Let's focus on single precision (32-bit) first.
The format is:
- 1 bit: sign
- 8 bits: exponent
- 23 bits: fraction
The sign is simple: 0 is positive, and 1 is negative, end of story.
The exponent is 8 bits long, and so it ranges from 0 to 255.
The exponent is called biased because it has an offset of -127, e.g.:
0 == special case: zero or subnormal, explained below
1 == 2 ^ -126
...
125 == 2 ^ -2
126 == 2 ^ -1
127 == 2 ^ 0
128 == 2 ^ 1
129 == 2 ^ 2
...
254 == 2 ^ 127
255 == special case: infinity and NaN
The leading bit convention
While designing IEEE 754, engineers noticed that all numbers, except 0.0, have a 1 as their first digit in binary. E.g.:
25.0 == (binary) 11001 == 1.1001 * 2^4
0.625 == (binary) 0.101 == 1.01 * 2^-1
both start with that annoying "1." part.
Therefore, it would be wasteful to let that digit take up one precision bit in almost every single number.
For this reason, they created the "leading bit convention":
always assume that the number starts with one
But then how to deal with 0.0? Well, they decided to create an exception:
- if the exponent is 0
- and the fraction is 0
- then the number represents plus or minus 0.0
so that the bytes 00 00 00 00 also represent 0.0, which looks good.
If we only considered these rules, then the smallest non-zero number that can be represented would be:
- exponent: 0
- fraction: 1
which looks something like this as a hex fraction due to the leading bit convention:
1.000002 * 2 ^ (-127)
where .000002 is 22 zeroes with a 1 at the end.
We cannot take fraction = 0, otherwise that number would be 0.0.
But then the engineers, who also had a keen artistic sense, thought: isn't that ugly? That we jump straight from 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow?
Denormal numbers
The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:
If the exponent is 0, then:
- the leading bit becomes 0
- the exponent is fixed to -126 (not -127 as if we didn't have this exception)
Such numbers are called subnormal numbers (or denormal numbers, which is a synonym).
This rule immediately implies that the number such that:
- exponent: 0
- fraction: 0
is 0.0, which is kind of elegant as it means one less rule to keep track of.
So 0.0 is actually a subnormal number according to our definition!
With this new rule then, the smallest non-subnormal number is:
- exponent: 1 (0 would be subnormal)
- fraction: 0
which represents:
1.0 * 2 ^ (-126)
Then, the largest subnormal number is:
- exponent: 0
- fraction: 0x7FFFFF (23 bits 1)
which equals:
0.FFFFFE * 2 ^ (-126)
where .FFFFFE is once again 23 one bits to the right of the dot.
This is pretty close to the smallest non-subnormal number, which sounds sane.
And the smallest non-zero subnormal number is:
- exponent: 0
- fraction: 1
which equals:
0.000002 * 2 ^ (-126)
which also looks pretty close to 0.0!
Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever they did in the 70s instead.
As you can see, subnormal numbers make a trade-off between precision and the range of magnitudes that can be represented.
As the most extreme example, the smallest non-zero subnormal:
0.000002 * 2 ^ (-126)
has essentially a precision of a single bit instead of the usual 24 bits (23 fraction bits plus the implicit leading one). For example, if we divide it by two:
0.000002 * 2 ^ (-126) / 2
we actually reach 0.0 exactly!
Runnable C example
Now let's play with some actual code to verify our theory.
In almost all current desktop machines, C float represents single precision IEEE 754 floating point numbers.
This is in particular the case for my Ubuntu 18.04 amd64 laptop.
With that assumption, all assertions pass on the following program:
subnormal.c
#if __STDC_VERSION__ < 201112L
#error C11 required
#endif

#ifndef __STDC_IEC_559__
#error IEEE 754 not implemented
#endif

#include <assert.h>
#include <float.h> /* FLT_HAS_SUBNORM */
#include <inttypes.h>
#include <math.h> /* isnormal */
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* memcpy */

#if FLT_HAS_SUBNORM != 1
#error float does not have subnormal numbers
#endif

typedef struct {
    uint32_t sign, exponent, fraction;
} Float32;

/* Split a float into its sign, exponent and fraction fields. */
Float32 float32_from_float(float f) {
    uint32_t bytes;
    Float32 float32;
    /* memcpy instead of a pointer cast, which would break strict aliasing. */
    memcpy(&bytes, &f, sizeof(bytes));
    float32.fraction = bytes & 0x007FFFFF;
    bytes >>= 23;
    float32.exponent = bytes & 0x000000FF;
    bytes >>= 8;
    float32.sign = bytes & 0x00000001;
    return float32;
}

/* Build a float from its sign, exponent and fraction fields. */
float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float f;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    memcpy(&f, &bytes, sizeof(f));
    return f;
}

int float32_equal(
    float f,
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    Float32 float32;
    float32 = float32_from_float(f);
    return
        (float32.sign     == sign) &&
        (float32.exponent == exponent) &&
        (float32.fraction == fraction)
    ;
}

void float32_print(float f) {
    Float32 float32 = float32_from_float(f);
    printf(
        "%" PRIu32 " %" PRIu32 " %" PRIu32 "\n",
        float32.sign, float32.exponent, float32.fraction
    );
}

int main(void) {
    /* Basic examples. */
    assert(float32_equal(0.5f, 0, 126, 0));
    assert(float32_equal(1.0f, 0, 127, 0));
    assert(float32_equal(2.0f, 0, 128, 0));
    assert(isnormal(0.5f));
    assert(isnormal(1.0f));
    assert(isnormal(2.0f));

    /* Quick review of C hex floating point literals. */
    assert(0.5f == 0x1.0p-1f);
    assert(1.0f == 0x1.0p0f);
    assert(2.0f == 0x1.0p1f);

    /* Sign bit. */
    assert(float32_equal(-0.5f, 1, 126, 0));
    assert(float32_equal(-1.0f, 1, 127, 0));
    assert(float32_equal(-2.0f, 1, 128, 0));
    assert(isnormal(-0.5f));
    assert(isnormal(-1.0f));
    assert(isnormal(-2.0f));

    /* The special case of 0.0 and -0.0. */
    assert(float32_equal( 0.0f, 0, 0, 0));
    assert(float32_equal(-0.0f, 1, 0, 0));
    assert(!isnormal( 0.0f));
    assert(!isnormal(-0.0f));
    assert(0.0f == -0.0f);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(float32_equal(FLT_MIN, 0, 1, 0));
    assert(isnormal(FLT_MIN));

    /* The largest subnormal number. */
    float largest_subnormal = float_from_bytes(0, 0, 0x7FFFFF);
    assert(largest_subnormal == 0x0.FFFFFEp-126f);
    assert(largest_subnormal < FLT_MIN);
    assert(!isnormal(largest_subnormal));

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));

    return EXIT_SUCCESS;
}
GitHub upstream.
Compile and run with:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out
Visualization
It is always a good idea to have a geometric intuition about what we learn, so here goes.
If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:
          +---+-------+---------------+
exponent  |126|  127  |      128      |
          +---+-------+---------------+
          |   |       |               |
          v   v       v               v
          -----------------------------
floats    ***** * * * *   *   *   *   *
          -----------------------------
          ^   ^       ^               ^
          |   |       |               |
          0.5 1.0     2.0             4.0
From that we can see that:
- for each exponent, there is no overlap between the represented numbers
- for each exponent, we have the same number 2^23 of floating point numbers (here represented by 4 *)
- within each exponent, points are equally spaced
- larger exponents cover larger ranges, but with points more spread out
Now, let's bring that down all the way to exponent 0.
Without subnormals (hypothetical):
          +---+---+-------+---------------+
exponent  | ? | 0 |   1   |       2       |
          +---+---+-------+---------------+
          |   |   |       |               |
          v   v   v       v               v
          ---------------------------------
floats    *   ***** * * * *   *   *   *   *
          ---------------------------------
          ^   ^   ^       ^               ^
          |   |   |       |               |
          0   |   2^-126  2^-125          2^-124
              |
              2^-127
With subnormals:
          +-------+-------+---------------+
exponent  |   0   |   1   |       2       |
          +-------+-------+---------------+
          |       |       |               |
          v       v       v               v
          ---------------------------------
floats    * * * * * * * * *   *   *   *   *
          ---------------------------------
          ^   ^   ^       ^               ^
          |   |   |       |               |
          0   |   2^-126  2^-125          2^-124
              |
              2^-127
By comparing the two graphs, we see that:
- subnormals double the length of the range of exponent 0, from [2^-127, 2^-126) to [0, 2^-126)
  The space between floats in the subnormal range is the same as for [2^-126, 2^-125).
- the range [2^-127, 2^-126) has half the number of points that it would have without subnormals.
  Half of those points go to fill the other half of the range.
- the range [0, 2^-127) has some points with subnormals, but none without.
- the range [2^-128, 2^-127) has half as many points as [2^-127, 2^-126).
This is what we mean when saying that subnormals are a tradeoff between size and precision.
In the setup without subnormals, we would have an empty gap between 0 and 2^-127, which is not very elegant.
The interval [2^-127, 2^-126) would be well populated however, and would contain 2^23 floats like any other.
Implementations
x86_64 implements IEEE 754 directly in hardware, and the C code compiles down to those hardware instructions.
TODO: any notable examples of modern hardware that don't have subnormals?
TODO: does any implementation allow controlling it at runtime?
Subnormals seem to be slower than normals in certain implementations: Why does changing 0.1f to 0 slow down performance by 10x?
Infinity and NaN
Here is a short runnable example: Ranges of floating point datatype in C?