What is this "denormal data" about ? - C++

Question

I would like a broad overview of "denormal data" and what it is about. The only thing I think I have right is that it relates to floating-point values from a programmer's viewpoint, and to general-purpose computing from the CPU's standpoint.

Can someone decode these two words for me?

EDIT

Please remember that I'm focused on C++ applications and only the C++ language.

This might answer your question: http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x/9314926#9314926 — Pubby, Dec 22 '12 at 10:09
See this question for an in-depth discussion of denormals and dealing with them: http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x — fig, Feb 26 '14 at 14:56
Possible duplicate of [What is a subnormal floating point number?](https://stackoverflow.com/questions/8341395/what-is-a-subnormal-floating-point-number) — Ciro Santilli OurBigBook.com, Nov 08 '18 at 08:01
Possible duplicate of [Why does changing 0.1f to 0 slow down performance by 10x?](https://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x) — TobiMcNamobi, Nov 08 '18 at 11:09

Eric Postpischil · Accepted Answer · 2023-06-29T23:33:31.340

You ask about C++, but the specifics of floating-point values and encodings are determined by a floating-point specification, notably IEEE 754, and not by C++. IEEE 754 is by far the most widely used floating-point specification, and I will answer using it.

In IEEE 754, binary floating-point values are encoded with three parts: A sign bit s (0 for positive, 1 for negative), a biased exponent e (the represented exponent plus a fixed offset), and a significand field f (the fraction portion). For normal numbers, these represent exactly the number (−1)^s • 2^(e−bias) • 1.f, where 1.f is the binary numeral formed by writing the significand bits after “1.”. (For example, if the significand field has the ten bits 0010111011, it represents the significand 1.0010111011₂, which is 1.1826171875 or 1211/1024.)

The bias depends on the floating-point format. For 64-bit IEEE 754 binary, the exponent field has 11 bits, and the bias is 1023. When the actual exponent is 0, the encoded exponent field is 1023. Actual exponents of −2, −1, 0, 1, and 2 have encoded exponents of 1021, 1022, 1023, 1024, and 1025. When somebody speaks of the exponent of a subnormal number being zero, they mean the encoded exponent is zero. Written in normalized form, the value's actual exponent would be less than −1022. For 64-bit, the normal exponent interval is −1022 to 1023 (encoded values 1 to 2046). When the exponent moves outside this interval, special things happen.
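
To make the field layout concrete, here is a minimal C++ sketch that splits a double into the three fields and rebuilds a normal value from them. It assumes double is the 64-bit IEEE 754 binary format; DoubleFields, split_double, and rebuild_normal are names made up for this illustration, not part of any library.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 64-bit IEEE 754 binary value.
struct DoubleFields {
    std::uint64_t sign;      // 1 bit
    std::uint64_t exponent;  // 11 bits, biased by 1023
    std::uint64_t fraction;  // 52 bits (the "f" in 1.f)
};

// Split a double into its fields. std::memcpy is a portable way to read the bits.
DoubleFields split_double(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint64_t fraction_mask = (std::uint64_t{1} << 52) - 1;
    return DoubleFields{bits >> 63, (bits >> 52) & 0x7FF, bits & fraction_mask};
}

// Rebuild a *normal* number from its fields: (-1)^s * 2^(e - 1023) * 1.f
double rebuild_normal(const DoubleFields& d) {
    // 1.f viewed as an integer is 2^52 + f; scale it back down by 2^52.
    const std::uint64_t with_leading_one = (std::uint64_t{1} << 52) | d.fraction;
    const double significand = std::ldexp(static_cast<double>(with_leading_one), -52);
    const double magnitude = std::ldexp(significand, static_cast<int>(d.exponent) - 1023);
    return d.sign ? -magnitude : magnitude;
}
```

For example, split_double(0.25).exponent is 1021, matching the list of encoded exponents above, and rebuild_normal(split_double(-6.75)) gives back -6.75.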

Above this exponent interval, floating-point stops representing finite numbers. An encoded exponent of 2047 (all 1 bits) represents infinity (with the significand field set to zero). Below this exponent interval, floating-point changes to subnormal numbers. When the encoded exponent is zero, the significand field represents 0.f instead of 1.f.

There is an important reason for this. If the lowest exponent value were just another normal encoding, then the lower bits of its significand would be too small to represent as floating-point values by themselves. Without that leading “1.”, there would be no way to say where the first 1 bit was. For example, suppose you had two numbers, both with the lowest exponent, and with significands 1.0010111011₂ and 1.0000000000₂. When you subtract the significands, the result is .0010111011₂. Unfortunately, there is no way to represent this as a normal number. Because you were already at the lowest exponent, you cannot represent the lower exponent that is needed to say where the first 1 is in this result. Since the mathematical result is too small to be represented, a computer would be forced to return the nearest representable number, which would be zero.

This creates the undesirable property in the floating-point system that you can have a != b but a-b == 0. To avoid that, subnormal numbers are used. By using subnormal numbers, we have a special interval where the actual exponent does not decrease, and we can perform arithmetic without creating numbers too small to represent. When the encoded exponent is zero, the actual exponent is the same as when the encoded exponent is one, but the value of the significand changes to 0.f instead of 1.f. When we do this, a != b guarantees that the computed value of a-b is not zero.
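
A small C++ sketch of that guarantee, assuming an IEEE 754 double with subnormals enabled (no flush-to-zero mode and no -ffast-math style options). The function name is invented for this example.

```cpp
#include <cmath>
#include <limits>

// True when a and b differ and their computed difference is a nonzero
// subnormal, i.e. a value that would have been rounded to zero without subnormals.
bool difference_survives_as_subnormal(double a, double b) {
    const double diff = a - b;
    return a != b && diff != 0.0 && std::fpclassify(diff) == FP_SUBNORMAL;
}
```

With m = std::numeric_limits<double>::min(), which is 2^-1022, the call difference_survives_as_subnormal(1.5 * m, m) returns true: the exact difference is 2^-1023, which can only be stored as a subnormal.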

Here are the combinations of values in the encodings of 64-bit IEEE 754 binary floating-point:

Sign	Exponent (e)	Significand Bits (f)	Meaning
0	0	0	+zero
0	0	Non-zero	+2⁻¹⁰²²•0.f (subnormal)
0	1 to 2046	Anything	+2^(e−1023)•1.f (normal)
0	2047	0	+infinity
0	2047	Non-zero but high bit off	+, signaling NaN
0	2047	High bit on	+, quiet NaN
1	0	0	−zero
1	0	Non-zero	−2⁻¹⁰²²•0.f (subnormal)
1	1 to 2046	Anything	−2^(e−1023)•1.f (normal)
1	2047	0	−infinity
1	2047	Non-zero but high bit off	−, signaling NaN
1	2047	High bit on	−, quiet NaN
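
The table can be read as a decision procedure on the exponent and significand fields. Here is a rough C++ sketch of it, assuming a 64-bit IEEE 754 double; classify_bits is a hypothetical name, and in real code you would normally use std::fpclassify from <cmath> instead of inspecting bits yourself.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>
#include <string>

// Classify a double purely from its encoded fields, following the table rows.
std::string classify_bits(double x) {
    static_assert(sizeof(double) == sizeof(std::uint64_t), "expects a 64-bit double");
    std::uint64_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    const std::uint64_t e = (bits >> 52) & 0x7FF;                    // encoded exponent
    const std::uint64_t f = bits & ((std::uint64_t{1} << 52) - 1);   // significand field
    const bool high_bit = (f >> 51) != 0;                            // top significand bit
    if (e == 0) {
        return f == 0 ? "zero" : "subnormal";
    }
    if (e == 2047) {
        if (f == 0) return "infinity";
        return high_bit ? "quiet NaN" : "signaling NaN";
    }
    return "normal";
}
```

The sign bit is ignored here because it does not change the category, only the sign of the result.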

Some notes:

+0 and −0 are mathematically equal, but the sign is preserved. Carefully written applications can make use of it in certain special situations.

NaN means “Not a Number”. Commonly, it means some non-mathematical result or other error has occurred, and a calculation should be discarded or redone another way. Generally, an operation with a NaN produces another NaN, thus preserving the information that something has gone wrong. For example, 3 + NaN produces a NaN. A signaling NaN is intended to cause an exception, either to indicate that a program has gone wrong or to allow other software (e.g., a debugger) to perform some special action. A quiet NaN is intended to propagate through to further results, allowing the rest of a large computation to be completed, in the cases where a NaN is only a part of a large set of data and will be handled separately later or will be discarded.

The signs, + and −, are retained with NaNs but have no mathematical value.
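
The notes on signed zeros and NaNs can be checked with a few lines of C++. This is only a sketch, assuming IEEE 754 semantics are in effect (in particular, no -ffast-math style options); the function names are made up for the example.

```cpp
#include <cmath>
#include <limits>

// +0.0 and -0.0 compare equal, yet the sign bit is still observable.
bool zeros_compare_equal() { return 0.0 == -0.0; }
bool sign_of_negative_zero_is_kept() {
    return std::signbit(-0.0) && !std::signbit(0.0);
}

// An operation with a NaN operand produces a NaN: 3 + NaN is NaN.
bool three_plus_nan_is_nan() {
    const double nan = std::numeric_limits<double>::quiet_NaN();
    return std::isnan(3.0 + nan);
}

// A NaN compares unequal to everything, including another NaN,
// and flipping its sign does not change that.
bool nan_is_unequal_to_nan() {
    const double a = std::numeric_limits<double>::quiet_NaN();
    const double b = -a;
    return a != b && a != a && std::isnan(b);
}
```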

In normal programming, you should not be concerned about the floating-point encoding, except to the extent it informs you about the limits and behavior of floating-point calculations. You should not need to do anything special regarding subnormal numbers.

Unfortunately, some processors are broken in that they either violate the IEEE 754 standard by changing subnormal numbers to zero or they perform very slowly when subnormal numbers are used. When programming for such processors, you may seek to avoid using subnormal numbers.
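
If you do need to keep subnormals away from such a processor, one portable option is to test for them explicitly. A minimal sketch, assuming the standard <cmath> classification functions are adequate for your use; flush_subnormal is a hypothetical helper name, and this is a software stand-in rather than a description of what any hardware mode does internally.

```cpp
#include <cmath>
#include <limits>

// Replace a subnormal input with a zero of the same sign; pass everything else through.
float flush_subnormal(float x) {
    return std::fpclassify(x) == FP_SUBNORMAL ? std::copysign(0.0f, x) : x;
}
```

For instance, flush_subnormal(std::numeric_limits<float>::denorm_min()) returns 0.0f, while flush_subnormal(std::numeric_limits<float>::min()) returns its argument unchanged because that value is normal.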

Great answer. I never considered this, but it looks like for a float, inf and NaN are wasting around 2^24 possible values that could have been used. — 2501, Oct 26 '16 at 10:32
@2501: They are not completely wasted. The high bit of the significand field of a NaN is used to determine whether the NaN is signaling or quiet, and the other bits may be used for special purposes, such as debugging. E.g., if you initialize objects to contain NaNs with different significand values and a final result is a NaN when it should be a number, then you can examine the significand field to see where the NaN came from. — Eric Postpischil, Dec 29 '16 at 15:09
@2501: Correct. NaN "payloads" (the low n-1 bits of the mantissa) are very rarely used to carry useful information. It might have been nice if there was some way to use those nearly 2^24 encodings for gradual overflow the same way we have gradual underflow with subnormals, but mantissas are normally already full-range with their implicit leading 1, so it's not obvious how you'd do it without needing a lot of extra hardware (or software) to handle. (e.g. max exponent field means use some mantissa bits as extra exponent bits?) Posit (https://en.wikipedia.org/wiki/Unum_(number_format)) has this — Peter Cordes, Jun 30 '23 at 00:15

Hans Passant · Answer 2 · 2012-12-22T12:09:23.333

To understand de-normal floating point values you first have to understand normal ones. A floating point value has a mantissa and an exponent. In a decimal value, like 1.2345E6, 1.2345 is the mantissa, 6 is the exponent. A nice thing about floating point notation is that you can always write it normalized. For example, 0.012345E8 and 0.12345E7 are the same value as 1.2345E6. Or in other words, you can always make the first digit of the mantissa a non-zero number, as long as the value is not zero.

Computers store floating point values in binary, the digits are 0 or 1. So a property of a binary floating point value that is not zero is that it can always be written starting with a 1.

This is a very attractive optimization target. Since the value always starts with 1, there is no point in storing that 1. What is nice about it is that you in effect get an extra bit of precision for free. On a 64-bit double, the mantissa has 52 bits of storage. The actual precision is 53 bits thanks to the implied 1.
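
In C++ you can read these numbers off std::numeric_limits. A tiny sketch, assuming the usual IEEE 754 float and double; the constant names are invented for the example.

```cpp
#include <limits>

// Precision in bits, counting the implied leading 1 that is never stored.
constexpr int precision_bits_double = std::numeric_limits<double>::digits;
constexpr int precision_bits_float  = std::numeric_limits<float>::digits;

// Bits actually stored in the mantissa field: one fewer than the precision.
constexpr int stored_bits_double = precision_bits_double - 1;
constexpr int stored_bits_float  = precision_bits_float - 1;
```

On an IEEE 754 platform these come out as 53 and 52 for double, and 24 and 23 for float.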

We have to talk about the smallest possible floating point value that you can store this way. Doing it in decimal first, if you had a decimal processor with 5 digits of storage in the mantissa and 2 in the exponent then the smallest value it could store that isn't zero is 1.00000E-99. With 1 being the implied digit that isn't stored (doesn't work in decimal but bear with me). So the mantissa stores 00000 and the exponent stores -99. You cannot store a smaller number, the exponent is maxed-out at -99.

Well, you can. You could give up on the normalized representation and forget about the implied digit optimization. You can store it de-normalized. Now you can store 0.1000E-99, or 1.000E-100. All the way down to 0.0001E-99 or 1E-103, the absolute smallest number you can now store.

This is in general desirable, it extends the range of values you can store. Which tends to matter in practical computations, very small numbers are very common in real-world problems like differential analysis.

There's however also a big problem with it, you lose accuracy with de-normalized numbers. The accuracy of floating point calculations is limited by the number of digits you can store. It is intuitive with the fake decimal processor I used as an example, it can only ever compute with 5 significant digits. As long as the value is normalized, you always get 5 significant digits.

But you'll lose digits when you de-normalize. Any value between 0.1000E-99 and 0.9999E-99 has only 4 significant digits. Any value between 0.0100E-99 and 0.0999E-99 has only 3 significant digits. All the way down to 0.0001E-99 and 0.0009E-99, only one significant digit left.

This can greatly reduce the accuracy of the final calculation result. What's worse, it does so in a highly unpredictable manner since these very small de-normalized values tend to show up in a more involved calculation. That's certainly something to worry about, you cannot really trust the end result anymore when it has only 1 significant digit left.
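
The same effect can be observed on a real binary float. The sketch below scales a value down by a power of two and back up again; in the normal range that round trip is exact, but once the intermediate value is de-normalized, low bits are lost. It assumes an IEEE 754 float with de-normals enabled, and the helper names are made up for this example.

```cpp
#include <cmath>

// Scale x down by 2^shift and back up again.
// Exact as long as the intermediate value stays normalized.
float round_trip_through_scale(float x, int shift) {
    const float scaled = std::ldexp(x, -shift);
    return std::ldexp(scaled, shift);
}

// Relative difference between an approximation and the exact value.
float relative_error(float approx, float exact) {
    return std::fabs(approx - exact) / std::fabs(exact);
}
```

Taking x = 1.0f / 3.0f, a shift of 100 keeps the intermediate normalized and returns x unchanged. A shift of 140 pushes the intermediate below 2^-126, where only about 8 of the 24 significant bits fit, so the value that comes back differs from x by roughly 0.2%.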

Floating point processors have ways to let you know about this or otherwise sail around the problem. They can for example generate an interrupt or signal when a value becomes de-normalized, letting you interrupt the calculation. And they have a "flush-to-zero" option, a bit in the status word that tells the processor to automatically convert all de-normal values to zero. Which tends to generate infinities, an outcome that tells you that the result is junk and should be discarded.
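
As an illustration of the flush-to-zero switch on one specific family of processors: on x86 with SSE, the mode bit lives in the MXCSR register and can be toggled with the _MM_SET_FLUSH_ZERO_MODE macro from <xmmintrin.h>. The sketch below assumes that header is usable whenever the compiler reports SSE support; on other targets it falls back to ordinary arithmetic. HAVE_SSE_FTZ and half_of_flt_min_with_ftz are names invented for this example.

```cpp
#include <cfloat>

#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#include <xmmintrin.h>
#define HAVE_SSE_FTZ 1
#else
#define HAVE_SSE_FTZ 0
#endif

// Computes FLT_MIN * 0.5f, whose exact result (2^-127) is de-normal.
// When the SSE flush-to-zero mode is available it is switched on for the
// multiplication and restored afterwards.
float half_of_flt_min_with_ftz() {
    // volatile keeps the compiler from folding the product at compile time.
    volatile float tiny = FLT_MIN;
    volatile float half = 0.5f;
#if HAVE_SSE_FTZ
    const unsigned int saved_mode = _MM_GET_FLUSH_ZERO_MODE();
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);
    volatile float result = tiny * half;
    _MM_SET_FLUSH_ZERO_MODE(saved_mode);
    return result;
#else
    return tiny * half;  // no SSE control register here: ordinary IEEE 754 result
#endif
}
```

Where the SSE path is active and float arithmetic is done in SSE registers, the function is expected to return 0.0f; elsewhere it returns the de-normal value 2^-127. Other architectures have their own, differently named controls.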

question: who makes this choices ? As programmer i can declare, assign and use float values, but who manages this decisions about implementation details ? the hardware or the software ( compiler i guess ) ? And based on what ? — user1849534, Dec 22 '12 at 12:18
The implementation details were picked by the chip designer. The way the floating point processor is programmed to deal with de-normals is up to the programmer. Whether or not that's important is up to the algorithm designer that knows the domain. — Hans Passant, Dec 22 '12 at 12:29
can you make an example about an algorithm that cares about this ? — user1849534, Dec 22 '12 at 12:33
No, I'm just a programmer, not a designer of mathematical algorithms. You can find mathematicians at math.stackexchange.com — Hans Passant, Dec 22 '12 at 13:06
You can find some examples here http://www.amath.unc.edu/sysadmin/DOC4.0/common-tools/numerical_comp_guide/ncg_math.doc.html — aka.nice, Dec 22 '12 at 21:51

Ciro Santilli OurBigBook.com · Answer 3 · 2018-11-08T09:26:04.267

IEEE 754 basics

First let's review the basics of how IEEE 754 numbers are organized.

Let's focus on single precision (32-bit) first.

The format is:

1 bit: sign
8 bits: exponent
23 bits: fraction

The sign is simple: 0 is positive, and 1 is negative, end of story.

The exponent is 8 bits long, and so it ranges from 0 to 255.

The exponent is called biased because it has an offset of -127, e.g.:

  0 == special case: zero or subnormal, explained below
  1 == 2 ^ -126
    ...
125 == 2 ^ -2
126 == 2 ^ -1
127 == 2 ^  0
128 == 2 ^  1
129 == 2 ^  2
    ...
254 == 2 ^ 127
255 == special case: infinity and NaN

The leading bit convention

While designing IEEE 754, engineers noticed that all numbers, except 0.0, have a 1 as their first binary digit.

E.g.:

25.0   == (binary) 11001 == 1.1001 * 2^4
 0.625 == (binary) 0.101 == 1.01   * 2^-1

both start with that annoying 1. part.

Therefore, it would be wasteful to let that digit take up one precision bit in almost every single number.

For this reason, they created the "leading bit convention":

always assume that the number starts with one

But then how to deal with 0.0? Well, they decided to create an exception:

if the exponent is 0
and the fraction is 0
then the number represents plus or minus 0.0

so that the bytes 00 00 00 00 also represent 0.0, which looks good.

If we only considered these rules, then the smallest non-zero number that can be represented would be:

exponent: 0
fraction: 1

which looks something like this as a hex fraction, due to the leading bit convention:

1.000002 * 2 ^ (-127)

where .000002 is 22 zeroes with a 1 at the end.

We cannot take fraction = 0, otherwise that number would be 0.0.

But then the engineers, who also had a keen artistic sense, thought: isn't that ugly? That we jump from straight 0.0 to something that is not even a proper power of 2? Couldn't we represent even smaller numbers somehow?

Denormal numbers

The engineers scratched their heads for a while, and came back, as usual, with another good idea. What if we create a new rule:

If the exponent is 0, then:

the leading bit becomes 0

the exponent is fixed to -126 (not -127 as if we didn't have this exception)

Such numbers are called subnormal numbers (or denormal numbers, which is a synonym).

This rule immediately implies that the number such that:

exponent: 0
fraction: 0

is 0.0, which is kind of elegant as it means one less rule to keep track of.

So 0.0 is actually a subnormal number according to our definition!

With this new rule then, the smallest non-subnormal number is:

exponent: 1 (0 would be subnormal)
fraction: 0

which represents:

1.0 * 2 ^ (-126)

Then, the largest subnormal number is:

exponent: 0
fraction: 0x7FFFFF (23 bits 1)

which equals:

0.FFFFFE * 2 ^ (-126)

where .FFFFFE is once again 23 bits one to the right of the dot.

This is pretty close to the smallest non-subnormal number, which sounds sane.

And the smallest non-zero subnormal number is:

exponent: 0
fraction: 1

which equals:

0.000002 * 2 ^ (-126)

which also looks pretty close to 0.0!

Unable to find any sensible way to represent numbers smaller than that, the engineers were happy, and went back to viewing cat pictures online, or whatever they did in the 70s instead.

As you can see, subnormal numbers do a trade-off between precision and representation length.

As the most extreme example, the smallest non-zero subnormal:

0.000002 * 2 ^ (-126)

has essentially a precision of a single bit instead of 24 bits. For example, if we divide it by two:

0.000002 * 2 ^ (-126) / 2

we actually reach 0.0 exactly!
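
Since the question is about C++, here is a quick sketch of the same observation using std::numeric_limits. It assumes an IEEE 754 float, the default round-to-nearest mode, and no flush-to-zero setting; smallest_subnormal and halve are names chosen for this example.

```cpp
#include <cmath>
#include <limits>

// The smallest positive subnormal float, 2^-149 on IEEE 754 platforms.
float smallest_subnormal() {
    return std::numeric_limits<float>::denorm_min();
}

// Plain division by two; exact unless the result needs a bit that no longer exists.
float halve(float x) {
    return x / 2.0f;
}
```

Here halve(smallest_subnormal()) comes out as 0.0f, while halving the smallest normal float, std::numeric_limits<float>::min(), lands on a subnormal.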

Runnable C example

Now let's play with some actual code to verify our theory.

On almost all current desktop machines, C float represents single precision IEEE 754 floating point numbers.

This is in particular the case for my Ubuntu 18.04 amd64 laptop.

With that assumption, all assertions pass on the following program:

subnormal.c

#if __STDC_VERSION__ < 201112L
#error C11 required
#endif

#ifndef __STDC_IEC_559__
#error IEEE 754 not implemented
#endif

#include <assert.h>
#include <float.h> /* FLT_HAS_SUBNORM */
#include <inttypes.h>
#include <math.h> /* isnormal */
#include <stdlib.h>
#include <stdio.h>
#include <string.h> /* memcpy */

#if FLT_HAS_SUBNORM != 1
#error float does not have subnormal numbers
#endif

typedef struct {
    uint32_t sign, exponent, fraction;
} Float32;

/* Split a float into its sign, biased exponent, and fraction fields. */
Float32 float32_from_float(float f) {
    uint32_t bytes;
    Float32 float32;
    /* memcpy avoids the strict-aliasing violation of *(uint32_t*)&f. */
    memcpy(&bytes, &f, sizeof bytes);
    float32.fraction = bytes & 0x007FFFFF;
    bytes >>= 23;
    float32.exponent = bytes & 0x000000FF;
    bytes >>= 8;
    float32.sign = bytes & 0x00000001;
    return float32;
}

/* Assemble a float from its sign, biased exponent, and fraction fields. */
float float_from_bytes(
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    uint32_t bytes;
    float f;
    bytes = 0;
    bytes |= sign;
    bytes <<= 8;
    bytes |= exponent;
    bytes <<= 23;
    bytes |= fraction;
    memcpy(&f, &bytes, sizeof f);
    return f;
}

int float32_equal(
    float f,
    uint32_t sign,
    uint32_t exponent,
    uint32_t fraction
) {
    Float32 float32;
    float32 = float32_from_float(f);
    return
        (float32.sign     == sign) &&
        (float32.exponent == exponent) &&
        (float32.fraction == fraction)
    ;
}

void float32_print(float f) {
    Float32 float32 = float32_from_float(f);
    printf(
        "%" PRIu32 " %" PRIu32 " %" PRIu32 "\n",
        float32.sign, float32.exponent, float32.fraction
    );
}

int main(void) {
    /* Basic examples. */
    assert(float32_equal(0.5f, 0, 126, 0));
    assert(float32_equal(1.0f, 0, 127, 0));
    assert(float32_equal(2.0f, 0, 128, 0));
    assert(isnormal(0.5f));
    assert(isnormal(1.0f));
    assert(isnormal(2.0f));

    /* Quick review of C hex floating point literals. */
    assert(0.5f == 0x1.0p-1f);
    assert(1.0f == 0x1.0p0f);
    assert(2.0f == 0x1.0p1f);

    /* Sign bit. */
    assert(float32_equal(-0.5f, 1, 126, 0));
    assert(float32_equal(-1.0f, 1, 127, 0));
    assert(float32_equal(-2.0f, 1, 128, 0));
    assert(isnormal(-0.5f));
    assert(isnormal(-1.0f));
    assert(isnormal(-2.0f));

    /* The special case of 0.0 and -0.0. */
    assert(float32_equal( 0.0f, 0, 0, 0));
    assert(float32_equal(-0.0f, 1, 0, 0));
    assert(!isnormal( 0.0f));
    assert(!isnormal(-0.0f));
    assert(0.0f == -0.0f);

    /* ANSI C defines FLT_MIN as the smallest non-subnormal number. */
    assert(FLT_MIN == 0x1.0p-126f);
    assert(float32_equal(FLT_MIN, 0, 1, 0));
    assert(isnormal(FLT_MIN));

    /* The largest subnormal number. */
    float largest_subnormal = float_from_bytes(0, 0, 0x7FFFFF);
    assert(largest_subnormal == 0x0.FFFFFEp-126f);
    assert(largest_subnormal < FLT_MIN);
    assert(!isnormal(largest_subnormal));

    /* The smallest non-zero subnormal number. */
    float smallest_subnormal = float_from_bytes(0, 0, 1);
    assert(smallest_subnormal == 0x0.000002p-126f);
    assert(0.0f < smallest_subnormal);
    assert(!isnormal(smallest_subnormal));

    return EXIT_SUCCESS;
}

Compile and run with:

gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -Wpedantic -Werror -o subnormal.out subnormal.c
./subnormal.out

Visualization

It is always a good idea to have a geometric intuition about what we learn, so here goes.

If we plot IEEE 754 floating point numbers on a line for each given exponent, it looks something like this:

          +---+-------+---------------+
exponent  |126|  127  |      128      |
          +---+-------+---------------+
          |   |       |               |
          v   v       v               v
          -----------------------------
floats    ***** * * * *   *   *   *   *
          -----------------------------
          ^   ^       ^               ^
          |   |       |               |
          0.5 1.0     2.0             4.0

From that we can see that:

for each exponent, there is no overlap between the represented numbers
for each exponent, we have the same number of floats, 2^23 (here represented by 4 *)
within each exponent, points are equally spaced
larger exponents cover larger ranges, but with points more spread out
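
The spacing claims can be checked with std::nextafter. A small C++ sketch, assuming an IEEE 754 float; gap_above is a made-up helper name.

```cpp
#include <cmath>
#include <limits>

// Distance from x to the next representable float above it.
float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Within one exponent the gap is constant, so gap_above(1.0f) equals gap_above(1.5f), both 2^-23. Moving up one exponent doubles it: gap_above(2.0f) is 2^-22.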

Now, let's bring that down all the way to exponent 0.

Without subnormals (hypothetical):

          +---+---+-------+---------------+
exponent  | ? | 0 |   1   |       2       |
          +---+---+-------+---------------+
          |   |   |       |               |
          v   v   v       v               v
          ---------------------------------
floats    *   ***** * * * *   *   *   *   *
          ---------------------------------
          ^   ^   ^       ^               ^
          |   |   |       |               |
          0   |   2^-126  2^-125          2^-124
              |
              2^-127

With subnormals:

          +-------+-------+---------------+
exponent  |   0   |   1   |       2       |
          +-------+-------+---------------+
          |       |       |               |
          v       v       v               v
          ---------------------------------
floats    * * * * * * * * *   *   *   *   *
          ---------------------------------
          ^   ^   ^       ^               ^
          |   |   |       |               |
          0   |   2^-126  2^-125          2^-124
              |
              2^-127

By comparing the two graphs, we see that:

subnormals double the length of the range of exponent 0, from [2^-127, 2^-126) to [0, 2^-126)

  The space between floats in the subnormal range is the same as for [2^-126, 2^-125).

the range [2^-127, 2^-126) has half the number of points that it would have without subnormals.

  Half of those points go to fill the other half of the range.

the range [0, 2^-127) has some points with subnormals, but none without.
the range [2^-128, 2^-127) has half as many points as [2^-127, 2^-126).

This is what we mean when saying that subnormals are a tradeoff between size and precision.

Without subnormals, we would have an empty gap between 0 and 2^-127, which is not very elegant. The interval [2^-127, 2^-126) would be well populated, however, containing 2^23 floats like any other.
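
These point counts can be verified by looking at bit patterns, because for non-negative floats consecutive values have consecutive encodings. A rough C++ sketch, assuming a 32-bit IEEE 754 float with subnormals; float_bits and count_floats are names invented for this example.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Raw encoding of a float as a 32-bit unsigned integer.
std::uint32_t float_bits(float x) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects a 32-bit float");
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);
    return bits;
}

// Number of floats in [lo, hi). Only valid for 0 <= lo <= hi, where the
// ordering of values matches the ordering of their bit patterns.
std::uint32_t count_floats(float lo, float hi) {
    return float_bits(hi) - float_bits(lo);
}
```

With std::ldexp(1.0f, n) standing for 2^n, count_floats gives 2^23 for [0, 2^-126) and for [2^-126, 2^-125), 2^22 for [2^-127, 2^-126), and 2^21 for [2^-128, 2^-127).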

Implementations

x86_64 implements IEEE 754 directly in hardware, which the C code translates to.

TODO: any notable examples of modern hardware that don't have subnormals?

TODO: does any implementation allow controlling it at runtime?

Subnormals seem to be slower than normals in certain implementations: Why does changing 0.1f to 0 slow down performance by 10x?

Infinity and NaN

Here is a short runnable example: Ranges of floating point datatype in C?

IIRC, some versions of ARM NEON didn't support subnormals at all for SIMD instructions. I don't know if that's still the case, but I think AArch64 has full IEEE support. (probably with optional flush-to-zero / denormals-are-zero.) — Peter Cordes, Jun 30 '23 at 00:17

Rahul Tripathi · Answer 4 · 2012-12-22T10:12

From the IEEE Documentation

If the exponent is all 0s, but the fraction is non-zero (else it would be interpreted as zero), then the value is a denormalized number, which does not have an assumed leading 1 before the binary point. Thus, this represents a number (−1)^s × 0.f × 2^−126, where s is the sign bit and f is the fraction. For double precision, denormalized numbers are of the form (−1)^s × 0.f × 2^−1022. From this you can interpret zero as a special type of denormalized number.

it's good, is bad ... ? If you put 0 as an exponent you should obtain 1, I don't get your explanation, I would like to focus on C++ by the way. – user1849534 Dec 22 '12 at 10:15
@user1849534:- You can read this thread:- http://stackoverflow.com/questions/2487653/avoiding-denormal-values-in-c – Rahul Tripathi Dec 22 '12 at 10:17
This is not explanation abut something, it's just a collection of suggestions. – user1849534 Dec 22 '12 at 11:16
Here you have what a denormalized number is, there you can read that (1) you have less precision in denormalized numbers because there's no longer the whole mantissa available, and (2) that they slow down a lot the computations because they are mostly a corner case, and the FPU isn't optimized to handle them fast. What else isn't clear? – Matteo Italia Dec 22 '12 at 12:23
@MatteoItalia not really what I'm asking, for example considering your reply it's not even clear what "denormal values" really are and most importantly who it's in charge about what in an x86 machine. – user1849534 Dec 22 '12 at 12:36
@user1849534: how is not clear? Denormalized numbers are numbers where the exponent is zero, and in such a case there's no "implicit one" at the beginning of the mantissa to allow representation of smaller numbers using only a part of the mantissa. Obviously this won't be clear if you don't know how FP numbers work, but understanding how normalized FP numbers work is a prerequisite to understanding denormalized ones. Also, several answers here have also covered the ground of "general introduction to IEEE 754"... – Matteo Italia Dec 22 '12 at 15:56
