I am learning C++ and I am really trying to get a handle on how numbers are stored and manipulated in memory. To help test my understanding, I wrote the following program:

#include <iostream>
#include <cmath>

int main () {
    int answer = 200*300*400*500;

    std::cout << "int result: " << answer << std::endl;
    std::cout << "Size of int result: " << sizeof(answer) << std::endl;
    std::cout << "Is normal? " << std::isnormal(answer) << std::endl;

    float f_answer = 200.0*300*400*500;

    std::cout << "float result: " << f_answer << std::endl;
    std::cout << "Size of float result: " << sizeof(f_answer) << std::endl;
    std::cout << "Is normal? " << std::isnormal(f_answer) << std::endl;

    float under = 0.0000000005 * 0.0000000001 * 0.0000000001 * 0.0000000001 * 0.00001;
    std::cout << "under result: " << under << std::endl;
    std::cout << "Is normal?: " << std::isnormal(under) << std::endl;

    return 0;
}

Based on everything that I have been learning about IEEE 754, I was expecting the last block to underflow (or produce 0), because it was my understanding that the smallest magnitude a 32-bit float can hold is around 10^-38.

To my surprise, I was able to get it all the way down to around 10^-45! This made me start to look into normal vs. subnormal numbers. So, awesome... something, somewhere is figuring out that I want more precision and deciding that this variable should be stored as a subnormal. My question is: how? What is doing this? Is this C++ itself? Is it the compiler? (I'm using clang, to the best of my knowledge.) I thought the major benefit of using a language like this is that I can be pretty damn sure of what it's doing when I tell it to save a value as a float. My apologies if this is an ill-formed question; this is truly my first foray into lower-level languages, and any clarification that folks can provide would be greatly appreciated!

chrisa
  • The general principle is that the compiler should take the mathematical number you specified, and give you the closest value that is representable in the relevant type [lex.fcon p3]. If that value is a subnormal, then so be it. – Nate Eldredge Apr 04 '23 at 02:01
  • Nothing in C++ "decides" to make a value subnormal. A subnormal value is a value that has a smaller exponent than the floating point format can represent (the range of representable exponents is finite). They provide a guarantee that two non-equal values cannot be subtracted to produce zero (so can reduce chances of division by zero in long series of calculations). Practically, they have some specialised usages, but are often better avoided due to impacts on accuracy AND (since some hardware produces a trap of some form) performance. Their support by IEEE was controversial because of that. – Peter Apr 04 '23 at 02:20
  • @Peter Thank you for the context! It sounds like it's not C++ that is "deciding" anything; it's the compiler(?) that sees I'm performing a calculation that requires subnormal precision and then stores it in memory as such. Since they are "often better avoided", I suppose I'm wondering how I can avoid them if I'm performing a more esoteric calculation where it's not immediately obvious that a subnormal will be required. Is there a compiler flag that can force variables to be saved as normal rather than subnormal, even if saving as normal results in some sort of overflow/underflow error? – chrisa Apr 04 '23 at 02:46
  • There is `-ffast-math` but that breaks a lot of other floating point rules. I don't think it's guaranteed to always eliminate subnormals but it sometimes does. Interestingly, when you use it, you get the same value printed (the compiler optimized out your variable into a constant `double`) but `isnormal()` returns 1. – Nate Eldredge Apr 04 '23 at 03:15
  • Support of subnormal values is determined by the floating point representation (e.g. IEEE), and the compiler design/implementation determines what floating point representation(s) are available. The C++ standard leaves floating point representation as an implementation-defined property. Generally, it is better to focus on algorithm design (e.g. data conditioning, avoiding underflow/overflow, ordering operations for numerical stability, etc). Doing so generally avoids a bunch of low-level concerns - including subnormal values. – Peter Apr 04 '23 at 05:32
  • Simpler question: how does the compiler decide that a value should be stored as a negative number? Answer: the value is a negative number. Don't overthink it. – Pete Becker Apr 04 '23 at 13:38

1 Answer

A number, other than zero, is subnormal if its magnitude is below the normal range of the floating-point format.

The format commonly used for float is IEEE-754 binary32, also called single precision. In this format, finite numbers are represented as $\pm 2^e \cdot f$, where $e$ is an integer satisfying $-126 \le e \le 127$ and $f$ is the number represented by a 24-bit binary numeral of the form $d.ddddddddddddddddddddddd_2$, where each $d$ represents a bit.

For all the normal numbers, the first $d$ is 1, so the numeral has the form $1.ddddddddddddddddddddddd_2$, and $f$ satisfies $1 \le f < 2$. The smallest normal number is $+2^{-126} \cdot 1.00000000000000000000000_2$, which is approximately $1.1755 \cdot 10^{-38}$. The largest representable finite number is $+2^{127} \cdot 1.11111111111111111111111_2$, which is approximately $3.40282 \cdot 10^{38}$.

For the subnormal numbers, the first $d$ is 0, and $e$ is $-126$. The largest subnormal number is $+2^{-126} \cdot 0.11111111111111111111111_2$, so it is slightly less than the smallest normal number. The smallest subnormal number is $+2^{-126} \cdot 0.00000000000000000000001_2$, which is approximately $1.40130 \cdot 10^{-45}$.

To encode floating-point representations in 32 bits, the first bit is 0 for + and 1 for −. For normal numbers, the next eight bits are the binary for e+127, so they are a value from 1 to 254, inclusive. The remaining 23 bits are the 23 d bits after the “.”. The first bit for f is known to be 1 when the exponent code is 1 to 254.

For subnormal numbers, the eight bits after the sign bit are 0, and the remaining 23 bits are the 23 d bits after the “.”. The first bit for f is known to be 0 when the exponent code is 0.

The exponent code 255 is used to represent infinities and NaNs (meaning Not a Number).

Eric Postpischil