float.h-like definitions for IEEE 754 binary16 half floats

Question

I'm using half floats as implemented in the SoftFloat library (read: 100% IEEE 754 compliant), and, for the sake of completeness, I wish to provide my code with definitions equivalent to those available in <float.h> for float, double, and long double.

I know there are different flavours of half floats, but I'm just interested in the standardized one by IEEE 754, known as binary16.

From my research and my tests, I'm confident in defining some of the constants as follows:

#define HALF_MANT_DIG      11
#define HALF_DIG           3
#define HALF_DECIMAL_DIG   5
#define HALF_EPSILON       UINT16_C(0x1400) /* 0.00097656 */
#define HALF_MIN           UINT16_C(0x0400) /* 0.00006103515625 */
#define HALF_MAX           UINT16_C(0x7BFF) /* 65504.0 */

NOTE: epsilon, min, and max are defined as the raw hexadecimal representation of the 16 bits occupied by the type. The proper way of assigning a raw value to the type depends on the half float library used.
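To sanity-check raw bit patterns like these independently of any particular library, a small decoder can help. The following is only a sketch: half_bits_to_double and half_check_constants are hypothetical helper names, not part of SoftFloat, and the code assumes the standard binary16 layout (1 sign bit, 5 exponent bits with a bias of 15, 10 stored significand bits).

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>

/* Decode a raw IEEE 754 binary16 bit pattern into a double.
   Hypothetical helper for illustration; not a SoftFloat function. */
double half_bits_to_double(uint16_t bits)
{
    int sign     = (bits >> 15) & 0x1;
    int exponent = (bits >> 10) & 0x1F;   /* 5-bit biased exponent field */
    int fraction = bits & 0x3FF;          /* 10-bit stored significand   */
    double value;

    if (exponent == 0x1F)                 /* all ones: infinity or NaN */
        value = fraction ? NAN : INFINITY;
    else if (exponent == 0)               /* zero or subnormal: fraction * 2^-24 */
        value = ldexp((double)fraction, -24);
    else                                  /* normal: (1024 + fraction) * 2^(exponent - 25) */
        value = ldexp((double)(fraction | 0x400), exponent - 25);

    return sign ? -value : value;
}

/* Check the three raw constants from the question against their values. */
void half_check_constants(void)
{
    assert(half_bits_to_double(0x1400) == 0.0009765625);      /* 2^-10 */
    assert(half_bits_to_double(0x0400) == 0.00006103515625);  /* 2^-14 */
    assert(half_bits_to_double(0x7BFF) == 65504.0);
}
```

Calling half_check_constants() from a test program aborts if any of the three patterns decodes to something other than the commented value.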

However, for the exponent-related definitions, I wasn't able to find a consensus. I have looked at the Wikipedia page for binary16, at this other SO question, at the Half library, and at several other pieces of code on GitHub and elsewhere.

The proposal linked from that other SO question looks reputable to me, as does the Half library, and the good news is that they match. However, I found disagreement in the FP16.java implementation, in this implementation, in the Zig language implementation, and in Sargon for D.

#define HALF_MIN_EXP     The article and Half say (-13) but FP16.java and sargon say (-14) 
#define HALF_MAX_EXP     The article and Half say 16 but others say 14 or 15
#define HALF_MIN_10_EXP  The article and Half say (-4) but sargon says (-5)
#define HALF_MAX_10_EXP  The article and Half say 4 but sargon says 5

I'd guess the article and Half are the sources most likely to be right, but how can I know for sure what the correct values are for IEEE 754 binary16?

score 2 · Accepted Answer · answered Aug 29 '22 at 15:33

#define HALF_MANT_DIG 11

Yes, the binary16 format has 11 significant digits (bits). (10 are stored in the primary significand field and 1 is encoded via the exponent field.)

#define HALF_DIG 3

I do not have a reference at hand, so no comment. But this could be tested without too much difficulty.

#define HALF_DECIMAL_DIG 5

IEEE 754-2019 gives this as 1+ceiling(p×log₁₀(2)), where p is the “number of significant bits” in the format, hence 11, so 1+ceiling(11•.3010299957) = 1+ceiling(3.3113…) = 1+4 = 5.
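The arithmetic above can be cross-checked in exact integer arithmetic, which avoids any floating-point rounding in the check itself. This is a sketch with hypothetical function names. It also evaluates the C standard's formula for *_DIG, ⌊(p−1)•log₁₀(b)⌋ for a base b that is not a power of 10, which for p = 11 gives the 3 proposed for HALF_DIG.

```c
#include <assert.h>

/* Smallest n with 10^n >= 2^bits, i.e. ceiling(bits * log10(2)). */
int ceil_log10_pow2(int bits)
{
    unsigned long long target, power = 1;
    int n = 0;

    assert(bits >= 0 && bits < 60);   /* keeps 10^n within 64 bits */
    target = 1ULL << bits;
    while (power < target) {
        power *= 10;
        n++;
    }
    return n;
}

/* Largest n with 10^n <= 2^bits, i.e. floor(bits * log10(2)). */
int floor_log10_pow2(int bits)
{
    unsigned long long target, power = 1;
    int n = 0;

    assert(bits >= 0 && bits < 60);
    target = 1ULL << bits;
    while (power * 10 <= target) {
        power *= 10;
        n++;
    }
    return n;
}

/* 1 + ceiling(p * log10(2)): decimal digits needed to round-trip p bits. */
int decimal_dig_for_precision(int p)
{
    return 1 + ceil_log10_pow2(p);
}

/* floor((p - 1) * log10(2)): decimal digits that survive a round trip. */
int dig_for_precision(int p)
{
    return floor_log10_pow2(p - 1);
}
```

With p = 11 these return 5 and 3. With p = 24 and p = 53 they reproduce the familiar float and double values (9 and 6, 17 and 15), which suggests the helpers evaluate the formulas as intended.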

#define HALF_EPSILON UINT16_C(0x1400) /* 0.00097656 */

Yes, with 11 significand bits, 1 is represented with a high bit of 2⁰ and a low bit of 2⁻¹⁰, which is .0009765625. Its exponent of −10 is encoded with a bias of 15, giving 5 in the exponent field. The exponent field sits above the 10 stored significand bits, so the encoding is 5 << 10, which is 1400₁₆.

#define HALF_MIN UINT16_C(0x0400) /* 0.00006103515625 */

Yes, the minimum normal exponent encoding is 1, removing the bias of 15 gives an exponent of −14, so the value is 2⁻¹⁴ = .00006103515625, and 1 in the exponent field (1 << 10) gives 0400₁₆.

#define HALF_MAX UINT16_C(0x7BFF) /* 65504.0 */

Yes, the maximum normal exponent field is 30, 30 << 10 gives 7800₁₆, the maximum significand field is 1111111111₂ = 3FF₁₆, and combining them gives 7BFF₁₆. Removing the exponent bias of 15 from 30 gives 15, so the value represented is 2¹⁵•(2−2⁻¹⁰) = 65,504.
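The three encodings can be reproduced by packing the fields explicitly. This is a minimal sketch for normal numbers only; half_pack, HALF_EXP_BIAS, and HALF_FRAC_BITS are hypothetical names introduced for illustration. The significand field occupies the low 10 bits, so the biased exponent field starts at bit 10.

```c
#include <assert.h>
#include <stdint.h>

#define HALF_EXP_BIAS   15   /* binary16 exponent bias (hypothetical macro name) */
#define HALF_FRAC_BITS  10   /* stored significand bits (hypothetical macro name) */

/* Build the raw binary16 pattern of a normal number from its sign,
   IEEE-style unbiased exponent (-14..15), and 10-bit fraction field. */
uint16_t half_pack(unsigned sign, int unbiased_exp, unsigned fraction)
{
    int field = unbiased_exp + HALF_EXP_BIAS;

    assert(sign <= 1);
    assert(field >= 1 && field <= 30);   /* 0 and 31 are reserved encodings */
    assert(fraction <= 0x3FF);

    return (uint16_t)((sign << 15)
                      | ((unsigned)field << HALF_FRAC_BITS)
                      | fraction);
}
```

For example, half_pack(0, -10, 0) yields 0x1400 (epsilon), half_pack(0, -14, 0) yields 0x0400 (minimum normal), and half_pack(0, 15, 0x3FF) yields 0x7BFF (maximum finite).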

#define HALF_MIN_EXP The article and Half say (-13) but FP16.java and sargon say (-14)
#define HALF_MAX_EXP The article and Half say 16 but others say 14 or 15

C defines the floating-point representation to have the significand digits starting after the radix point, instead of having one before the radix point and the rest after. That is, for a floating-point format with base b, the significand is in [1/b, 1) instead of [1, b). This is visible in the values of *_MIN_EXP and *_MAX_EXP and the behavior of the frexp function, and the exponents are off by one from the more common definition used in IEEE 754.

Per IEEE-754, the exponent range is [−14, 15], so, for the C standard’s scaling, it is [−13, 16].
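The off-by-one between the two conventions can be observed with frexp, which returns a fraction in [1/2, 1). This is an illustrative sketch using double rather than a half type, and the two function names are hypothetical. The values 65504 and 2⁻¹⁴ are the binary16 maximum and minimum normal numbers, both exactly representable in double.

```c
#include <assert.h>
#include <math.h>

/* Exponent in the C convention: x = m * 2^e with 1/2 <= |m| < 1. */
int c_style_exponent(double x)
{
    int e;

    /* frexp gives exponent 0 for zero and an unspecified one for inf/NaN. */
    assert(x != 0.0 && isfinite(x));
    (void)frexp(x, &e);
    return e;
}

/* Exponent in the IEEE 754 convention: x = m * 2^e with 1 <= |m| < 2. */
int ieee_style_exponent(double x)
{
    return c_style_exponent(x) - 1;
}
```

Here c_style_exponent(65504.0) is 16 and c_style_exponent(0.00006103515625) is −13, while ieee_style_exponent gives 15 and −14 for the same inputs. The same shift shows up in <float.h> for float, where FLT_MIN_EXP is −125 and FLT_MAX_EXP is 128 although IEEE 754 binary32 has an exponent range of [−126, 127].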

#define HALF_MIN_10_EXP The article and Half say (-4) but sargon says (-5)

C 2018 5.2.4.2.2 12 says this is ⌈log₁₀ b^(e_min−1)⌉, where e_min is HALF_MIN_EXP, so we have ⌈log₁₀ 2^(−13−1)⌉ = ⌈log₁₀ 2⁻¹⁴⌉ = ⌈−4.2144…⌉ = −4. And we know from HALF_MIN above that 10⁻⁴ is in the normal range and 10⁻⁵ is not, so −4 is the “minimum negative integer such that 10 raised to that power is in the range of normalized floating-point numbers,” which is also in 5.2.4.2.2 12.

#define HALF_MAX_10_EXP The article and Half say 4 but sargon says 5

As above, the C standard gives this as ⌊log₁₀((1−b^(−p))•b^e_max)⌋, where e_max is HALF_MAX_EXP, so we have ⌊log₁₀((1−2⁻¹¹)•2¹⁶)⌋ = ⌊log₁₀(65,504)⌋ = ⌊4.8162…⌋ = 4, and 10⁴ is below HALF_MAX but 10⁵ is not.
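Both decimal exponent limits can also be checked directly from the wording of their definitions, by stepping through powers of 10 until they leave the normal range. The sketch below uses hypothetical function names and works in double; it assumes the minimum normal value is at most 1 and the maximum value is at least 1.

```c
#include <assert.h>

/* Largest n such that 10^n <= max_value (the *_MAX_10_EXP definition). */
int max_10_exp(double max_value)
{
    double power = 1.0;
    int n = 0;

    assert(max_value >= 1.0);
    while (power * 10.0 <= max_value) {
        power *= 10.0;
        n++;
    }
    return n;
}

/* Most negative n such that 10^n >= min_normal (the *_MIN_10_EXP definition). */
int min_10_exp(double min_normal)
{
    double power = 1.0;
    int n = 0;

    assert(min_normal > 0.0 && min_normal <= 1.0);
    while (power / 10.0 >= min_normal) {
        power /= 10.0;
        n--;
    }
    return n;
}
```

With the binary16 limits, max_10_exp(65504.0) returns 4 and min_10_exp(0.00006103515625) returns −4. The powers of 10 are not all exact in double, but for these inputs the comparisons are far from any boundary, so rounding does not affect the result.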

Thanks a lot for detailing all the checks!! – cesss, Aug 29 '22 at 16:20
