
When I was looking at the following code, I couldn't understand where the 127, 16, and 23 come from. I know the bit representations of 127 and 16, and I know how the shift operation works, but I couldn't put them together:

const FP32 f16max = { (127 + 16) << 23 };

This comes from line 358 of https://eigen.tuxfamily.org/dox/Half_8h_source.html

I know this means 1.0:

0 01111111 00000000000000000000000

The 127 must be the 01111111 exponent field, and the left shift by 23 moves it past the mantissa bits. But what does the 16 do here?

Zack

2 Answers


That code forms the bit representation of a single-precision floating point number, with the value 65536.0.

In the single-precision format, the lower 23 bits are the fractional part of the mantissa, and the next 8 bits hold the exponent plus 127. So (127 + 16) << 23 represents the number 1.0 × 2^16 = 65536.0, which is slightly more than 65504, the maximum finite half-precision floating point value.

interjay
  • 107,303
  • 21
  • 270
  • 254
  • I am still confused about the bits and how that works out to 65536. – Zack Jun 24 '19 at 17:01
  • It's simply the way to encode a floating point number with exponent 16 and mantissa 1.0. The fractional part of the mantissa is 0 so the lower 23 bits are 0. And the next bits are the exponent plus 127. – interjay Jun 24 '19 at 17:17
  • I was still not able to understand. Could you write down the actually bit representation? – Zack Jun 24 '19 at 18:03
  • 2
    0 / 10001111 / 00000000000000000000000. The slashes are the dividers between sign/exponent/mantissa. – interjay Jun 24 '19 at 19:29

An IEEE 754 single-precision binary floating-point number has the following representation:

[Image: layout of an IEEE 754 single-precision float — 1 sign bit, 8 exponent bits, 23 mantissa bits]

Here the stored exponent e is an 8-bit unsigned integer from 0 to 255 with a bias of 127. Subtracting the bias gives a range of −127 to 128, but the field values 0 and 255 are reserved (for zero/subnormals and infinities/NaN respectively), so normal numbers have exponents from −126 to 127. A normal number is then decoded as:

(−1)^(b31) × (1 + Σ b_(23−i) · 2^(−i), i = 1 … 23) × 2^(e − 127)

The line you mention uses the data-type float32_bits which is defined as:

union float32_bits {
   unsigned int u;
   float f;
};

So since float32_bits is a union, the integer u and the float f occupy the same memory. That is why, when you see a declaration such as:

const float32_bits f16max = { (127 + 16) << 23 };

you should understand it as assigning a bit pattern to a float. With the formula above, you can see that 127 is nothing more than the bias compensation, and the shift by 23 moves the value into the exponent field of the floating-point number.

So the variable f16max represents 2^16 = 65536.0 as a floating-point number (f16max.f) and 143 · 2^23 as an unsigned integer (f16max.u).


Image taken from Wikipedia: Single-precision floating-point format

kvantour