Save float16 max number in float32

Question

How can I store the float16 (https://en.wikipedia.org/wiki/Half-precision_floating-point_format) maximum value in float32 (https://en.wikipedia.org/wiki/Single-precision_floating-point_format) format?

I want a function that converts 0x7bff to 65504. 0x7bff is the bit pattern of the largest value that half precision can represent:

0 11110 1111111111 -> decimal value: 65504

I want to keep 0x7bff in my program as the actual bit pattern.

float fp16_max = bit_cast(0x7bff);
// want "std::cout << fp16_max" to print 65504

I tried to implement such a function but it didn't seem to work:

float bit_cast (uint32_t fp16_bits) {
    float i;
    memcpy(&i, &fp16_bits, 4);
    return i; 
}    
float test = bit_cast(0x7bff);
// printing test gives 4.44814e-41

I'm not sure if this is possible without manually calculating mantissa and exponent and putting that into float. If you just put bits into float, you'll get different result, because particular bits will have different meaning in single float vs. half float. — Yksisarvinen, Jul 11 '19 at 18:09
Don't use `memcpy`, instead re-assign. The floating point formats are wildly different. — tadman, Jul 11 '19 at 18:11
Possible duplicate of [32-bit to 16-bit Floating Point Conversion](https://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion) — Botje, Jul 11 '19 at 18:13
I'm pretty sure @tadman is spot on. Anything you try to do to circumvent that assignment is likely to slow it down or corrupt it. — Ted Lyngmo, Jul 11 '19 at 18:20
Side note: "_it doesn't seem to return the correct value_" What value are you referring to? I don't see any return checks. — Ted Lyngmo, Jul 11 '19 at 18:24
I noticed the edit. `float16` and `float32` are types/aliases that are not standardized. Casting any type of float* into a bit pattern will need some serious thinking. What's the format of the float you put into the buffer? Is the receiver ready for it? If you manage to cast a float to a double via a pointer on your platform, it is most likely non-conformant - or you use values that make the tests happy. Either way, not a good idea. — Ted Lyngmo, Jul 11 '19 at 18:33
I noticed the new edit. Note that IEEE 754-2008 is not mandated by the C++ standard, so, it's still a *bad idea*. — Ted Lyngmo, Jul 11 '19 at 18:36
Currently, the proper way to serialize floats is to agree with the receiver on how they will be presented. If you do it internally, you might get away with it, but you lose/prevent optimization. Do the assignment as a normal assignment. — Ted Lyngmo, Jul 11 '19 at 18:40
"_I want to have a function which could convert 0x7bff (max value of float16) to 65504 and save it in float._" - Sure, but then you are already doing it. Have you had any problems with it? — Ted Lyngmo, Jul 11 '19 at 18:44
Thanks for your help. If I do float a = 0x7bff and print it out, it's not 65504. That's why I need a conversion here. — Zack, Jul 11 '19 at 18:50
@Lemon I'm not sure but I think you are mixing things up. I do it too so this might not be precise: `float a` is your `lvalue` receiving the result of `0x7bff` which is an integer. This literal will be translated *by the compiler* to the best it can (31743 point something), to something that fits your `float`. `float`s do not manage to represent every integer. If I take a step back and look at your question and see "save". Do you need this saved data to go anywhere? Does it only need to be interpreted on the same computer on which you saved it? If so, it's sort of simple. — Ted Lyngmo, Jul 11 '19 at 18:57
Sorry for the confusion. It's happening on the same computer without going anywhere. — Zack, Jul 11 '19 at 19:00
Then, brute force it. `float little = ...; double big = ...; .write(&little, sizeof(little)); .write(&big, sizeof(big));` — Ted Lyngmo, Jul 11 '19 at 19:06
I noticed `65504` as some sort of precondition. Do you know _trap representations_? I don't know them by heart, but what is the purpose of the magic decimal 65504 value? — Ted Lyngmo, Jul 11 '19 at 19:09
This is the max value that can be represented by fp16. Its bit pattern is 0 11110 1111111111, which converts to 65504 in decimal. It's in this link: https://en.wikipedia.org/wiki/Half-precision_floating-point_format — Zack, Jul 11 '19 at 19:12
@Botje: How do you figure a question asking to convert a binary16 to a binary32 is a duplicate of a question asking to convert a binary32 to a binary16? — Eric Postpischil, Jul 11 '19 at 19:24
@Botje: Sure they do… if a person wants to root through other text and undocumented code. That’s not a suitable qualification to serve as a duplicate. — Eric Postpischil, Jul 11 '19 at 20:12

score 3 · Accepted Answer · answered Jul 11 '19 at 19:21

#include <cmath>
#include <cstdio>


/*  Decode the IEEE-754 binary16 encoding into a floating-point value.
    Details of NaNs are not handled.
*/
static float InterpretAsBinary16(unsigned Bits)
{
    //  Extract the fields from the binary16 encoding.
    unsigned SignCode        = Bits >> 15;
    unsigned ExponentCode    = Bits >> 10 & 0x1f;
    unsigned SignificandCode = Bits       & 0x3ff;

    //  Interpret the sign bit.
    float Sign = SignCode ? -1 : +1;

    //  Partition into cases based on exponent code.

    float Significand, Exponent;

    //  An exponent code of all ones denotes infinity or a NaN.
    if (ExponentCode == 0x1f)
        return Sign * (SignificandCode == 0 ? INFINITY : NAN);

    //  An exponent code of all zeros denotes zero or a subnormal.
    else if (ExponentCode == 0)
    {
        /*  Subnormal significands have a leading zero, and the exponent is the
            same as if the exponent code were 1.
        */
        Significand = 0 + SignificandCode * 0x1p-10;
        Exponent    = 1 - 0xf;
    }

    //  Other exponent codes denote normal numbers.
    else
    {
        /*  Normal significands have a leading one, and the exponent is biased
            by 0xf.
        */
        Significand = 1 + SignificandCode * 0x1p-10;
        Exponent    = ExponentCode - 0xf;
    }

    //  Combine the sign, significand, and exponent, and return the result.
    return Sign * std::ldexp(Significand, Exponent);
}


int main(void)
{
    unsigned Bits = 0x7bff;
    std::printf(
        "Interpreting the bits 0x%x as an IEEE-754 binary16 yields %.99g.\n",
        Bits,
        InterpretAsBinary16(Bits));
}
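
For comparison, here is a minimal sketch of a different technique from the one in this answer: instead of computing the value with ldexp, it builds the binary32 bit pattern from the binary16 fields with integer operations and then copies those bits into a float. The name HalfBitsToFloat is hypothetical, and the sketch assumes float is IEEE-754 binary32, which the C++ standard does not guarantee.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical helper (not part of the answer above): widen an IEEE-754
// binary16 bit pattern to a float by constructing the binary32 bits.
// Assumption: float is IEEE-754 binary32.
float HalfBitsToFloat(unsigned Bits)
{
    static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
                  "this sketch assumes float is IEEE-754 binary32");
    assert(Bits <= 0xffffu);  // only the low 16 bits are meaningful

    std::uint32_t b        = Bits;
    std::uint32_t sign     = (b & 0x8000u) << 16;  // move sign to bit 31
    std::uint32_t exponent = (b >> 10) & 0x1fu;    // 5-bit exponent code
    std::uint32_t fraction = b & 0x3ffu;           // 10-bit significand code
    std::uint32_t out;

    if (exponent == 0x1fu) {
        // Infinity or NaN: exponent field all ones, payload widened.
        out = sign | 0x7f800000u | (fraction << 13);
    } else if (exponent != 0) {
        // Normal: rebias the exponent from 15 to 127 (add 112) and widen
        // the fraction from 10 to 23 bits.
        out = sign | ((exponent + 112u) << 23) | (fraction << 13);
    } else if (fraction == 0) {
        // Signed zero.
        out = sign;
    } else {
        // Subnormal binary16 values are normal in binary32: shift the
        // fraction until its implicit leading one appears in bit 10.
        std::uint32_t e = 113;  // biased binary32 exponent for 2^-14
        while ((fraction & 0x400u) == 0) {
            fraction <<= 1;
            --e;
        }
        out = sign | (e << 23) | ((fraction & 0x3ffu) << 13);
    }

    float result;
    std::memcpy(&result, &out, sizeof result);  // copy bits, not value
    return result;
}
```

Under those assumptions, HalfBitsToFloat(0x7bff) compares equal to 65504.0f, because the constructed pattern is 0x477fe000, the binary32 encoding of 65504.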

score 1 · Answer 2 · answered Jul 11 '19 at 18:35

By the very declaration float fp16_max, your value is already a 32-bit float; no need to cast here. I guess you can simply:

float i = fp16_max;

The assumption here is that your "magic" bit_cast function already returned a 32-bit float properly. Since you haven't shown us what bit_cast does or actually returns, I'll assume it does indeed return a proper float value.

score 1 · Answer 3 · edited Jun 20 '20 at 09:12

How to save the float16 max number in float32 format?

65504

You can simply convert the integer to float:

float half_max = 65504;

If you would like to calculate the value, you can use ldexpf:

float half_max = (2 - ldexpf(1, -10)) * ldexpf(1, 15);

Or generally, for any IEEE float:

// in case of half float
int bits = 16;
int man_bits = 10;

// the calculation
int exp_bits = bits - man_bits - 1;
int exp_max = (1 << (exp_bits - 1)) - 1;
long double max = (2 - ldexp(1, -1 * man_bits)) * ldexp(1, exp_max);
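
As a sketch only, the general calculation can be wrapped in a function so it is easy to try with other formats. The name MaxFinite and its parameters are hypothetical, and the formula assumes an IEEE-754 style binary format with one sign bit, an implicit leading one, and the all-ones exponent code reserved for infinity and NaN.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: largest finite value of an IEEE-754 style binary
// format, given its total width and the number of stored significand bits.
long double MaxFinite(int total_bits, int man_bits)
{
    assert(man_bits > 0 && total_bits > man_bits + 2);

    int exp_bits = total_bits - man_bits - 1;  // one bit is the sign
    int exp_max  = (1 << (exp_bits - 1)) - 1;  // largest unbiased exponent

    // Largest significand is 2 - 2^-man_bits, scaled by 2^exp_max.
    return (2.0L - std::ldexp(1.0L, -man_bits)) * std::ldexp(1.0L, exp_max);
}
```

With these assumptions, MaxFinite(16, 10) gives 65504 for half precision, and MaxFinite(32, 23) gives the largest finite binary32 value.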

Bit casting 0x7bff does not work, because 0x7bff is the representation in the binary16 format (in some endianness), not in binary32 format. You cannot bit cast conflicting representations.
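
To make the mismatch concrete, here is a small sketch (helper names are hypothetical; it assumes float is IEEE-754 binary32 and that subnormals are not flushed to zero). Read as binary32, the pattern 0x00007bff has a zero exponent field, so it decodes as the subnormal 31743 * 2^-149, roughly 4.448e-41, which matches the value printed in the question. The binary32 pattern that actually encodes 65504 is 0x477fe000.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Reinterpret a 32-bit pattern as a float, like the question's bit_cast.
// Assumption: float is IEEE-754 binary32.
float ReinterpretAsBinary32(std::uint32_t bits)
{
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Value that binary32 decoding assigns to a pattern whose sign and exponent
// fields are zero: the significand code times 2^-149 (a subnormal).
float ExpectedSubnormal(std::uint32_t bits)
{
    assert((bits >> 23) == 0);  // sign and exponent fields must be zero
    return std::ldexp(static_cast<float>(bits), -149);
}
```

Under those assumptions, ReinterpretAsBinary32(0x7bff) equals ExpectedSubnormal(0x7bff), while ReinterpretAsBinary32(0x477fe000) equals 65504.0f.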

answered Jul 11 '19 at 18:44

eerorika

Cool. I really want to keep 0x7bff in my program. Is there any way to implement the bit_conversion function using ldexpf? – Zack Jul 11 '19 at 19:07
@Lemon you can write a function that converts a half float number to another representation (such as `float`). Then `std::memcpy` 0x7bff to the storage used to represent the half float (taking care of endianness). Then use your function to convert to `float`. – eerorika Jul 11 '19 at 19:10
