Bit shifting a half-float into a float

Question

I have no choice but to read in 2 bytes that make up a half-float. I would like to work with this in the form of a 4-byte float. I've done some research, and the only thing I can come up with is bit shifting. My only issue is that I don't fully understand how to grab only a few bits and put them into the float. I have this function, but it does not work.

float ToShortFloat(char v1, char v2) {
    float f = ((v1 << 6) | (0x00) << 3 | (v1 >> 2) | v2 | (0x00) << 13);

    return f;
}

This is the 16-bit (2-byte) structure, and this is your typical 32-bit (4-byte) float.

If you're going to write code for me, please go into good detail about it. I want to understand what's really happening with the bit operators and bit placement.

Explain the purpose of shifting the value `0x00` by _any_ number of bits. The result is still zero. Likewise, bitwise-ORing those zero values is just cluttering up your code with noise that does nothing. Describe exactly what the problem is -- does your function work the way you expect, or not? Have you read the specification for half-float? Are you aware that plenty of battle-tested implementations (including conversions) for this encoding are already out there in the wild? Have you tried a web search for half-float implementations in C++? — paddy, Feb 15 '22 at 01:39
"My only issues is that I dont fully understand how to grab only a few bits and put them into the float." Do you know what operations you want to perform, but don't know how to perform those operations in C++? https://en.wikipedia.org/wiki/Bitwise_operations_in_C — JohnFilleau, Feb 15 '22 at 01:42
Give me a piece of paper with two sets of 8-bit numbers on them (1's and 0's). Describe, in English and excruciating detail, what I need to do to those 1's and 0's to make a float representation. Step by step. Don't leave anything out. That's your algorithm. Get that first, then worry about the code. If you try to code without having your algorithm written out, it's going to take you longer. Slow is smooth and smooth is fast. — JohnFilleau, Feb 15 '22 at 01:44
@JohnFilleau I can do exactly that. I can write out the bit placement easily. I know exactly where I want the bits. In execution, I'm not sure if I'm doing it right. Yes, I'm not sure how to do it correctly in C++. I posted the code I wrote, but it doesn't work. I could dance around this for a while and still not get it right. — Justin Barren, Feb 15 '22 at 02:06
One of the main problems is that I need to grab the first 6 bits of the first byte, then later get the last 2 bits of the first byte. And this is where I'm stuck. How do I split up a byte by its bits? And how do I make sure I'm doing it right? — Justin Barren, Feb 15 '22 at 02:08
Do you have to pass the input as two `char` or could you pass it as `int16_t`? Did you check whether `char` is signed on your compiler? (this makes a difference when shifting to the right). Do you need to handle NaN/Inf cases and subnormals? — chtz, Feb 15 '22 at 02:30
Related: https://stackoverflow.com/a/62418156/6870253 (ignore the `_mm256_i32gather_epi32` part of the answer) — chtz, Feb 15 '22 at 02:32
Btw: The main reason your solution does not work as you want it to is that your bit operations result in an integer which is then converted into a float, whereas you actually want a `std::bit_cast` (before C++20 you have to do a `memcpy`). — chtz, Feb 15 '22 at 02:40
@chtz I have to read it in as 2 separate bytes. Before or after passing it, I would still need to cast it into 16 bit width. Or just move the bits around into something larger. — Justin Barren, Feb 15 '22 at 02:40
I do not know for sure if it should be a signed or unsigned char. I've used unsigned in other parts of the file, so I am expecting it to be that. — Justin Barren, Feb 15 '22 at 02:49
which specific operations of your algorithm do you not know how to implement in C++? A shift? An or? A masking? — JohnFilleau, Feb 15 '22 at 02:50
To be specific, is `f = (v1 >> 2) << 6` going to give me the first or the last 6 bits of v1? Will `<< 6` insert all 6 remaining bits into the float? Is it being inserted to the left or right side of the float? I need to add in 0s to fill the full 8 bits of the exponent. I only get 5 bits from the half-float. Will `0x00 << 3` give me 3 0s after my first 6 bits? Then I need the last 2 bits from v1 to make up the beginning of the mantissa. Will `(v1 << 6) << 2` get me the last 2 bits of v1? Then I need to get all of v2 and add 13 zeros to the end. — Justin Barren, Feb 15 '22 at 02:59
Instead of trying to bitshift one into the other, why not simply write a function that turns it into an actual number and then just assign that to whatever you need. Its not gonna be efficient, but i feel like trying to hack decimal numbers is a road to madness. — Taekahn, Feb 15 '22 at 03:25
Well sure, I can just toss these 2 bytes around all day long. The only thing that's gonna get me good data is the half-float, which is the data type that I know these bytes to be. I am writing a function that turns 2 hex bytes into a floating-point data type. — Justin Barren, Feb 15 '22 at 03:58
I suggest that you start by assigning the result of your bit-operations to an `int` (or `int32_t`) and output that (maybe in [binary](https://stackoverflow.com/questions/7349689)). It looks like you have some misunderstandings how bit-shifts work. Converting the bit-pattern to a `float` is a different task (as mentioned before, this requires a [`std::bit_cast`](https://en.cppreference.com/w/cpp/numeric/bit_cast) or equivalent). — chtz, Feb 15 '22 at 09:34
Some processors have instructions for converting between 16-bit and 32-bit float. Those should be preferred. Failing that, bit-shifting is insufficient to convert a 16-bit float and a 32-bit float (assuming IEEE-754 binary style formats). The 16-bit exponent bias is 15, and the 32-bit exponent bias is 127. So, if the exponent is normal, you must add 112 to its encoding. If it is subnormal, you have to find the leading 1 in the significand encoding, remove it, shift according to where it was, and adjust the exponent to match… — Eric Postpischil, Feb 15 '22 at 10:31
… If the exponent field is the code for infinity or NaN (31), it needs to be updated to 255 instead of just adding 112. — Eric Postpischil, Feb 15 '22 at 10:34
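
The two points raised in the comments above (splitting a byte pair into bit fields, and reinterpreting a bit pattern instead of converting its numeric value) can be sketched in C roughly as follows. This is only an illustration, not code from the thread: `HalfFields`, `SplitHalf`, and `BitsToFloat` are hypothetical names, and the sketch assumes that the first byte read is the one carrying the sign bit and that `float` is a 32-bit IEEE-754 type.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a half-float. */
typedef struct { unsigned sign, exponent, fraction; } HalfFields;

/* Join two bytes into one 16-bit value, then split out the fields.
   Assumption: `hi` is the byte that holds the sign bit.
   Using unsigned char avoids sign extension when shifting. */
static HalfFields SplitHalf(unsigned char hi, unsigned char lo)
{
    uint16_t h = (uint16_t) ((unsigned) hi << 8 | lo);

    HalfFields r;
    r.sign     = h >> 15;           /* bit 15 */
    r.exponent = h >> 10 & 0x1F;    /* bits 14..10: shift down, keep 5 bits */
    r.fraction = h & 0x3FF;         /* bits 9..0: keep the low 10 bits */
    return r;
}

/* Reinterpret a 32-bit pattern as a float. Unlike `float f = bits;`,
   which converts the integer's value, memcpy copies the bytes unchanged. */
static float BitsToFloat(uint32_t bits)
{
    float result;
    memcpy(&result, &bits, sizeof result);
    return result;
}
```

For example, `SplitHalf(0x3C, 0x00)` gives sign 0, exponent 15, and fraction 0, which is the half-float encoding of 1.0. Joining the bytes first means each field needs only one shift and one mask, so there is no need to handle "the first 6 bits" and "the last 2 bits" of a byte separately.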

Eric Postpischil · Accepted Answer · 2022-02-15T19:18:54.510

Here is code demonstrating the 16-bit floating-point to 32-bit floating-point conversion, plus a test program. The test program requires Clang's __fp16 type, but the conversion code does not. Handling of NaN payloads and signaling/non-signaling semantics is not tested.

#include <stdint.h>


//  Produce value of bit n.  n must be less than 32.
#define Bit(n)  ((uint32_t) 1 << (n))

//  Create a mask of n bits in the low bits.  n must be less than 32.
#define Mask(n) (Bit(n) - 1)


/*  Convert an IEEE-754 16-bit binary floating-point encoding to an IEEE-754
    32-bit binary floating-point encoding.

    This code has not been tested.
*/
uint32_t Float16ToFloat32(uint16_t x)
{
    /*  Separate the sign encoding (1 bit starting at bit 15), the exponent
        encoding (5 bits starting at bit 10), and the primary significand
        (fraction) encoding (10 bits starting at bit 0).
    */
    uint32_t s = x >> 15;
    uint32_t e = x >> 10 & Mask( 5);
    uint32_t f = x       & Mask(10);

    //  Left-adjust the significand field.
    f <<= 23 - 10;

    //  Switch to handle subnormal numbers, normal numbers, and infinities/NaNs.
    switch (e)
    {
        //  Exponent code is subnormal.
        case 0:
            //  Zero does not need any changes, but subnormals need normalization.
            if (f != 0)
            {
                /*  Set the 32-bit exponent code corresponding to the 16-bit
                    subnormal exponent.
                */
                e = 1 + (127 - 15);

                /*  Normalize the significand by shifting until its leading
                    bit moves out of the field.  (This code could benefit from
                    a find-first-set instruction or possibly using a conversion
                    from integer to floating-point to do the normalization.)
                */
                while (f < Bit(23))
                {
                    f <<= 1;
                    e -= 1;
                }

                //  Remove the leading bit.
                f &= Mask(23);
            }
            break;

        // Exponent code is normal.
        default:
            e += 127 - 15;  //  Adjust from 16-bit bias to 32-bit bias.
            break;

        //  Exponent code indicates infinity or NaN.
        case 31:
            e = 255;        //  Set 32-bit exponent code for infinity or NaN.
            break;
    }

    //  Assemble and return the 32-bit encoding.
    return s << 31 | e << 23 | f;
}


#include <inttypes.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>


int main(void)
{
    //  Use unions so we can iterate and manipulate the encodings.
    union { uint16_t enc; __fp16 value; } x;
    union { uint32_t enc; float  value; } y;

    //  Iterate through all 16-bit encodings.
    for (uint32_t i = 0; i < Bit(16); ++i)
    {
        x.enc = i;
        y.enc = Float16ToFloat32(x.enc);
        if (isnan(x.value) != isnan(y.value) ||
            !isnan(x.value) && x.value != y.value)
        {
            printf("Failure:\n");
            printf("\tx encoding = 0x%04" PRIx16 ",     value = %.99g.\n",
                x.enc, x.value);
            printf("\ty encoding = 0x%08" PRIx32 ", value = %.99g.\n",
                y.enc, y.value);
            exit(EXIT_FAILURE);
        }
    }
}
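
To connect this back to the question, which starts from two separate bytes, a wrapper might look something like the sketch below. `HalfBytesToFloat` and `HalfBitsToFloatBits` are hypothetical names. The second is a condensed restatement of the conversion above, repeated only so that the sketch compiles on its own. The byte order (first argument holds the sign bit) is an assumption about the file being read, and `memcpy` is used in place of a union to turn the encoding into a `float`.

```c
#include <stdint.h>
#include <string.h>

/* Condensed restatement of the conversion shown above, repeated here
   only so that this sketch is self-contained. */
static uint32_t HalfBitsToFloatBits(uint16_t x)
{
    uint32_t s = x >> 15;
    uint32_t e = x >> 10 & 0x1F;
    uint32_t f = (uint32_t) (x & 0x3FF) << 13;  /* left-adjust the fraction */

    if (e == 31)
        e = 255;                        /* infinity or NaN */
    else if (e != 0)
        e += 127 - 15;                  /* normal: change the bias */
    else if (f != 0)                    /* subnormal: normalize */
    {
        e = 1 + (127 - 15);
        while (f < (UINT32_C(1) << 23))
        {
            f <<= 1;
            e -= 1;
        }
        f &= (UINT32_C(1) << 23) - 1;   /* drop the leading bit */
    }
    /* Otherwise e == 0 and f == 0: a zero, which needs no change. */

    return s << 31 | e << 23 | f;
}

/* Hypothetical wrapper. Assumption: `hi` is the byte holding the sign bit. */
float HalfBytesToFloat(unsigned char hi, unsigned char lo)
{
    uint16_t half = (uint16_t) ((unsigned) hi << 8 | lo);
    uint32_t bits = HalfBitsToFloatBits(half);

    float result;
    memcpy(&result, &bits, sizeof result);  /* reinterpret, do not convert */
    return result;
}
```

With this, `HalfBytesToFloat(0x3C, 0x00)` produces 1.0f and `HalfBytesToFloat(0xC0, 0x00)` produces -2.0f. If the file stores the low byte first, the two arguments would need to be swapped.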

As chtz points out, we can use 32-bit floating-point arithmetic to handle the scaling adjustment for both normal and subnormal values. To do this, replace the code in Float16ToFloat32 after f <<= 23 - 10; with:

    //  For infinities and NaNs, set 32-bit exponent code.
    if (e == 31)
        return s << 31 | 255 << 23 | f;

    /*  For finite values, reassemble with shifted fields and using a
        floating-point multiply to adjust for the changed exponent bias.
    */
    union { uint32_t enc; float  value; } y = { .enc = s << 31 | e << 23 | f };
    y.value *= 0x1p112f;
    return y.enc;

Instead of adding 112 to the exponent, you could multiply by `2**112`, which would take care of the subnormals and normals (infinity and NaNs still require special handling). — chtz, Feb 15 '22 at 17:11
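
The multiplication works because placing the 16-bit exponent and fraction fields directly into the 32-bit fields produces a value that is read with a bias of 127 instead of 15, so it is too small by a factor of 2^(127-15) = 2^112. The same factor applies to subnormals, whose fixed exponents are -14 in the 16-bit format and -126 in the 32-bit format. A self-contained sketch of the assembled variant might look like this. `Float16BitsToFloatByScaling` is a hypothetical name, the function returns a `float` directly instead of an encoding, and it uses `memcpy` in place of the union. It also assumes the floating-point environment does not flush subnormals to zero.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: multiply-based variant of the conversion above. */
float Float16BitsToFloatByScaling(uint16_t x)
{
    /* Separate the fields and left-adjust the fraction, as before. */
    uint32_t s = x >> 15;
    uint32_t e = x >> 10 & 0x1F;
    uint32_t f = (uint32_t) (x & 0x3FF) << 13;

    uint32_t bits;
    float value;

    /* Infinities and NaNs: set the 32-bit exponent code directly. */
    if (e == 31)
    {
        bits = s << 31 | UINT32_C(255) << 23 | f;
        memcpy(&value, &bits, sizeof value);
        return value;
    }

    /* Finite values: reassemble with the exponent field unchanged, then
       let a multiply by 2^112 correct for the difference in bias.
       Assumption: subnormal floats are not flushed to zero. */
    bits = s << 31 | e << 23 | f;
    memcpy(&value, &bits, sizeof value);
    return value * 0x1p112f;
}
```

For instance, the smallest positive 16-bit subnormal, encoding 0x0001, becomes 2^-24, and the multiply is exact for every finite input because the result always fits in a 32-bit float's significand.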

score 0 · Answer 2 · answered May 02 '22 at 08:38

Although this question has been answered with a correct implementation, you can do the conversion a lot faster. Here are much faster IEEE-754 FP32<->FP16 conversion algorithms, without any loops or branching. They handle normal and denormal numbers, and give up NaN/Inf in exchange for double the range.
