0

How can I convert a float (float32) to a half (float16) and the other way around in C while accounting for edge cases like NaN, Infinity etc.

I don't need arithmetic because I just need the types in order to fulfill the requirement of supporting them. So the half type can just be a uint16_t or corresponding typedef. I only found ways to do it in C++ or some that didn't account for edge cases like NaN.

I need to convert a float into a half type that can be represented as a simple uint16_t, this uint16_t should just contain the binary representation of the half since I won't do arithmetic on it. I require this so I can fulfill a library requirement. I can't use existing implementations because they are build as shared libraries (also mostly for C++) which I can't use in this case. Also the GCC/Clang __fp16 and _Float16 won't work because I compile my code very to web assembly that will run in an isolated environment and as such can't use native dependent code (No WASI) (And EMCC throws an error when using _FloatXX types).

user16217248
  • 3,119
  • 19
  • 19
  • 37
juffma
  • 39
  • 5
  • 1
    Well, what problem do you have? [single](https://en.wikipedia.org/wiki/Single-precision_floating-point_format) [half](https://en.wikipedia.org/wiki/Half-precision_floating-point_format) – ikegami Jul 30 '23 at 18:32
  • @ikegami i don't know how to do the conversion and don't trust AI enough to do the code for me – juffma Jul 30 '23 at 18:36
  • add some `if`s then :) – 0___________ Jul 30 '23 at 19:04
  • 1
    There is a significant difference between integers and floats. Please give us some examples – 0___________ Jul 30 '23 at 19:05
  • @0___________, They want to convert a IEEE single to a IEEE half. The arch uses `float` for IEEE single, and it doesn't have a type for IEEE half, so `uint16_t` is the obvious/desired type to use for the produced value. Obviously, the object couldn't be used as an integer. – ikegami Jul 30 '23 at 19:10
  • @ikegami *"So the half type can just be a `uint16_t` or corresponding typedef"* What aout this? – 0___________ Jul 30 '23 at 19:17
  • @ikegami is it "falf type" of `float` The question is very unclear and I think that OP doesn't know what he wants to achieve too – 0___________ Jul 30 '23 at 19:18
  • @0___________ See the link in my first comment. IEEE half-precision floating point numbers are 16-bit floating point numbers. – ikegami Jul 30 '23 at 19:19
  • @ikegami are they inter types? (rethoric question) – 0___________ Jul 30 '23 at 19:20
  • @0___________ No. But that's not relevant. There's no intention to use them as integers. Are you replying without reading? – ikegami Jul 30 '23 at 19:20
  • @ikegami so why were they mentioned? – 0___________ Jul 30 '23 at 19:21
  • @0___________, They want to convert a IEEE single to a IEEE half. The arch uses float for IEEE single, and it doesn't have a type for IEEE half, so uint16_t is the obvious/desired type to use for the produced value. Obviously, the object couldn't be used as an integer. – ikegami Jul 30 '23 at 19:21
  • @0___________, You can use a `uint16_t` to store bit patterns that don't correspond to numbers, including IEEE half precision numbers. – ikegami Jul 30 '23 at 19:22
  • @ikegami it is only your assumption. OP has t clarify the question – 0___________ Jul 30 '23 at 19:22
  • 2
    @0___________, No, that's what they said. You seem to have problems understanding everyone. Maybe you're foggy at the moment, but that's not a problem with the OP. "the type for the half can just be a `uint16_t`" would be better than "the half type can just be a `uint16_t`", but that's hardly unclear. – ikegami Jul 30 '23 at 19:25
  • You probably should clarify if you mean bf16, fp16, or something else. – Simon Goater Jul 30 '23 at 20:38
  • Recent gcc versions support `_Float16` on a couple of popular architectures, making it easy if non-portable. I can provide an example if this gets reopened. – Shawn Jul 30 '23 at 23:44
  • 1
    @0___________: It is quite clear. OP says the float16 type can be `uint16_t` because “I don't need arithmetic”. The only float16 operations they need are converting to and from float32; they do not need float16 arithmetic like addition, multiplication, and so on. The float16 format is just for storage, not calculation. That means they only need something to hold the bit pattern of a float16, which a `uint16_t` is fine for. – Eric Postpischil Jul 30 '23 at 23:56
  • @Shawn That sadly wouldn't work, see my edit. – juffma Jul 31 '23 at 08:51
  • @0___________ ikegami and Eric Postpischil are correct, i only wan't to use uint16_t as a container for the binary representation. – juffma Jul 31 '23 at 08:53
  • C code for 16-bit to 32-bit is [here](https://stackoverflow.com/a/71125539/298225). An algorithm for 32-bit to 16-bit is [here](https://stackoverflow.com/a/16171497/298225). – Eric Postpischil Jul 31 '23 at 23:55

3 Answers3

2

You have a number of options:

  1. Use an existing impl e.g. the one from Industrial Light & Magic. There are a number of other impl as well.
  2. Use an intrinsic e.g. for Intel CPU's you have _mm_cvtps_ph and _mm_cvtph_ps which can convert upto 4 values at a time.
  3. Write your own using knowledge of the IEEE 754 float format and that used by half.

Edit: Since you mainly just want to convert back-and-forth, the two functions to look at in the ILM code are: Float to half: Line 85 and half to float: line 62

Chris Kushnir
  • 227
  • 2
  • 4
  • The first one wouldn't work because it's using C++ which in my usecase isn't optimal. the second option isn't possible either (see my edit). And as stated in this edit i'm not smart enough to understand IEEE notation as stated in this link. – juffma Jul 31 '23 at 08:49
  • 1
    If you look at line 221 in the ILM header link I posted you'll see they store the half in an unsigned short, which is what you asked to do. The fact that they wrap that in a class shouldn't be a big deal, just convert the methods to c functions i.e. all the methods that use _h would now take it as a param or return it when converted to a method. – Chris Kushnir Jul 31 '23 at 14:40
0

Code for the 16-bit to 32-bit conversion is here.

Below is quickly written tested code for 32-bit to 16-bit conversion, largely based on the algorithm here. I do not have time to document it properly at the moment, and it can be improved. The tests do not check handling of the payload bits (in the significand field) of NaNs.

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>


//  Provide bit N (equal to 2**N), for 0 <= N < 32.
#define Bit(N)  ((uint32_t) 1 << (N))

//  Provide a mask of N bits, for 0 <= N < 32.
#define Mask(N) (((uint32_t) 1 << (N)) - 1)


/*  Convert a 32-bit floating-point number (with 1 sign bit, 8 exponent field
    bits, and 23 significand field bits) to a 16-bit floating-point number
    (with 1 sign bit, 5 exponent field bits, and 10 significand field bits).
*/
uint16_t ConvertF32ToF16(uint32_t x)
{
    uint32_t SignBit          = x >> 31;
    uint32_t ExponentField    = x >> 23 & Mask( 8);
    uint32_t SignificandField = x       & Mask(23);

    //  All-bits-set in the exponent field means infinity or NaN.
    if (ExponentField == Mask(8))
    {
        //  Zero significand field means infinity.
        if (SignificandField == 0)
            //  Return infinity with copied sign bit.
            return SignBit << 15 | Mask(5) << 10;

        //  Otherwise, we have a NaN.
        else
        {
            //  Truncate significand field.
            SignificandField >>= 23-10;

            /*  If truncated field is zero, turn on a bit so it indicates a
                NaN, not an infinity.
            */
            if (SignificandField == 0)
                SignificandField = 1;

            return SignBit << 15 | Mask(5) << 10 | SignificandField;
        }
    }

    //  All-bits-zero in the exponent field indicates a subnormal number.
    else if (ExponentField == 0)
    {
        /*  All subnormal 32-bit floating-point numbers are too small to
            convert to anything but zero in the 16-bit format.  Return zero
            with the sign bit copied.
        */
        return SignBit << 15;
    }

    //  Everything else is a normal number.
    else
    {
        //  Convert the exponent from the 32-bit format to the 16-bit format.
        int Exponent = (int) ExponentField - 127 + 15;

        //  Small numbers convert to zero.
        if (Exponent < -11)
            return SignBit << 15;

        /*  Decode the significand field.  Note the radix point is between
            bits 23 and 22.
        */
        uint32_t Significand = Bit(23) | SignificandField;

        //  Process values that are subnormal in the 16-bit format.
        if (Exponent < 1)
        {
            /*  Below, we shift the significand to where it will be when the
                exponent is increased to the minimum exponent of the 16-bit
                format.  Here, we move the bits that will shift out to the
                left, to isolate them.
            */
            uint32_t T = Significand << 32 - (1 - Exponent) - 13;
            Significand >>= 1 - Exponent + 13;
            Exponent = 1;

            /*  If the removed bits are greater than half the low bit, we round
                up.
            */
            if (Bit(31) < T)
                ++Significand;

            /*  Otherwise, if the removed bits equal half the low bit, we round
                to make the low bit of the significand even.
            */
            if (Bit(31) == T)
                Significand += Significand & 1;

            //  That could carry to significand to 2.
            if (Bit(10) <= Significand)
                return SignBit << 15 | 1 << 10 | 0;

            return SignBit << 15 | 0 << 10 | (Significand & Mask(10));
        }

        uint32_t T = Significand & Mask(13);
        if (Bit(12) < T || Bit(12) == T && (Significand & Bit(13)))
            Significand += Bit(13);
        Significand >>= 13;
        if (Bit(11) <= Significand)
        {
            ++Exponent;
            Significand >>= 1;
        }

        if (31 <= Exponent)
            return SignBit << 15 | Mask(5) << 10;

        return SignBit << 15 | Exponent << 10 | (Significand & Mask(10));
    }
}


#include <stdlib.h>


static void Test(float x)
{
    uint16_t y0 = (union { _Float16 f; uint16_t u; }) { x } .u;

    uint32_t xu = (union { float f; uint32_t u; }) { x } .u;
    uint16_t y1 = ConvertF32ToF16(xu);

    if (y0 == y1) return;

    printf("%a -> 0x%04hx but expected 0x%04hx.\n", x, y1, y0);
    exit(EXIT_FAILURE);
}


#include <math.h>


int main(void)
{
    Test(-NAN);
    Test(+NAN);
    for (float x = -INFINITY; x < INFINITY; x = nexttowardf(x, INFINITY))
        Test(x);
    Test(INFINITY);
}
Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
0

Below I am showing an ISO-C99 implementation of float to half conversion that has been tested exhaustively. The following assumptions apply: float maps to IEEE-754 binary32 format while half maps to IEEE-754 binary16 format; both floating-point and integer data types use the same endianness when stored; conversion to a narrower floating-point type should utilize the rounding mode to-nearest-or-even.

As a golden reference the test framework uses the x86-64 instruction set extension F16C, introduced in 2011 to support half precision (FP16) as a storage type. IEEE-754 NaN handling contains some architecture specific elements, and the float2half_rn() function below was designed to mimic the x86-64 behavior. Adjustments, for example switching to the use of a single canonical NaN encoding, are trivial.

The code below is derived from code that I previously published under a BSD license here. I used the Intel Compiler Version 13.1.3.198 Build 20130607 to build this code and ran the exhaustive test on an IvyBridge CPU.

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <string.h>
#include "immintrin.h"

uint32_t float_as_uint32 (float a)
{
    uint32_t r;
    memcpy (&r, &a, sizeof r);
    return r;
}

uint16_t float2half_rn (float a)
{
    uint32_t ia = float_as_uint32 (a);
    uint16_t ir;

    ir = (ia >> 16) & 0x8000;
    if ((ia & 0x7f800000) == 0x7f800000) {
        if ((ia & 0x7fffffff) == 0x7f800000) {
            ir |= 0x7c00; /* infinity */
        } else {
            ir |= 0x7e00 | ((ia >> (24 - 11)) & 0x1ff); /* NaN, quietened */
        }
    } else if ((ia & 0x7f800000) >= 0x33000000) {
        int shift = (int)((ia >> 23) & 0xff) - 127;
        if (shift > 15) {
            ir |= 0x7c00; /* infinity */
        } else {
            ia = (ia & 0x007fffff) | 0x00800000; /* extract mantissa */
            if (shift < -14) { /* denormal */  
                ir |= ia >> (-1 - shift);
                ia = ia << (32 - (-1 - shift));
            } else { /* normal */
                ir |= ia >> (24 - 11);
                ia = ia << (32 - (24 - 11));
                ir = ir + ((14 + shift) << 10);
            }
            /* IEEE-754 round to nearest of even */
            if ((ia > 0x80000000) || ((ia == 0x80000000) && (ir & 1))) {
                ir++;
            }
        }
    }
    return ir;
}

uint16_t float2half_rn_ref (float a)
{
    __m128 pa = _mm_set_ps1 (a);
    __m128i r16 = _mm_cvtps_ph (pa, _MM_FROUND_TO_NEAREST_INT);
    uint16_t res;
    memcpy (&res, &r16, sizeof res);
    return res;
}

float uint32_as_float (uint32_t a)
{
    float r;
    memcpy (&r, &a, sizeof r);
    return r;
}

int main (void)
{
    float arg;
    uint16_t resi, refi;
    uint32_t argi = 0;
    do {
        arg = uint32_as_float (argi);
        refi = float2half_rn_ref (arg);
        resi = float2half_rn (arg);
        if (resi != refi) {
            printf ("error @ %15.8e (%08x): resi=%04x  refi=%04x\n", 
                    arg, argi, resi, refi);
            return EXIT_FAILURE;
        }
        argi++;
        if ((argi & 0xffffff) == 0) printf ("\r%08x", argi);
    } while (argi);
    return EXIT_SUCCESS;
}

njuffa
  • 23,970
  • 4
  • 78
  • 130