Convert int to 16bit float (half precision floating point) in c++

Question

How can I convert an integer to a half-precision float (which is to be stored in an array unsigned char[2])? The input int will be in the range 1-65535. Precision is really not a concern.

I am doing something similar to convert a 16-bit int into an unsigned char[2], but I understand there is no half-precision float datatype in C++. Example below:

int16_t position16int = (int16_t)data;
memcpy(&dataArray, &position16int, 2);

Maybe relevant : http://gamedev.stackexchange.com/a/17410/9333 — slaphappy, Oct 10 '12 at 10:49
With the tip from Goz maybe of help: [32-bit to 16-bit Floating Point Conversion](http://stackoverflow.com/q/1659440/237483) — Christian Ammer, Oct 10 '12 at 11:35

score 5 · Accepted Answer · answered Oct 10 '12 at 12:15

It's very straightforward; all the info you need is on Wikipedia.

Sample implementation:

#include <stdio.h>

unsigned int2hfloat(int x)
{
  unsigned sign = x < 0;
  unsigned absx = ((unsigned)x ^ -sign) + sign; // safe abs(x)
  unsigned tmp = absx, manbits = 0;
  int exp = 0, truncated = 0;

  // calculate the number of bits needed for the mantissa
  while (tmp)
  {
    tmp >>= 1;
    manbits++;
  }

  // half-precision floats have 11 bits in the mantissa.
  // truncate the excess or insert the lacking 0s until there are 11.
  if (manbits)
  {
    exp = 10; // exp bias because 1.0 is at bit position 10
    while (manbits > 11)
    {
      truncated |= absx & 1;
      absx >>= 1;
      manbits--;
      exp++;
    }
    while (manbits < 11)
    {
      absx <<= 1;
      manbits++;
      exp--;
    }
  }

  if (exp > 15 || (exp == 15 && truncated && absx == 0x7FF))
  {
    // absx exceeded the largest finite value (65504), force it to +/- infinity
    exp = 31; // special infinity value
    absx = 0;
  }
  else if (manbits)
  {
    // normal case, absx > 0
    exp += 15; // bias the exponent
  }

  return (sign << 15) | ((unsigned)exp << 10) | (absx & ((1u<<10)-1));
}

int main(void)
{
  printf(" 0: 0x%04X\n", int2hfloat(0));
  printf("-1: 0x%04X\n", int2hfloat(-1));
  printf("+1: 0x%04X\n", int2hfloat(+1));
  printf("-2: 0x%04X\n", int2hfloat(-2));
  printf("+2: 0x%04X\n", int2hfloat(+2));
  printf("-3: 0x%04X\n", int2hfloat(-3));
  printf("+3: 0x%04X\n", int2hfloat(+3));
  printf("-2047: 0x%04X\n", int2hfloat(-2047));
  printf("+2047: 0x%04X\n", int2hfloat(+2047));
  printf("-2048: 0x%04X\n", int2hfloat(-2048));
  printf("+2048: 0x%04X\n", int2hfloat(+2048));
  printf("-2049: 0x%04X\n", int2hfloat(-2049)); // first inexact integer
  printf("+2049: 0x%04X\n", int2hfloat(+2049));
  printf("-2050: 0x%04X\n", int2hfloat(-2050));
  printf("+2050: 0x%04X\n", int2hfloat(+2050));
  printf("-32752: 0x%04X\n", int2hfloat(-32752));
  printf("+32752: 0x%04X\n", int2hfloat(+32752));
  printf("-32768: 0x%04X\n", int2hfloat(-32768));
  printf("+32768: 0x%04X\n", int2hfloat(+32768));
  printf("-65504: 0x%04X\n", int2hfloat(-65504)); // legal maximum
  printf("+65504: 0x%04X\n", int2hfloat(+65504));
  printf("-65505: 0x%04X\n", int2hfloat(-65505)); // infinity from here on
  printf("+65505: 0x%04X\n", int2hfloat(+65505));
  printf("-65535: 0x%04X\n", int2hfloat(-65535));
  printf("+65535: 0x%04X\n", int2hfloat(+65535));
  return 0;
}

Output (ideone):

 0: 0x0000
-1: 0xBC00
+1: 0x3C00
-2: 0xC000
+2: 0x4000
-3: 0xC200
+3: 0x4200
-2047: 0xE7FF
+2047: 0x67FF
-2048: 0xE800
+2048: 0x6800
-2049: 0xE800
+2049: 0x6800
-2050: 0xE801
+2050: 0x6801
-32752: 0xF7FF
+32752: 0x77FF
-32768: 0xF800
+32768: 0x7800
-65504: 0xFBFF
+65504: 0x7BFF
-65505: 0xFC00
+65505: 0x7C00
-65535: 0xFC00
+65535: 0x7C00
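
One way to sanity-check bit patterns like the ones above is to decode them back into ordinary numbers. The following is a minimal sketch, assuming the standard IEEE 754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 stored mantissa bits). The helper name `half_bits_to_double` is hypothetical and not part of the answer.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Hypothetical helper (not from the answer): decode a binary16 bit pattern.
double half_bits_to_double(std::uint16_t h)
{
    const int sign = (h >> 15) & 1;     // 1 sign bit
    const int exp  = (h >> 10) & 0x1F;  // 5 exponent bits, bias 15
    const int man  = h & 0x3FF;         // 10 stored mantissa bits

    double value;
    if (exp == 0) {
        // Zero and subnormals: man * 2^-24
        value = std::ldexp(static_cast<double>(man), -24);
    } else if (exp == 31) {
        // All-ones exponent: infinity (mantissa 0) or NaN (mantissa != 0)
        value = man ? std::numeric_limits<double>::quiet_NaN()
                    : std::numeric_limits<double>::infinity();
    } else {
        // Normal numbers: (1024 + man) * 2^(exp - 15 - 10)
        value = std::ldexp(static_cast<double>(man | 0x400), exp - 25);
    }
    return sign ? -value : value;
}
```

For example, `half_bits_to_double(0x6801)` yields 2050 and `half_bits_to_double(0x7BFF)` yields 65504, matching the table.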

It should be noted that this code has abnormal rounding behavior. Largely, it truncates rather than rounds to nearest (which is more common) but is anomalous at the upper end. Inputs greater than the maximum representable finite value are converted to infinity, rather than being truncated like other inputs, even if they are only slightly greater than the maximum. For example 0xfff (4095) is converted to 0x6bff (4094), but 0xfff0 (65520) or 0xffe1 (65505) is converted to 0x7c00 (infinity) rather than 0x7bff (65504). — Eric Postpischil, Oct 10 '12 at 13:58
@EricPostpischil You're right. But that was my interpretation of `Precision is really not a concern`. — Alexey Frunze, Oct 10 '12 at 14:01
@EricPostpischil How can I implement the same function using round-to-even method? — envy grunt, Dec 17 '19 at 20:01
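
Regarding the round-to-even question: one possible approach is to keep the discarded low bits and compare them against the halfway point before deciding whether to bump the mantissa. The sketch below is not from the thread. It assumes non-negative inputs (the asker's range is 1-65535), and the name `uint_to_half_rne` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical sketch: unsigned integer -> binary16 bits, round to nearest, ties to even.
std::uint16_t uint_to_half_rne(std::uint32_t x)
{
    if (x == 0) return 0;

    // Find the position of the highest set bit; this is the unbiased exponent.
    int msb = 31;
    while (!(x >> msb)) --msb;
    int exp = msb;

    std::uint32_t man;  // 11 significant bits, leading 1 at bit 10
    if (msb > 10) {
        const int shift = msb - 10;
        const std::uint32_t rem  = x & ((1u << shift) - 1);  // discarded bits
        const std::uint32_t half = 1u << (shift - 1);        // halfway point
        man = x >> shift;
        // Round up if above halfway, or exactly halfway and the result is odd.
        if (rem > half || (rem == half && (man & 1u))) {
            ++man;
            if (man == 0x800u) {  // mantissa overflowed into a 12th bit
                man = 0x400u;
                ++exp;
            }
        }
    } else {
        man = x << (10 - msb);  // small values are exact; just normalize
    }

    if (exp > 15) return 0x7C00;  // too large for a finite half: +infinity
    return static_cast<std::uint16_t>(((exp + 15) << 10) | (man & 0x3FFu));
}
```

With this rounding, 2049 still maps to 0x6800 (tie, even result kept), 2051 maps to 0x6802, and 65520 is the first input that becomes infinity.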

score 2 · Answer 2 · edited May 23 '17 at 12:17

I previously asked how to convert 32-bit floating-point values to 16-bit floating point:

Float32 to Float16

So you could simply convert the int to a float and then use the question above to create a 16-bit float. I would suggest this is probably much easier than going from int directly to a 16-bit float. By converting to a 32-bit float you have effectively done most of the hard work, and then you just need to shift a few bits around.

Edit: Looking at Alexey's excellent answer, I think it's highly likely that using a hardware int-to-float conversion and then bit-shifting the result will be a fair bit faster than his method. It might be worth profiling both methods and comparing them.
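
A rough sketch of the "convert to float, then shift bits" idea follows. It is not taken from the linked question. It assumes `float` is IEEE 754 binary32, truncates the mantissa rather than rounding, and uses the hypothetical name `int_to_half_via_float`.

```cpp
#include <cstdint>
#include <cstring>
#include <limits>

static_assert(std::numeric_limits<float>::is_iec559, "assumes IEEE 754 binary32 float");

// Hypothetical sketch: int -> float32 -> binary16 bits (mantissa truncated).
std::uint16_t int_to_half_via_float(int x)
{
    const float f = static_cast<float>(x);  // hardware int-to-float conversion
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);    // reinterpret the float's bit pattern

    const std::uint32_t sign = (bits >> 16) & 0x8000u;  // move sign to bit 15
    const std::uint32_t exp8 = (bits >> 23) & 0xFFu;    // 8-bit biased exponent
    const std::uint32_t man  = (bits >> 13) & 0x3FFu;   // keep top 10 mantissa bits

    // A nonzero integer has magnitude >= 1, so exp8 == 0 only happens for x == 0.
    if (exp8 == 0) return static_cast<std::uint16_t>(sign);

    const int exp5 = static_cast<int>(exp8) - 127 + 15;  // rebias from 127 to 15
    if (exp5 >= 31) return static_cast<std::uint16_t>(sign | 0x7C00u);  // infinity

    return static_cast<std::uint16_t>(sign | (static_cast<std::uint32_t>(exp5) << 10) | man);
}
```

Because this version only truncates, inputs from 65505 to 65535 map to 0x7BFF (65504) and infinity first appears at 65536, which differs from the accepted answer's behavior at the top of the range.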

score 0 · Answer 3 · edited Apr 13 '17 at 12:18

Following @kbok's comment on the question, I used the first part of this answer to get the half float, and then copied it into the array:

uint16_t position16float = float_to_half_branch(data);
memcpy(&dataArray, &position16float, 2);
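
Note that `memcpy` copies the two bytes in the host's byte order. If the consumer of the array expects a fixed order, the bytes can be written explicitly instead. The sketch below assumes little-endian output is wanted; the function name `store_half_le` is hypothetical.

```cpp
#include <cstdint>

// Hypothetical helper: write a 16-bit half pattern as two bytes, low byte first,
// independent of the host's endianness.
void store_half_le(std::uint16_t half, unsigned char out[2])
{
    out[0] = static_cast<unsigned char>(half & 0xFFu);  // low byte
    out[1] = static_cast<unsigned char>(half >> 8);     // high byte
}
```

Swapping the two assignments gives big-endian output instead.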

answered Oct 10 '12 at 11:35 · Ross

score 0 · Answer 4 · answered Dec 02 '14 at 22:34

If you are targeting supported hardware, you can use intrinsics:

https://software.intel.com/en-us/articles/performance-benefits-of-half-precision-floats
https://software.intel.com/sites/products/documentation/doclib/iss/2013/compiler/cpp-lin/GUID-7679FF37-257B-4F90-8668-5B3AA62587AD.htm
