32-bit to 16-bit Floating Point Conversion

Question

I need a cross-platform library/algorithm that will convert between 32-bit and 16-bit floating point numbers. I don't need to perform math with the 16-bit numbers; I just need to decrease the size of the 32-bit floats so they can be sent over the network. I am working in C++.

I understand how much precision I would be losing, but that's OK for my application.

The IEEE 16-bit format would be great.

Are you sure that you'll be able to measure the performance benefit from this conversion? You will need to be sending a lot of those numbers across the wire to make up a significant saving. You only get about 3 decimal digits of accuracy, and the range is not all that large either. — Jonathan Leffler, Nov 02 '09 at 05:12
OTOH, CPU is essentially free nowadays if you can thread your program, and a transform of an I/O stream is easily threadable. The savings in I/O will be real if the number of floats sent is anywhere near the network capacity. I.e. this is a good bandwidth/latency tradeoff, and as such only relevant when you actually have a bandwitdh problem and no latency issues. — MSalters, Nov 02 '09 at 11:07
@Lazer: No, the smallest size the standard supports is a 32-bit float. — Matt Fichman, Jul 24 '10 at 17:46
@Lazer, I don't think C++ even talks about the number of bits in a float. The specification is quite general. — Richard, Nov 08 '12 at 21:26
@Lazer: No, `FLT_DIG` is the number of digits supported in `float`, and it must be at least 6 which excludes 16 bit floats. Implementations are free to offer `ext::float16` types though. — MSalters, Jul 29 '15 at 13:55

Phernost · Answer 1 · 2019-11-04T08:49:06.533

Complete conversion from single precision to half precision. This is a direct copy from my SSE version, so it's branch-less. It makes use of the fact that -true == ~0 to preform branchless selections (GCC converts if statements into an unholy mess of conditional jumps, while Clang just converts them to conditional moves.)

Update (2019-11-04): reworked to support single and double precision values with fully correct rounding. I also put a corresponding if statement above each branchless select as a comment for clarity. All incoming NaNs are converted to the base quiet NaN for speed and sanity, as there is no way to reliably convert an embedded NaN message between formats.

#include <cstdint> // uint32_t, uint64_t, etc.
#include <cstring> // memcpy
#include <climits> // CHAR_BIT
#include <limits>  // numeric_limits
#include <utility> // is_integral_v, is_floating_point_v, forward

namespace std
{
  template< typename T , typename U >
  T bit_cast( U&& u ) {
    static_assert( sizeof( T ) == sizeof( U ) );
    union { T t; }; // prevent construction
    std::memcpy( &t, &u, sizeof( t ) );
    return t;
  }
} // namespace std

template< typename T > struct native_float_bits;
template<> struct native_float_bits< float >{ using type = std::uint32_t; };
template<> struct native_float_bits< double >{ using type = std::uint64_t; };
template< typename T > using native_float_bits_t = typename native_float_bits< T >::type;

static_assert( sizeof( float ) == sizeof( native_float_bits_t< float > ) );
static_assert( sizeof( double ) == sizeof( native_float_bits_t< double > ) );

template< typename T, int SIG_BITS, int EXP_BITS >
struct raw_float_type_info {
  using raw_type = T;

  static constexpr int sig_bits = SIG_BITS;
  static constexpr int exp_bits = EXP_BITS;
  static constexpr int bits = sig_bits + exp_bits + 1;

  static_assert( std::is_integral_v< raw_type > );
  static_assert( sig_bits >= 0 );
  static_assert( exp_bits >= 0 );
  static_assert( bits <= sizeof( raw_type ) * CHAR_BIT );

  static constexpr int exp_max = ( 1 << exp_bits ) - 1;
  static constexpr int exp_bias = exp_max >> 1;

  static constexpr raw_type sign = raw_type( 1 ) << ( bits - 1 );
  static constexpr raw_type inf = raw_type( exp_max ) << sig_bits;
  static constexpr raw_type qnan = inf | ( inf >> 1 );

  static constexpr auto abs( raw_type v ) { return raw_type( v & ( sign - 1 ) ); }
  static constexpr bool is_nan( raw_type v ) { return abs( v ) > inf; }
  static constexpr bool is_inf( raw_type v ) { return abs( v ) == inf; }
  static constexpr bool is_zero( raw_type v ) { return abs( v ) == 0; }
};
using raw_flt16_type_info = raw_float_type_info< std::uint16_t, 10, 5 >;
using raw_flt32_type_info = raw_float_type_info< std::uint32_t, 23, 8 >;
using raw_flt64_type_info = raw_float_type_info< std::uint64_t, 52, 11 >;
//using raw_flt128_type_info = raw_float_type_info< uint128_t, 112, 15 >;

template< typename T, int SIG_BITS = std::numeric_limits< T >::digits - 1,
  int EXP_BITS = sizeof( T ) * CHAR_BIT - SIG_BITS - 1 >
struct float_type_info 
: raw_float_type_info< native_float_bits_t< T >, SIG_BITS, EXP_BITS > {
  using flt_type = T;
  static_assert( std::is_floating_point_v< flt_type > );
};

template< typename E >
struct raw_float_encoder
{
  using enc = E;
  using enc_type = typename enc::raw_type;

  template< bool DO_ROUNDING, typename F >
  static auto encode( F value )
  {
    using flt = float_type_info< F >;
    using raw_type = typename flt::raw_type;
    static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
    static constexpr auto bit_diff = flt::bits - enc::bits;
    static constexpr auto do_rounding = DO_ROUNDING && sig_diff > 0;
    static constexpr auto bias_mul = raw_type( enc::exp_bias ) << flt::sig_bits;
    if constexpr( !do_rounding ) { // fix exp bias
      // when not rounding, fix exp first to avoid mixing float and binary ops
      value *= std::bit_cast< F >( bias_mul );
    }
    auto bits = std::bit_cast< raw_type >( value );
    auto sign = bits & flt::sign; // save sign
    bits ^= sign; // clear sign
    auto is_nan = flt::inf < bits; // compare before rounding!!
    if constexpr( do_rounding ) {
      static constexpr auto min_norm = raw_type( flt::exp_bias - enc::exp_bias + 1 ) << flt::sig_bits;
      static constexpr auto sub_rnd = enc::exp_bias < sig_diff
        ? raw_type( 1 ) << ( flt::sig_bits - 1 + enc::exp_bias - sig_diff )
        : raw_type( enc::exp_bias - sig_diff ) << flt::sig_bits;
      static constexpr auto sub_mul = raw_type( flt::exp_bias + sig_diff ) << flt::sig_bits;
      bool is_sub = bits < min_norm;
      auto norm = std::bit_cast< F >( bits );
      auto subn = norm;
      subn *= std::bit_cast< F >( sub_rnd ); // round subnormals
      subn *= std::bit_cast< F >( sub_mul ); // correct subnormal exp
      norm *= std::bit_cast< F >( bias_mul ); // fix exp bias
      bits = std::bit_cast< raw_type >( norm );
      bits += ( bits >> sig_diff ) & 1; // add tie breaking bias
      bits += ( raw_type( 1 ) << ( sig_diff - 1 ) ) - 1; // round up to half
      //if( is_sub ) bits = std::bit_cast< raw_type >( subn );
      bits ^= -is_sub & ( std::bit_cast< raw_type >( subn ) ^ bits );
    }
    bits >>= sig_diff; // truncate
    //if( enc::inf < bits ) bits = enc::inf; // fix overflow
    bits ^= -( enc::inf < bits ) & ( enc::inf ^ bits );
    //if( is_nan ) bits = enc::qnan;
    bits ^= -is_nan & ( enc::qnan ^ bits );
    bits |= sign >> bit_diff; // restore sign
    return enc_type( bits );
  }

  template< typename F >
  static F decode( enc_type value )
  {
    using flt = float_type_info< F >;
    using raw_type = typename flt::raw_type;
    static constexpr auto sig_diff = flt::sig_bits - enc::sig_bits;
    static constexpr auto bit_diff = flt::bits - enc::bits;
    static constexpr auto bias_mul = raw_type( 2 * flt::exp_bias - enc::exp_bias ) << flt::sig_bits;
    raw_type bits = value;
    auto sign = bits & enc::sign; // save sign
    bits ^= sign; // clear sign
    auto is_norm = bits < enc::inf;
    bits = ( sign << bit_diff ) | ( bits << sig_diff );
    auto val = std::bit_cast< F >( bits ) * std::bit_cast< F >( bias_mul );
    bits = std::bit_cast< raw_type >( val );
    //if( !is_norm ) bits |= flt::inf;
    bits |= -!is_norm & flt::inf;
    return std::bit_cast< F >( bits );
  }
};

using flt16_encoder = raw_float_encoder< raw_flt16_type_info >;

template< typename F >
auto quick_encode_flt16( F && value )
{ return flt16_encoder::encode< false >( std::forward< F >( value ) ); }

template< typename F >
auto encode_flt16( F && value )
{ return flt16_encoder::encode< true >( std::forward< F >( value ) ); }

template< typename F = float, typename X >
auto decode_flt16( X && value )
{ return flt16_encoder::decode< F >( std::forward< X >( value ) ); }

Of course full IEEE support isn't always needed. If your values don't require logarithmic resolution approaching zero, then linearizing them to a fixed point format is much faster, as was already mentioned.

At the beginning you write that it relies on GCC's `(-true == ~0)`. I want to use your code snippet in Visual Studio 2012, do you have an input+expected output pair that could tell me whether my compiler does the right thing? It does seem to convert forth and back without issues and aforementioned expression holds true. — Cygon, Jun 19 '13 at 09:49
The Unlicense (http://choosealicense.com/licenses/unlicense/) which is public domain. — Phernost, May 27 '16 at 00:41
@Cygon `-true == ~0` is always guaranteed by the standard as long as you convert the `bool` to an *unsigned* integer type before the `-`, because unsigned integers are guaranteed to take negative values modulo 2^n (i.e. practically twos-complement representation of negative values). So `-static_cast(true)` is the same as `0xFFFFFFFF` or `~static_cast(0)` **by standard**. It *should* also work on nearly any practical system for signed types (because they're usually twos-complement anyway), but that's theoretically implementation-defined. But "unsigned negatives" always work. — Christian Rau, Mar 14 '17 at 14:50
@Cygon And to answer your specific question, all versions of Visual Studio use twos-complement for signed types (not just from experience, it's explicitly documented by the compiler). So yes, even the implementation-defined version with signed integer types will always work in Visual Studio. — Christian Rau, Mar 14 '17 at 14:57
This code does not seem to be rounding to nearest even, which is the default IEEE rounding mode. Therefore it is not IEEE compliant. — hpd, Sep 16 '19 at 05:26
It's been fixed. Rounding is optional, since it only affects the last digit of precision at a cost of triple the ops. — Phernost, Nov 04 '19 at 08:55
You can do `~true + 1 == ~0` for it to work on non-two's complement (not that it matters anyway with C++20, but it shouldn't incur a performance penalty if the optimizer is smart). — S.S. Anne, Feb 06 '20 at 18:28
Does this also work for converting to larger floating point types, as currently the `_diff` variables can become negative? — Matthias, Apr 05 '20 at 16:31
Yes, it's possible to go larger. `encode` is a just compression function, so it expects both the input type to be larger than the output type. `decode` is what you want if you're converting to larger types (16 to 32, 32 to 128, etc.). I originally had many more `static_assert`s to prevent bad types, but I removed most of them to make it more readable ... I guess I removed too many. It's possible to combine both functions into a smart conversion function with a bunch of `if constexpr`s that act based on the input and output types' individual `sig_bits`, `exp_bits`, and `exp_bias`. — Phernost, Aug 06 '20 at 12:28

score 21 · Answer 2 · answered Nov 06 '14 at 12:09

21

Half to float:
float f = ((h&0x8000)<<16) | (((h&0x7c00)+0x1C000)<<13) | ((h&0x03FF)<<13);

Float to half:
uint32_t x = *((uint32_t*)&f);
uint16_t h = ((x>>16)&0x8000)|((((x&0x7f800000)-0x38000000)>>13)&0x7c00)|((x>>13)&0x03ff);

answered Nov 06 '14 at 12:09

user2459387

227
2
3

7

But of course keep in mind that this currently ignores any kind of overflow, underflow, denormalized values, or infinite values. – Christian Rau Nov 06 '14 at 13:55
1

This does not work for 0. – tommsch Apr 28 '22 at 13:33

score 20 · Accepted Answer · edited Aug 28 '13 at 08:12

20

std::frexp extracts the significand and exponent from normal floats or doubles -- then you need to decide what to do with exponents that are too large to fit in a half-precision float (saturate...?), adjust accordingly, and put the half-precision number together. This article has C source code to show you how to perform the conversion.

edited Aug 28 '13 at 08:12

answered Nov 02 '09 at 04:55

Alex Martelli

854,459
170
1,222
1,395

Actually, the values I'm sending have very limited range: (-1000, 1000) so the exponent isn't that big of an issue. – Matt Fichman Nov 02 '09 at 05:04
@Matt, if you **know** the exponent will never under/over flow, then your job's easier by that much!-) – Alex Martelli Nov 02 '09 at 05:41
@Alex, indeed, it does make it easier! Thanks. – Matt Fichman Nov 03 '09 at 01:37

ProjectPhysX · Answer 4 · 2022-08-02T19:32:12.893

Why so over-complicated? My implementation does not need any additional library, complies with the IEEE-754 FP16 format, manages both normalized and denormalized numbers, is branch-less, takes about 40-ish clock cycles for a back and forth conversion and ditches NaN or Inf for an extended range. That's the magical power of bit operations.

typedef unsigned short ushort;
typedef unsigned int uint;

uint as_uint(const float x) {
    return *(uint*)&x;
}
float as_float(const uint x) {
    return *(float*)&x;
}

float half_to_float(const ushort x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
    const uint e = (x&0x7C00)>>10; // exponent
    const uint m = (x&0x03FF)<<13; // mantissa
    const uint v = as_uint((float)m)>>23; // evil log2 bit hack to count leading zeros in denormalized format
    return as_float((x&0x8000)<<16 | (e!=0)*((e+112)<<23|m) | ((e==0)&(m!=0))*((v-37)<<23|((m<<(150-v))&0x007FE000))); // sign : normalized : denormalized
}
ushort float_to_half(const float x) { // IEEE-754 16-bit floating-point format (without infinity): 1-5-10, exp-15, +-131008.0, +-6.1035156E-5, +-5.9604645E-8, 3.311 digits
    const uint b = as_uint(x)+0x00001000; // round-to-nearest-even: add last bit after truncated mantissa
    const uint e = (b&0x7F800000)>>23; // exponent
    const uint m = b&0x007FFFFF; // mantissa; in line below: 0x007FF000 = 0x00800000-0x00001000 = decimal indicator flag - initial rounding
    return (b&0x80000000)>>16 | (e>112)*((((e-112)<<10)&0x7C00)|m>>13) | ((e<113)&(e>101))*((((0x007FF000+m)>>(125-e))+1)>>1) | (e>143)*0x7FFF; // sign : normalized : denormalized : saturate
}

Example for how to use it and to check that the conversion is correct:

#include <iostream>

void print_bits(const ushort x) {
    for(int i=15; i>=0; i--) {
        cout << ((x>>i)&1);
        if(i==15||i==10) cout << " ";
        if(i==10) cout << "      ";
    }
    cout << endl;
}
void print_bits(const float x) {
    uint b = *(uint*)&x;
    for(int i=31; i>=0; i--) {
        cout << ((b>>i)&1);
        if(i==31||i==23) cout << " ";
        if(i==23) cout << "   ";
    }
    cout << endl;
}

int main() {
    const float x = 1.0f;
    const ushort x_compressed = float_to_half(x);
    const float x_decompressed = half_to_float(x_compressed);
    print_bits(x);
    print_bits(x_compressed);
    print_bits(x_decompressed);
    return 0;
}

Output:

0 01111111    00000000000000000000000
0 01111       0000000000
0 01111111    00000000000000000000000

I have published an adapted version of this FP32<->FP16 conversion algorithm in this paper with detailed description on how the bit manipulation magic works. In this paper I also provide several ultra-fast conversion algorithms for different 16-bit Posit formats.

One question, though: What does `as_uint((float)m)` do? Isn't it a NO-OP? I mean, I wonder why you don't write the line for the "bit hack" like this instead: `const uint v = m>>23;` — cesss, Oct 30 '21 at 12:54
@cesss this casts the integer m to float and then extracts the exponent bits from this float. The cast implicitly does a log2 to compute the exponent, and this is what I leverage to count the leading zeros. Note that float cast ( (float)m ) and reinterpteting bits as integer ( as_uint ) are very different things: the cast changes the bits (but not the represented number apart from rounding) and the reinterpreting does not change the bits (but the represented number is completely different). — ProjectPhysX, Oct 31 '21 at 07:27
Thanks, @ProjectPhysX, with the hurry I didn't realize you weren't casting to integer. BTW, I tend to believe this is UB, because it's type-punning without a union. — cesss, Oct 31 '21 at 07:58
The sanitizer said (125-e) is underflowing when input some numbers. — Zz Tux, Nov 10 '21 at 13:21
@Zz Tux in this case the denormalized part is discarded anyways by multiplying with `e>101`. — ProjectPhysX, Nov 11 '21 at 07:10
@ProjectPhysX But when (125-e) underflowing, the shift exponent could become a very large number, which is an undefined behaviour as the c++ standard. — Zz Tux, Nov 11 '21 at 08:08
in c++20, `std::bit_cast` can replace the `as_uint` and `as_float` functions — Tom Huntington, Jan 26 '23 at 21:32

score 18 · Answer 5 · answered Nov 02 '09 at 05:35

18

Given your needs (-1000, 1000), perhaps it would be better to use a fixed-point representation.

//change to 20000 to SHORT_MAX if you don't mind whole numbers
//being turned into fractional ones
const int compact_range = 20000;

short compactFloat(double input) {
    return round(input * compact_range / 1000);
}
double expandToFloat(short input) {
    return ((double)input) * 1000 / compact_range;
}

This will give you accuracy to the nearest 0.05. If you change 20000 to SHORT_MAX you'll get a bit more accuracy but some whole numbers will end up as decimals on the other end.

answered Nov 02 '09 at 05:35

Artelius

48,337
13
89
105

+1 This will get you *much more* accuracy than a 16 bit float in almost every case, and with less math and no special cases. A 16-bit IEEE float will only have 10 bits of accuracy and crams half of its possible values in the range (-1, 1) – Drew Dormann Nov 02 '09 at 08:31
13

It depends on the distribution in the range [-1000, 1000]. If most numbers are in fact in the range [-1,1], then the accuracy of 16 bits floats is on average better. – MSalters Nov 02 '09 at 10:13
This would be better with SHORT_MAX and 1024 as the scale factor, giving a 10.6bit fixed point representation and allintegers would be exactly representable. The precision would be 1/2^6 = 0.015625, which is far better than 0.05, and the power-of-two scale factor is easy to optimise to a bit-shift (the compiler is likely to do it for you). – Clifford Nov 02 '09 at 20:59
Sorry that should have been 11.5 (forgot the sign bit!). Then the precision is 1/2^5 = 0.0325; still not bad for something that will also perform better. – Clifford Nov 02 '09 at 21:01
@Clifford: Totally right. I have no idea why I didn't think of the 1024 thing. – Artelius Nov 02 '09 at 22:03
@Shmoopty: Most of my values will be in the [-1,1] range, as they are normalized quaternions & vectors. Some values are in the (-1000,1000) range because they are position vectors, but in this case precision is not as necessary. – Matt Fichman Nov 03 '09 at 01:39
2

@Matt, is it possible to send your normalised values using a different format to the position vectors? Consider using an appropriate fixed-point scheme for each of them. – Artelius Nov 03 '09 at 06:06

score 5 · Answer 6 · answered Nov 02 '09 at 10:01

5

If you're sending a stream of information across, you could probably do better than this, especially if everything is in a consistent range, as your application seems to have.

Send a small header, that just consists of a float32 minimum and maximum, then you can send across your information as a 16 bit interpolation value between the two. As you also say that precision isn't much of an issue, you could even send 8bits at a time.

Your value would be something like, at reconstruction time:

float t = _t / numeric_limits<unsigned short>::max();  // With casting, naturally ;)
float val = h.min + t * (h.max - h.min);

Hope that helps.

-Tom

answered Nov 02 '09 at 10:01

tsalter

287
1
4

This is a great solution, especially for normalized vector/quaternion values which you know will always be in the range (-1, 1). – Matt Fichman Nov 03 '09 at 02:00
the problem with using interpolation instead of just scaling, is that zero is not represented exactly and some systems are sensitive to that such as 4x4 matrix math. for example, say (min,max-min) is (-11.356439590454102, 23.32344913482666), then the closest you can get to zero is 0.00010671140473306195. – milkplus Jan 05 '11 at 20:22
Thanks, just used this approach to optimize the size of my save games. Used value "0" to store exact 0.0000. – Andreas Aug 26 '11 at 23:23

awdz9nld · Answer 7 · 2013-07-09T09:19:41.397

This conversion for 16-to-32-bit floating point is quite fast for cases where you do not have to account for infinities or NaNs, and can accept denormals-as-zero (DAZ). I.e. it is suitable for performance-sensitive calculations, but you should beware of division by zero if you expect to encounter denormals.

Note that this is most suitable for x86 or other platforms that have conditional moves or "set if" equivalents.

Strip the sign bit off the input
Align the most significant bit of the mantissa to the 22nd bit
Adjust the exponent bias
Set bits to all-zero if the input exponent is zero
Re-insert sign bit

The reverse applies for single-to-half-precision, with some additions.

void float32(float* __restrict out, const uint16_t in) {
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = in & 0x7fff;                       // Non-sign bits
    t2 = in & 0x8000;                       // Sign bit
    t3 = in & 0x7c00;                       // Exponent

    t1 <<= 13;                              // Align mantissa on MSB
    t2 <<= 16;                              // Shift sign bit into position

    t1 += 0x38000000;                       // Adjust bias

    t1 = (t3 == 0 ? 0 : t1);                // Denormals-as-zero

    t1 |= t2;                               // Re-insert sign bit

    *((uint32_t*)out) = t1;
};

void float16(uint16_t* __restrict out, const float in) {
    uint32_t inu = *((uint32_t*)&in);
    uint32_t t1;
    uint32_t t2;
    uint32_t t3;

    t1 = inu & 0x7fffffff;                 // Non-sign bits
    t2 = inu & 0x80000000;                 // Sign bit
    t3 = inu & 0x7f800000;                 // Exponent

    t1 >>= 13;                             // Align mantissa on MSB
    t2 >>= 16;                             // Shift sign bit into position

    t1 -= 0x1c000;                         // Adjust bias

    t1 = (t3 > 0x38800000) ? 0 : t1;       // Flush-to-zero
    t1 = (t3 < 0x8e000000) ? 0x7bff : t1;  // Clamp-to-max
    t1 = (t3 == 0 ? 0 : t1);               // Denormals-as-zero

    t1 |= t2;                              // Re-insert sign bit

    *((uint16_t*)out) = t1;
};

Note that you can change the constant 0x7bff to 0x7c00 for it to overflow to infinity.

See GitHub for source code.

You probably meant `0x80000000` instead of `0x7FFFFFFF` as otherwise you would be doing an abs istead of zeroing. The last operation could also be written as: `t1 &= 0x80000000 | (static_cast(t3==0)-1)`. Though it probably depends on the platform (its sensitivity to branch-prediction failures, presence of conditional assignment instruction, ...) and the compiler (its ability to generate appropriate code for the platform itself) which one is better. Your version might look nicer and clearer to someone not that deeply acquainted with binary operations and *C++*'s type rules. — Christian Rau, Feb 27 '13 at 17:33
Thanks for spotting that, I've incorporated your comments into the answer. — awdz9nld, Feb 27 '13 at 18:06
In float16, the Clamp-to-max test is clearly wrong, it is always triggered. The flush-to-zero test has the comparison sign the wrong way. I *think* the two tests should be: `t1 = (t3 < 0x38800000) ? 0 : t1;` and `t1 = (t3 > 0x47000000) ? 0x7bff : t1;` — Frepa, Jul 22 '15 at 12:49
Then the denormals-as-zero test is redundant, as Flush-to-zero will catch this case too. — Frepa, Jul 22 '15 at 13:06

score 4 · Answer 8 · answered Mar 25 '15 at 18:21

Most of the approaches described in the other answers here either do not round correctly on conversion from float to half, throw away subnormals which is a problem since 2**-14 becomes your smallest non-zero number, or do unfortunate things with Inf / NaN. Inf is also a problem because the largest finite number in half is a bit less than 2^16. OpenEXR was unnecessarily slow and complicated, last I looked at it. A fast correct approach will use the FPU to do the conversion, either as a direct instruction, or using the FPU rounding hardware to make the right thing happen. Any half to float conversion should be no slower than a 2^16 element lookup table.

The following are hard to beat:

On OS X / iOS, you can use vImageConvert_PlanarFtoPlanar16F and vImageConvert_Planar16FtoPlanarF. See Accelerate.framework.

Intel ivybridge added SSE instructions for this. See f16cintrin.h. Similar instructions were added to the ARM ISA for Neon. See vcvt_f32_f16 and vcvt_f16_f32 in arm_neon.h. On iOS you will need to use the arm64 or armv7s arch to get access to them.

Ondřej Čertík · Answer 9 · 2018-11-13T21:26:01.273

4

This code converts a 32-bit floating point number to 16-bits and back.

#include <x86intrin.h>
#include <iostream>

int main()
{
    float f32;
    unsigned short f16;
    f32 = 3.14159265358979323846;
    f16 = _cvtss_sh(f32, 0);
    std::cout << f32 << std::endl;
    f32 = _cvtsh_ss(f16);
    std::cout << f32 << std::endl;
    return 0;
}

I tested with the Intel icpc 16.0.2:

$ icpc a.cpp

g++ 7.3.0:

$ g++ -march=native a.cpp

and clang++ 6.0.0:

$ clang++ -march=native a.cpp

It prints:

$ ./a.out
3.14159
3.14062

Documentation about these intrinsics is available at:

https://software.intel.com/en-us/node/524287

https://clang.llvm.org/doxygen/f16cintrin_8h.html

edited Nov 13 '18 at 21:26

answered Mar 28 '17 at 23:32

Ondřej Čertík

780
8
18

For those frustrated by this not compiling: Try the compiler flag `-march=native`. – user14717 Nov 09 '18 at 20:18
Thanks @user14717, I added exact instructions to compile this with Intel, GCC and Clang. – Ondřej Čertík Nov 13 '18 at 21:26

score 4 · Answer 10 · answered Feb 23 '12 at 17:08

4

This question is already a bit old, but for the sake of completeness, you might also take a look at this paper for half-to-float and float-to-half conversion.

They use a branchless table-driven approach with relatively small look-up tables. It is completely IEEE-conformant and even beats Phernost's IEEE-conformant branchless conversion routines in performance (at least on my machine). But of course his code is much better suited to SSE and is not that prone to memory latency effects.

answered Feb 23 '12 at 17:08

Christian Rau

45,360
10
108
185

1

+1 This paper is very good. Note that it is not *completely* IEEE-conformant in the way it handles NaN. IEEE says that a number is NaN only if at least one of the mantissa bits is set. As the provided code ignores lower order bits, some 32-bit NaNs are wrongly converted to Inf. Unlikely to happen, though. – sam hocevar Jun 22 '12 at 11:33

eestrada · Answer 11 · 2012-12-12T02:05:05.443

1

The question is old and has already been answered, but I figured it would be worth mentioning an open source C++ library that can create 16bit IEEE compliant half precision floats and has a class that acts pretty much identically to the built in float type, but with 16 bits instead of 32. It is the "half" class of the OpenEXR library. The code is under a permissive BSD style license. I don't believe it has any dependencies outside of the standard library.

edited Dec 12 '12 at 02:05

answered Dec 12 '12 at 01:52

eestrada

1,575
14
24

1

While we're talking about open source C++ libraries providing IEEE-conformant half-precision types that act like the builtin floating point types as much as possible, take a look at the [*half* library](http://half.sourceforge.net/) (disclaimer: it's from me). – Christian Rau Dec 12 '12 at 12:39

score 1 · Answer 12 · answered Sep 09 '14 at 04:50

I had this same exact problem, and found this link very helpful. Just import the file "ieeehalfprecision.c" into your project and use it like this :

float myFloat = 1.24;
uint16_t resultInHalf;
singles2halfp(&resultInHalf, &myFloat, 1); // it accepts a series of floats, so use 1 to input 1 float

// an example to revert the half float back
float resultInSingle;
halfp2singles(&resultInSingle, &resultInHalf, 1);

I also change some code (See the comment by the author (James Tursa) in the link) :

#define INT16_TYPE int16_t 
#define UINT16_TYPE uint16_t 
#define INT32_TYPE int32_t 
#define UINT32_TYPE uint32_t

ErmIg · Answer 13 · 2017-05-12T11:16:16.300

I have found an implementation of conversion from half-float to single-float format and back with using of AVX2. There are much more faster than software implementation of these algorithms. I hope it will be useful.

32-bit float to 16-bit float conversion:

#include <immintrin.h"

inline void Float32ToFloat16(const float * src, uint16_t * dst)
{
    _mm_storeu_si128((__m128i*)dst, _mm256_cvtps_ph(_mm256_loadu_ps(src), 0));
}

void Float32ToFloat16(const float * src, size_t size, uint16_t * dst)
{
    assert(size >= 8);

    size_t fullAlignedSize = size&~(32-1);
    size_t partialAlignedSize = size&~(8-1);

    size_t i = 0;
    for (; i < fullAlignedSize; i += 32)
    {
        Float32ToFloat16(src + i + 0, dst + i + 0);
        Float32ToFloat16(src + i + 8, dst + i + 8);
        Float32ToFloat16(src + i + 16, dst + i + 16);
        Float32ToFloat16(src + i + 24, dst + i + 24);
    }
    for (; i < partialAlignedSize; i += 8)
        Float32ToFloat16(src + i, dst + i);
    if(partialAlignedSize != size)
        Float32ToFloat16(src + size - 8, dst + size - 8);
}

16-bit float to 32-bit float conversion:

#include <immintrin.h"

inline void Float16ToFloat32(const uint16_t * src, float * dst)
{
    _mm256_storeu_ps(dst, _mm256_cvtph_ps(_mm_loadu_si128((__m128i*)src)));
}

void Float16ToFloat32(const uint16_t * src, size_t size, float * dst)
{
    assert(size >= 8);

    size_t fullAlignedSize = size&~(32-1);
    size_t partialAlignedSize = size&~(8-1);

    size_t i = 0;
    for (; i < fullAlignedSize; i += 32)
    {
        Float16ToFloat32<align>(src + i + 0, dst + i + 0);
        Float16ToFloat32<align>(src + i + 8, dst + i + 8);
        Float16ToFloat32<align>(src + i + 16, dst + i + 16);
        Float16ToFloat32<align>(src + i + 24, dst + i + 24);
    }
    for (; i < partialAlignedSize; i += 8)
        Float16ToFloat32<align>(src + i, dst + i);
    if (partialAlignedSize != size)
        Float16ToFloat32<false>(src + size - 8, dst + size - 8);
}

score -1 · Answer 14 · answered Feb 17 '22 at 17:50

Thanks Code for decimal to single precision

We actually can try to edit the same code to half precision , however it is not possible with gcc C compiler , so do the following

sudo apt install clang

Then try the following code

// A C code to convert Decimal value to IEEE 16-bit floating point Half precision

#include <stdio.h>

void printBinary(int n, int i)
{
 

    int k;
    for (k = i - 1; k >= 0; k--) {
 
        if ((n >> k) & 1)
            printf("1");
        else
            printf("0");
    }
}
 
typedef union {
    
    __fp16 f;
    struct
    {
        unsigned int mantissa : 10;
        unsigned int exponent : 5;
        unsigned int sign : 1;
 
    } raw;
} myfloat;
 

// Driver Code
int main()
{
    myfloat var;
    var.f = 11;
    printf("%d | ", var.raw.sign);
    printBinary(var.raw.exponent, 5);
    printf(" | ");
    printBinary(var.raw.mantissa, 10);
    printf("\n");
    return 0;
}

Compile the code in your terminal

clang code_name.c -o code_name
./code_name

Here

__fp16

is a 2 byte float data type supported in clang C compiler

32-bit to 16-bit Floating Point Conversion

14 Answers14

Linked

Related