
Can someone explain to me how I convert a 32-bit floating point value to a 16-bit floating point value?

(s = sign, e = exponent and m = mantissa)

If 32-bit float is 1s7e24m
And 16-bit float is 1s5e10m

Then is it as simple as doing this?

int     fltInt32;
short   fltInt16;
memcpy( &fltInt32, &flt, sizeof( float ) );

fltInt16 = (fltInt32 & 0x00FFFFFF) >> 14;
fltInt16 |= ((fltInt32 & 0x7f000000) >> 26) << 10;
fltInt16 |= ((fltInt32 & 0x80000000) >> 16);

I'm assuming it ISN'T that simple ... so can anyone tell me what you DO need to do?

Edit: I can see I've got my exponent shift wrong ... so would THIS be better?

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x7c000000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

I'm hoping this is correct. Apologies if I'm missing something obvious that has been said. It's almost midnight on a Friday night ... so I'm not "entirely" sober ;)

Edit 2: Oops. Buggered it again. I want to lose the top 3 bits, not the lower! So how about this:

fltInt16 =  (fltInt32 & 0x007FFFFF) >> 13;
fltInt16 |= (fltInt32 & 0x0f800000) >> 13;
fltInt16 |= (fltInt32 & 0x80000000) >> 16;

Final code should be:

fltInt16    =  ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
fltInt16    |= ((fltInt32 & 0x80000000) >> 16);
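For reference, that final code can be wrapped into a self-contained sketch (the function name is mine). It assumes the input is a normal number that fits in the float16 range; there is no rounding and no handling of overflow, underflow, denormals, Inf or NaN:

```c
#include <stdint.h>
#include <string.h>

/* Naive float32 -> float16 conversion per the code above: shift the
 * mantissa down 13 bits and rebias the exponent by subtracting
 * (127 - 15) << 23 = 0x38000000 before the shift. Only valid for
 * normal values representable in float16. */
static uint16_t float32_to_float16_naive(float flt)
{
    uint32_t fltInt32;
    uint16_t fltInt16;
    memcpy(&fltInt32, &flt, sizeof flt);   /* reinterpret the bits */

    fltInt16  = ((fltInt32 & 0x7fffffff) >> 13) - (0x38000000 >> 13);
    fltInt16 |=  (fltInt32 & 0x80000000) >> 16;
    return fltInt16;
}
```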
Goz
    I think this was already asked (and answered) here: http://stackoverflow.com/questions/1659440/32-bit-to-16-bit-floating-point-conversion – humbagumba Jun 11 '10 at 21:54
  • it could be that simple, but you lose precision unless the float32 doesn't use all the "precision" it has... basically, you get 5/7 of the bits of exp (taking of course the most significant ones), and 10/24 of the mantissa; these ratios say roughly how much you can lose in the conversion, exactly as happens when you fit a 32-bit integer into a 16-bit integer... the range of representable numbers is smaller; "cutting" the mantissa reduces the "precision", and the exponent also limits the range: 5 signed bits give -16 to +15, against -64/+63 (if I did it right... :D it's late) – ShinTakezou Jun 11 '10 at 21:58
  • @ShinTakezou: Surely it's not possible to lose 16 bits of data and NOT lose precision?? Float16 is far less precise and thus automatically has less precision ... or am I misunderstanding you? – Goz Jun 11 '10 at 22:01
  • you can lose 16 bits and have the "float16" represent the "exact" same number as the float32; you simply have to "choose" the float32 number so that it happens... but normally one can't choose, so most of the time what happens is losing information. Put differently, you can fit any float16 number into a float32 (provided the same conventions) and "return" back again to float16 losing nothing (f16 > f32 won't "invent" "precision", and so f16 > f32 > f16' can be done so that f16' === f16) – ShinTakezou Jun 12 '10 at 13:28

3 Answers


The exponent needs to be unbiased, clamped and rebiased. This is the fast code I use:

unsigned int fltInt32;    /* the bit pattern of the float to convert */
unsigned short fltInt16;

fltInt16 = (fltInt32 >> 31) << 5;
unsigned short tmp = (fltInt32 >> 23) & 0xff;
tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
fltInt16 = (fltInt16 | tmp) << 10;
fltInt16 |= (fltInt32 >> 13) & 0x3ff;

This code will be even faster with a lookup table for the exponent, but I use this one because it is easily adapted to a SIMD workflow.

Limitations of the implementation:

  • Overflowing values that cannot be represented in float16 will give undefined values.
  • Underflowing values will return an undefined value between 2^-15 and 2^-14 instead of zero.
  • Denormals will give undefined values.

Be careful with denormals. If your architecture uses them, they may slow down your program tremendously.
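A minimal sketch wrapping the snippet above in a complete function (the wrapper and its name are mine): it fills the input bits with memcpy and otherwise keeps the code verbatim, so it is subject to the same limitations listed above.

```c
#include <stdint.h>
#include <string.h>

/* Branchless float32 -> float16 conversion: extract the sign, rebias
 * the exponent by -0x70 (127 - 15 = 112) with a mask that zeroes it on
 * underflow, then append the top 10 mantissa bits. */
static uint16_t float32_to_float16_fast(float flt)
{
    uint32_t fltInt32;
    memcpy(&fltInt32, &flt, sizeof flt);   /* reinterpret the bits */

    uint16_t fltInt16 = (fltInt32 >> 31) << 5;
    uint16_t tmp = (fltInt32 >> 23) & 0xff;
    tmp = (tmp - 0x70) & ((unsigned int)((int)(0x70 - tmp) >> 4) >> 27);
    fltInt16 = (fltInt16 | tmp) << 10;
    fltInt16 |= (fltInt32 >> 13) & 0x3ff;
    return fltInt16;
}
```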

sam hocevar

The exponents in your float32 and float16 representations are probably biased, and biased differently. You need to unbias the exponent you got from the float32 representation to get the actual exponent, and then to bias it for the float16 representation.

Apart from this detail, I do think it's as simple as that, but I still get surprised by floating-point representations from time to time.
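To make the bias step concrete, here is a sketch of just the exponent adjustment (the function name is mine): float32 stores the exponent biased by 127, float16 biases it by 15, so converting the stored field means subtracting 127 - 15 = 112 (0x70). Overflow and underflow checks are omitted.

```c
#include <stdint.h>

/* Unbias a float32 exponent field and rebias it for float16. */
static uint32_t rebias_exponent(uint32_t fltInt32)
{
    uint32_t exp32 = (fltInt32 >> 23) & 0xff;  /* exponent biased by 127 */
    int real_exp = (int)exp32 - 127;           /* unbias: actual exponent */
    return (uint32_t)(real_exp + 15);          /* rebias by 15 for float16 */
}
```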

EDIT:

  1. While you're at it, check for overflow when adjusting the exponent.

  2. Your algorithm truncates the last bits of the mantissa rather abruptly. That may be acceptable, but you may want to implement, say, round-to-nearest by looking at the bits that are about to be discarded: "0..." -> round down, "100..001..." -> round up, "100..00" -> round to even.
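That rounding rule can be sketched as follows, applied to the 13 mantissa bits being dropped (the function name is mine; the increment is applied to the whole truncated float16 pattern, so a carry out of the mantissa bumps the exponent naturally):

```c
#include <stdint.h>

/* Round-to-nearest-even on the 13 discarded mantissa bits.
 * 'truncated' is the float16 pattern produced by plain truncation. */
static uint16_t round_mantissa(uint32_t fltInt32, uint16_t truncated)
{
    uint32_t discarded = fltInt32 & 0x1fff;  /* the 13 dropped bits */
    if (discarded > 0x1000)                  /* above halfway: round up */
        return truncated + 1;
    if (discarded == 0x1000)                 /* exactly halfway: round to even */
        return truncated + (truncated & 1);
    return truncated;                        /* below halfway: round down */
}
```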

Pascal Cuoq
  • 32 bit floating point numbers in the IEEE754 Standard have 23 bits of mantissa and 8 bits exponent. – bbudge Jun 11 '10 at 21:57
  • @bbudge ... fair enough, I was trying to do it from memory. I took the wrong bit away, evidently ;) – Goz Jun 11 '10 at 22:02

Here's the link to an article on IEEE754, which gives the bit layouts and biases.

http://en.wikipedia.org/wiki/IEEE_754-2008

bbudge