How to add and subtract 16 bit floating point half precision numbers?

Question

How do I add and subtract 16 bit floating point half precision numbers?

Say I need to add or subtract:

1 10000 0000000000

1 01111 1111100000

2’s complement form.

Please provide more context. C has no such thing as half precision. – Raymond Chen, Oct 02 '11 at 00:14
16-bit precision, but in what format/standard? Is there a sign bit? How many bits for the mantissa and how many for the exponent? – Ernest Staszuk, Oct 02 '11 at 00:16
There's generally no hardware support for half-precision arithmetic, so there's no easy way to do this. The Intel Compiler supports intrinsics for converting half-precision to and from single-precision. – Mysticial, Oct 02 '11 at 00:16
@Ernest Staszuk It looks like the format is sign, (biased?) exponent, mantissa. Other than converting to a float or double and back, you could write all the bitwise logic for adding and subtracting the numbers. – stardt, Oct 02 '11 at 00:19
What platform? Some platforms, e.g. CUDA, have support for 16-bit half precision, but most don't. – Paul R, Oct 02 '11 at 08:34

score 1 · Answer 1 · answered Oct 02 '11 at 00:28

The OpenEXR library defines a half-precision floating point class. It's C++, but the code for casting between native IEEE 754 float and half should be easy to adapt. See Half/half.h as a start.

answered Oct 02 '11 at 00:28

Brett Hale

score 0 · Accepted Answer · answered Oct 02 '11 at 00:42

Assuming you are using a representation similar to that of IEEE single/double precision (normalized, with denormals when E == 0), compute the sign = (-1)^S, the mantissa as 1.M if E != 0 and 0.M if E == 0, and the exponent = E - bias, where bias = 2^(n-1) - 1 for an n-bit exponent field (15 for n = 5; when E == 0 the exponent is 1 - bias). Operate on these natural representations, and convert back to the 16-bit format.

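sign1 = -1, mantissa1 = 1.0, exponent1 = 1

sign2 = -1, mantissa2 = 1.11111, exponent2 = 0

To add, align the exponents first: 1.11111 x 2^0 = 0.111111 x 2^1, and 1.0 + 0.111111 = 1.111111.

sum: sign = -1, mantissa = 1.111111, exponent = 1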

Representation: 1 10000 1111110000

Naturally, this assumes excess encoding of the exponent.
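
The decode, operate, re-encode approach can be sketched in Python. This is only a minimal illustration, assuming the IEEE 754 binary16 layout (1 sign bit, 5 exponent bits with bias 15, 10 mantissa bits); the function names are made up for this example. Decoding is done by hand to mirror the rules above, while encoding leans on the standard library's `struct` format `'e'`, which rounds to the nearest half-precision value.

```python
import struct

EXP_BITS = 5
FRAC_BITS = 10
BIAS = (1 << (EXP_BITS - 1)) - 1  # 15 for a 5-bit exponent field (assumed IEEE binary16)


def decode_half(bits):
    """Turn a 16-bit pattern into a Python float (hypothetical helper)."""
    sign = -1.0 if (bits >> 15) & 1 else 1.0
    e = (bits >> FRAC_BITS) & 0x1F
    m = bits & 0x3FF
    if e == 0x1F:  # all-ones exponent: infinity or NaN
        return float("nan") if m else sign * float("inf")
    if e == 0:  # denormal: mantissa is 0.M, exponent is fixed at 1 - BIAS
        return sign * (m / 1024.0) * 2.0 ** (1 - BIAS)
    return sign * (1.0 + m / 1024.0) * 2.0 ** (e - BIAS)  # normal: mantissa is 1.M


def encode_half(value):
    """Round a Python float to the nearest binary16 bit pattern (hypothetical helper)."""
    # struct's 'e' format is IEEE 754 binary16; it raises OverflowError
    # when a finite value is too large to fit.
    return struct.unpack("<H", struct.pack("<e", value))[0]


def half_add(a, b):
    return encode_half(decode_half(a) + decode_half(b))


def half_sub(a, b):
    return encode_half(decode_half(a) - decode_half(b))


a = 0b1_10000_0000000000  # -2.0
b = 0b1_01111_1111100000  # -1.96875
print(format(half_add(a, b), "016b"))  # 1100001111110000
print(format(half_sub(a, b), "016b"))  # 1010100000000000
```

The sum matches the representation 1 10000 1111110000 worked out by hand above. Doing the arithmetic in double precision is safe here because the sum or difference of any two half-precision values fits exactly in a double, so rounding happens only once, on the way back to 16 bits.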

answered Oct 02 '11 at 00:42

Patrick87

What do you mean by "Naturally, this assumes excess encoding of the exponent"? – Snow_Mac Oct 18 '11 at 03:36
Just that some of the actual math I did only makes sense if you understand the exponent to be in excess encoding. If you aren't using excess encoding to represent the exponent E, then for instance the rule "E != 0" must be changed... excess encoding is just a way to encode exponents so that negative exponents are smaller than positive exponents under unsigned comparisons. Wikipedia might have a good discussion of this... otherwise, I will oblige. – Patrick87 Oct 18 '11 at 16:33
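
A small sketch of what excess encoding does, assuming a 5-bit exponent field with excess-15 (the IEEE 754 binary16 convention; the helper names are hypothetical). The stored field is the true exponent plus the bias, so unsigned comparison of the fields agrees with signed comparison of the exponents, which is not the case for two's complement.

```python
BIAS = 15  # assumed: excess-15, as in IEEE 754 binary16


def to_excess(exponent):
    """Store a signed exponent as an unsigned 5-bit field."""
    return exponent + BIAS


def from_excess(field):
    """Recover the signed exponent from the stored field."""
    return field - BIAS


exponents = list(range(-14, 16))  # exponents of normal half-precision numbers
fields = [to_excess(e) for e in exponents]

# Unsigned order of the stored fields matches the order of the exponents.
assert fields == sorted(fields)

# Two's complement does not have this property: -1 is stored as 0b11111,
# which compares greater than +1 (0b00001) as an unsigned value.
twos = [e & 0x1F for e in exponents]
assert twos != sorted(twos)

print(format(to_excess(1), "05b"))  # 10000
print(format(to_excess(0), "05b"))  # 01111
```

These two printed fields are the exponent bits of the two operands in the question.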
