49

Assuming I am really pressed for memory and want a smaller range (similar to short vs. int): shader languages already support half for a floating-point type with half the precision (not just converting back and forth so that the value is between -1 and 1, i.e. returning a float like this: shortComingIn / maxRangeOfShort). Is there an implementation that already exists for a 2-byte float?

I am also interested to know any (historical?) reasons as to why there is no 2-byte float.

Samaursa
  • It's called half-precision floating point in IEEE lingo, and implementations exist, just not in the C standard primitives (which C++ uses by extension). The C standard only dictates single-precision, double-precision, and long double floating point (which could be 80-bit or 128-bit). – wkl Apr 23 '11 at 21:01
  • 4
    A question should be exactly that: **A** question. If you want references to implementations of `half` for C++, that's a question. If you're interested in historical reasons that `float` is a four-byte entity, that's a *different* question. – T.J. Crowder Apr 23 '11 at 21:01
  • @Crowder: I'll take that into account next time (and will quote you if you don't mind). I was recently in a debate with somebody on one of my questions with that exact problem but me being on the other end (they said it was a duplicate while I thought it was a different question) so with that in the back of my mind, I asked it in the same question. – Samaursa Apr 23 '11 at 21:04
  • 1
    You can use `half` C++ library http://half.sourceforge.net/ – KindDragon Feb 21 '18 at 08:12
  • 1
    Half-precision floating point has now been in the IEEE spec for ten years. Does anyone know why it's still not a built-in type in C++? – All The Rage Nov 12 '18 at 17:16
  • @AlltheRage did you even read the answers? Even a 4-byte float isn't enough, and most languages make floating-point literals double by default. A 2-byte float is so severely limited in normal arithmetic that it's only used in cases where there's a huge array of values that don't need high precision. In that case you'd be better off with SIMD or special routines instead of a scalar type in C++ – phuclv Mar 11 '20 at 02:01
  • 3
    No need to be insolent, bro. The world’s fastest processors have hardware support for half precision. It’s used all the time in machine learning, graphics, and video games. The film industry uses it extensively for rendering. But if it’s people who don’t understand the use cases who are defining the languages I guess that would answer my question. – All The Rage Mar 12 '20 at 03:44
  • @AlltheRage those use cases are very new whereas C++ was invented decades ago. Besides, as I said, half float is only used when processing in batches (either in machine learning or graphics) and never used alone in expressions, so the only reasonable appearance of it in a high-level programming language is in a vector of them – phuclv Mar 12 '20 at 13:44

9 Answers

22

TL;DR: 16-bit floats do exist and there are various software as well as hardware implementations

There are currently 2 common standard 16-bit float formats: IEEE-754 binary16 and Google's bfloat16. Since they're standardized, obviously anyone who knows the spec can write an implementation. Some examples:

Or if you don't want to use them, you can also design a different 16-bit float format and implement it


2-byte floats are generally not used for arithmetic, because even float's precision is not enough for normal operations and double should always be used by default unless you're limited by bandwidth or cache size. Floating-point literals are also double when used without a suffix in C and C-like languages. See

However, less-than-32-bit floats do exist. They're mainly used for storage, as in graphics where 96 bits per pixel (32 bits per channel × 3 channels) is far too wasteful, and are converted to a normal 32-bit float for calculations (except on some special hardware). Various 10-, 11-, and 14-bit float types exist in OpenGL. Many HDR formats use a 16-bit float for each channel, and Direct3D 9.0 as well as some GPUs like the Radeon R300 and R420 have a 24-bit float format. A 24-bit float is also supported by compilers in some 8-bit microcontrollers like PIC, where 32-bit float support is too costly. 8-bit or narrower float types are less useful but, due to their simplicity, they're often taught in computer science curricula. Besides that, a small float format is used in ARM's instruction encoding for small floating-point immediates.

The IEEE 754-2008 revision officially added a 16-bit float format, a.k.a. binary16 or half precision, with a 5-bit exponent and a 10-bit stored mantissa (11 bits of significand precision including the implicit leading bit).
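For illustration, unpacking that layout in software looks roughly like this (a minimal sketch with a hypothetical helper name; it handles normals, subnormals and zero, but not the exponent-31 inf/NaN encodings):

#include <cstdint>
#include <cmath>

// Decode an IEEE-754 binary16 bit pattern (1 sign, 5 exponent, 10 mantissa bits)
// into a 32-bit float.
float half_to_float(uint16_t h)
{
    int sign = (h >> 15) & 0x1;
    int exp  = (h >> 10) & 0x1F;
    int frac =  h        & 0x3FF;

    float value;
    if (exp == 0)    // subnormal or zero: frac * 2^-24
        value = std::ldexp((float)frac, -24);
    else             // normal: (1 + frac/1024) * 2^(exp - 15)
        value = std::ldexp(1.0f + frac / 1024.0f, exp - 15);

    return sign ? -value : value;
}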

Some compilers have support for IEEE-754 binary16, but mainly for conversions or vectorized operations and not for scalar computation (because it's not precise enough). For example, ARM's toolchain has __fp16, which can be chosen between 2 variants, IEEE and alternative, depending on whether you want more range or NaN/inf representations. GCC and Clang also support __fp16 along with the standardized name _Float16. See How to enable __fp16 type on gcc for x86_64
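A minimal usage sketch, assuming a GCC or Clang build where _Float16 is actually enabled for the target (e.g. AArch64 by default, or x86 with suitable flags); this is not portable ISO C++:

#include <cstdio>

int main()
{
    _Float16 a = (_Float16)1.5f;
    _Float16 b = (_Float16)0.25f;
    _Float16 c = a + b;          // arithmetic result stays _Float16
    float    f = a;              // widens to float implicitly
    std::printf("%f %f\n", (double)c, (double)f);   // prints 1.750000 1.500000
    return 0;
}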

Recently, due to the rise of AI, another format called bfloat16 (brain floating-point format), which simply keeps the top 16 bits of an IEEE-754 binary32, has become common.
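Because bfloat16 is just the upper half of a binary32, a quick-and-dirty software conversion is trivial. A minimal sketch with hypothetical helper names (plain truncation, i.e. rounding toward zero; real implementations usually round to nearest even):

#include <cstdint>
#include <cstring>

uint16_t float_to_bf16(float f)
{
    uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // bit-copy the float safely
    return (uint16_t)(bits >> 16);         // keep sign, 8-bit exponent, top 7 mantissa bits
}

float bf16_to_float(uint16_t b)
{
    uint32_t bits = (uint32_t)b << 16;     // the dropped low mantissa bits become zero
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}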

The motivation behind the reduced mantissa comes from Google's experiments, which showed that it's fine to reduce the mantissa so long as it's still possible to represent tiny values close to zero as part of the summation of small differences during training. A smaller mantissa also brings a number of other advantages, such as reducing multiplier power and physical silicon area.

Roughly, hardware multiplier area grows with the square of the significand width:

  • float32: 24² = 576 (100%)
  • float16: 11² = 121 (21%)
  • bfloat16: 8² = 64 (11%)

Many compilers like GCC and ICC have now also gained the ability to support bfloat16.

More information about bfloat16:

In cases where bfloat16 is not enough, there's also the rise of a new 19-bit format called TensorFloat (NVIDIA's TF32).

phuclv
  • "GCC and Clang also support __fp16 along with the standardized name _Float16" - _Float16 doesn't seem to be supported in GCC. GCC half page doesn't mention this name, and the only answer in linked question claims they didn't find the way to enable it. – S. Kaczor Oct 19 '20 at 21:06
  • @S.Kaczor `_Float16` appears in all those pages: [*"It is recommended that portable code use the `_Float16` type defined by ISO/IEC TS 18661-3:2015"*](https://gcc.gnu.org/onlinedocs/gcc/Half-Precision.html), [*Clang supports three half-precision (16-bit) floating point types: `__fp16`, `_Float16` and `__bf16`. These types are supported in all language modes.*](https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point) – phuclv Oct 19 '20 at 23:35
  • Many other compilers like [armcc](https://developer.arm.com/documentation/100067/0610/Other-Compiler-specific-Features/Library-support-for--Float16-data-type) or [Keil](https://www.keil.com/support/man/docs/armclang_ref/armclang_ref_sex1519040854421.htm) also support that keyword. But `_Float16` isn't available on all targets: [*The `_Float16` type is supported on AArch64 systems by default, and on ARM systems when the IEEE format for 16-bit floating-point types is selected with `-mfp16-format=ieee`*](https://gcc.gnu.org/onlinedocs/gcc/Floating-Types.html) – phuclv Oct 19 '20 at 23:35
  • MSVC supports `HALF` via DirectX: https://learn.microsoft.com/en-us/windows/win32/dxmath/half-data-type – Matt Eding Aug 24 '21 at 17:38
  • On 64b machines float doesn't offer much outside of SIMD-like vector operations. The extra range of double is useful, but even a 32bit float offers more precision than is really needed in most cases. When is the last time you ever did anything practical to 7 significant [decimal] figures? In physical terms that is measuring something 500 feet long to +- 1/1000 of an inch. There are certain math ops that can harm those 7 digits but using double just partially obscures the symptoms, those same math quirks also harm a double. The real solution is to use an algorithm that avoids those traps. – Max Power Jan 26 '22 at 00:01
  • @MaxPower if everything is just as simple as that then half-float and other small floats wouldn't exist and fixed-point math would be extremely common. In a few cases you don't care about the precision because the dynamic range is more important. But of course you'd have to prove that mathematically which isn't something for beginners. That's why it's only commonly used in imaging and AI – phuclv Jan 26 '22 at 01:05
  • @phuclv I'm not sure you read my whole post, it has little to do with your reply. Anyway, as an aside, fixed point is extremely common; it's called integer math. (With a dot slapped on for display.) – Max Power Jan 26 '22 at 01:39
  • @MaxPower you're completely wrong. Read the papers on floating-point and you'll see that float is far from enough. In scientific code you don't just do a few operations but lots of them, and the accumulated errors become so large that double is the only option. People have already done far more experiments than you think and came to [the decision about a good number of bits for the floating-point formats for general use](https://retrocomputing.stackexchange.com/a/13496/1981) – phuclv Jan 26 '22 at 02:32
  • 1
    As I said, doubles are not a disadvantage on a 64b cpu other than vector unit throughput. And you need to consider the algorithm being used first; if you don't, you're just fumbling around blind with any type. Correct assessment of significant digits is covered in grade 10 chemistry. The float's 7 digits are for a conservative conversion from decimal to float and back to decimal, which you only do once, for the input and final output. (The majority of numbers preserve 8 or 9.) Internally a float has slightly more precision than this; 7 is what remains after cutting off typical rounding errors. – Max Power Jan 26 '22 at 03:09
  • So in brainfloat, if I go 22 / 7, do I get a low-res 3.1400, or do I get a less chunky 3.14200 or 3.14300 more accurately? And definitely not 3.14280, as that is too much detail. – Tomachi Aug 24 '22 at 08:37
  • In my experience, doubles are what you want to stick with 99% of the time. Anything where precision could even remotely be considered an issue, you wanna stick with doubles. Floats still have their place, though, particularly in graphics, where floating point precision is still required, but only a certain amount of precision is necessary or even feasible given the pixel size. Then on the other end, we have machine learning, where high precision is not only unnecessary, but is actually a burden on the computational complexity. There, half-precision is the desired format. – Math Machine Mar 22 '23 at 00:41
19

Re: Implementations: Someone has apparently written half for C, which would (of course) work in C++: https://storage.googleapis.com/google-code-archive-downloads/v2/code.google.com/cellperformance-snippets/half.c

Re: Why is float four bytes: Probably because below that, their precision is so limited. In IEEE-754, a "half" only has 11 bits of significand precision, yielding about 3.311 decimal digits of precision (vs. 24 bits in a single yielding between 6 and 9 decimal digits of precision, or 53 bits in a double yielding between 15 and 17 decimal digits of precision).
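(For reference, those digit counts are roughly the significand width times log10 2 ≈ 0.301: 11 × 0.301 ≈ 3.31, 24 × 0.301 ≈ 7.2, and 53 × 0.301 ≈ 16.)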

T.J. Crowder
15

If you're low on memory, did you consider dropping the float concept? Floats use up a lot of bits just for saving where the decimal point is. You can work around this if you know where you need the decimal point. Say you want to save a dollar value; you could just save it in cents:

uint16_t cash = 50000;
std::cout << "Cash: $" << (cash / 100) << "." << ((cash % 100) < 10 ? "0" : "") << (cash % 100) << std::endl;

That is of course only an option if it's possible for you to predetermine the position of the decimal point. But if you can, always prefer it, because this also speeds up all calculations!

Kira M. Backes
  • 1
    That is not correct: what if cash = 402? You will print 4.2 – Et7f3XIV Feb 21 '19 at 23:35
  • 2
    @Et7f3XIV You are right, it's amazing how careless I answered on this page 8 years ago :( – Kira M. Backes Feb 25 '19 at 06:50
  • 2
    Or if you include the `<iomanip>` header, you will be able to code it this way: ```std::cout << "Cash: $" << (cash / 100) << "." << std::setfill('0') << std::setw(2) << (cash % 100) << std::endl;``` – Et7f3XIV Feb 27 '19 at 22:41
  • 3
    it's called [fixed-point arithmetic](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) when you know where the radix point is – phuclv Oct 19 '20 at 23:51
  • 1
    Fixed point is essentially integer math with a superficial dot added. float16 has a larger range than int16. There is a tradeoff. An IEEE float16 reliably has about 3 significant decimal digits over the whole range, very small to huge, while an int16 is an exact index or count of 65536 units regardless of where you fix the point. The precision at the low end of int16 is one digit but it is known to be exactly accurate, and 5 digits at the high end. Where you need accuracy as a percent of the whole and a wide range, use float; for an exact count like tracking inventory, use int or fixed point. – Max Power Jan 26 '22 at 00:22
6

There is an IEEE 754 standard for 16-bit floats.

It's a new format, having been standardized in 2008 based on a GPU released in 2002.

dan04
3

To go a bit further than Kiralein on switching to integers, we could define a range and permit the integer values of a short to represent equal divisions over the range, with some symmetry if straddling zero:

// scale a value val in [-range, range] onto the full range of a 16-bit short
short mappedval = (short)(val / range * 32767);

Differences between these integer versions and using half precision floats:

  1. Integers are equally spaced over the range, whereas floats are more densely packed near zero
  2. Using integers will use integer math in the CPU rather than floating-point. That is often faster because integer operations are simpler. Having said that, mapping the values onto an asymmetric range would require extra additions etc to retrieve the value at the end.
  3. The absolute precision loss is more predictable; you know the error in each value so the total loss can be calculated in advance, given the range. Conversely, the relative error is more predictable using floating point.
  4. There may be a small selection of operations which you can do using pairs of values, particularly bitwise operations, by packing two shorts into an int (see the sketch after this list). This can halve the number of cycles needed (or more, if short operations involve a cast to int) and maintains 32-bit width. This is just a diluted version of bit-slicing, where 32 bits are acted on in parallel, which is used in crypto.
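A minimal sketch of the packing idea from point 4, using hypothetical helper names:

#include <cstdint>

// Pack two 16-bit values into one 32-bit word so that bitwise operations
// (AND/OR/XOR, masking) act on both halves at once.
uint32_t pack(uint16_t hi, uint16_t lo) { return ((uint32_t)hi << 16) | lo; }
uint16_t high_half(uint32_t p) { return (uint16_t)(p >> 16); }
uint16_t low_half (uint32_t p) { return (uint16_t)(p & 0xFFFF); }

// Example: clear the low 4 bits of both packed values with a single AND.
// uint32_t both = pack(a, b) & 0xFFF0FFF0u;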
Phil H
2

If your CPU supports F16C, then you can get something up and running fairly quickly with something such as:

// needs to be compiled with -mf16c enabled
#include <immintrin.h>
#include <cstdint>
#include <istream>   // needed for std::istream used by operator >>

struct float16
{
private:
  uint16_t _value;
public:

  inline float16() : _value(0) {}
  inline float16(const float16&) = default;
  inline float16(float16&&) = default;
  inline float16(const float f) : _value(_cvtss_sh(f, _MM_FROUND_CUR_DIRECTION)) {}

  inline float16& operator = (const float16&) = default;
  inline float16& operator = (float16&&) = default;
  inline float16& operator = (const float f) { _value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION); return *this; }

  inline operator float () const 
    { return _cvtsh_ss(_value); }

  inline friend std::istream& operator >> (std::istream& input, float16& h) 
  { 
    float f = 0;
    input >> f;
    h._value = _cvtss_sh(f, _MM_FROUND_CUR_DIRECTION);
    return input;
  }
};

Maths is still performed using 32-bit floats (the F16C extension only provides conversions between 16-bit and 32-bit floats; no instructions exist to compute arithmetic with 16-bit floats).
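A minimal usage sketch, assuming the float16 struct above is in scope and the file is compiled with -mf16c on an F16C-capable CPU:

#include <iostream>

int main()
{
    float16 h = 3.14159f;     // float -> binary16 via _cvtss_sh
    float   f = h;            // binary16 -> float via _cvtsh_ss
    std::cout << f << '\n';   // prints roughly 3.14062 (half-precision rounding)
    return 0;
}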

robthebloke
  • This can be done without `immintrin.h`. See this answer: https://stackoverflow.com/a/64493446/1413259 – wolfram77 May 12 '21 at 13:20
  • 1
    @wolfram77 Fairly sure what you linked is about bfloat16, whereas this answer here is about half floats. – Tara Jun 03 '23 at 14:28
1

There are probably a variety of types in different implementations. A float equivalent of stdint.h seems like a good idea. Call (alias?) the types by their sizes (float16_t?). A float being 4 bytes is only true right now; it probably won't get smaller, though. Terms like half and long mostly become meaningless with time. With 128- or 256-bit computers they could come to mean anything.

I'm working with images (1+1+1 byte/pixel) and I want to express each pixel's value relative to the average. So floating point or carefully fixed point, but not 4 times as big as the raw data please. A 16-bit float sounds about right.

This GCC 7.3 doesn't know "half"; maybe it does in a C++ context.

Alan Corey
  • 128 and 256b processing is a specialty domain that is unlikely to see much of a market in general computing, with a possible exception of a single long number unit within an otherwise 64bit CPU. Anyway "long double" and "long long int" are already reserved in C++ [presumably for 128bit] though most compilers currently set them as duplicate 64bit types or x87 80bit float on x86_64 machines. long double is not to be confused with "double double math" which is two 64b floats mashed together (Slightly faster processing than using software implemented arbitrary precision math.). – Max Power Jan 26 '22 at 00:37
  • Mainframe CPUs have been between 32 and 64bit since the vacuum tube days. 8 and 16 were only used for low cost or low power consumption. Very few use cases need more than 7 significant digits of precision(32bit). 64b floats ~15 sig digits (x87 unit takes 64bit input, uses 80bit internally and returns 64bit for 19 sig digits ) 128-256b computations are very niche. 64bit address space is unlikely to be exceeded in a single machine for operational reasons and 128bit for elementary physics limitations. 8*(2^128) silicon atoms [number of bits in 128bit address space] weighs 130 tons – Max Power Jan 26 '22 at 01:32
  • @MaxPower are you sure? [The first 64-bit computer was released in 1961](https://en.wikipedia.org/wiki/Word_(computer_architecture)#Table_of_word_sizes), far later than the vacuum tube era. And `"long long int" are already reserved in C++ [presumably for 128bit]` is absolutely wrong. `long long` is already there since C++11 and has at least 64 bits – phuclv Jan 26 '22 at 02:37
  • 1
    @phuclv You need to work on comprehending what you reply to before posting. Yes, 128bits is at least 64bits, ask anyone the math really works. `if(128>=64)std::cout<<"True\n"; else std::cout<<"False\n";` ENIAC was decimal in hardware and could calculate 10 or 20 decimal digit numbers. (This is a little better than 40bit and 80bit binary); EDVAC used 44bit words; SWAC used 37bit words with both single or double precision(74bit) ; EDSAC 34 bit using two 17bit words ; Manchester Mark 1 used 40bit numbers 20 bit instructions; MEG/Mercury floating-point unit used 40bit, 30mantissa 10exponent – Max Power Jan 26 '22 at 04:24
1

A 2-byte float is available in the Clang C compiler; the data type is represented as __fp16.
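A minimal sketch, assuming a Clang target where __fp16 is enabled (typically ARM); __fp16 is a storage format and gets promoted to float in expressions:

__fp16 h = 0.5f;
float  f = h * 2.0f;   // h is promoted to float before the multiply; f == 1.0f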

Dharman
1

Various compilers now support three different half precision formats:

  • __fp16 is mostly used as a storage format. It is promoted to float as soon as you do calculations on it. Calculations on __fp16 will give a float result. __fp16 has a 5-bit exponent and a 10-bit mantissa.
  • _Float16 is the same as __fp16, but used as an interchange and arithmetic format. Calculations on _Float16 will give a _Float16 result.
  • __bf16 is a storage format with less precision. It has an 8-bit exponent and a 7-bit mantissa.

All three types are supported by compilers for the ARM architecture and now also by compilers for x86 processors. The AVX512_FP16 instruction set extension will be supported by Intel's forthcoming Golden Cove processors, and it is supported by the latest Clang, GNU, and Intel compilers. Vectors of _Float16 are defined as __m128h, __m256h, and __m512h on compilers that support AVX512_FP16.
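As a rough sketch, the three types look like this in source; which of these declarations actually compile depends on the compiler, target, and flags (e.g. -mavx512fp16 or -mfp16-format=ieee), since none of them are portable ISO C++:

__fp16   storage = 1.0f;             // promoted to float when used in arithmetic
_Float16 arith   = (_Float16)0.5f;   // arithmetic results stay _Float16
__bf16   brain;                      // bfloat16 storage: 8-bit exponent, 7-bit mantissa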

References:

https://developer.arm.com/documentation/100067/0612/Other-Compiler-specific-Features/Half-precision-floating-point-data-types

https://clang.llvm.org/docs/LanguageExtensions.html#half-precision-floating-point

A Fog