Questions tagged [half-precision-float]

half-precision 16-bit floating point

Most uses of 16-bit floating point are of the IEEE 754 binary16 (aka half-precision) format, but other formats with different splits between exponent and significand bits exist, for example bfloat16.

(However, related formats such as Posit, which have similar uses but a different binary representation, are not covered by this tag.)

The tag wiki has links to more info, and lists other tags. (This tag was temporarily a synonym of , but should stay separate because half-precision is less widely implemented than float / binary32 and double / binary64.)


16-bit floating point has less precision (fewer significand, aka mantissa, bits) and less range (fewer exponent bits) than the widely used 32-bit single-precision IEEE 754 binary32 float or 64-bit binary64 double. But it takes less space, reducing memory footprint and bandwidth requirements, and on some GPUs it has higher arithmetic throughput.
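For reference, binary16 packs a sign bit, 5 exponent bits (bias 15), and 10 stored significand bits into those 16 bits. A minimal, library-free C sketch that decodes the raw bits by hand shows where the precision and range limits come from:

    #include <math.h>     /* ldexpf, NAN, INFINITY */
    #include <stdint.h>
    #include <stdio.h>

    /* Decode the raw bits of an IEEE 754 binary16 value:
     * 1 sign bit, 5 exponent bits (bias 15), 10 significand bits. */
    static float half_bits_to_float(uint16_t h)
    {
        int      sign = (h >> 15) & 0x1;
        int      exp  = (h >> 10) & 0x1F;
        unsigned frac =  h        & 0x3FF;
        float    s    = sign ? -1.0f : 1.0f;

        if (exp == 0x1F)                      /* all-ones exponent: Inf or NaN   */
            return frac ? NAN : s * INFINITY;
        if (exp == 0)                         /* zero or subnormal: frac * 2^-24 */
            return s * ldexpf((float)frac, -24);
        /* normal: (1 + frac/2^10) * 2^(exp - 15) */
        return s * ldexpf(1.0f + (float)frac * 0x1p-10f, exp - 15);
    }

    int main(void)
    {
        printf("%g\n", half_bits_to_float(0x3C00));  /* 1.0                            */
        printf("%g\n", half_bits_to_float(0x7BFF));  /* 65504, the largest finite half */
        printf("%g\n", half_bits_to_float(0x0001));  /* ~5.96e-8, smallest subnormal   */
        return 0;
    }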

It's fairly widely supported on GPUs, but on x86 CPUs, at least, support is limited to conversion to/from float (and only on CPUs that support AVX and the F16C extension, e.g. Intel starting with Ivy Bridge).
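Those conversions are exposed as intrinsics in <immintrin.h>; a minimal sketch, assuming GCC or Clang with -mf16c:

    /* Compile with: gcc -O2 -mf16c f16c_scalar.c */
    #include <immintrin.h>
    #include <stdio.h>

    int main(void)
    {
        float f = 3.14159f;

        /* float -> binary16 bits (round to nearest even), then back to float */
        unsigned short h  = _cvtss_sh(f, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        float          f2 = _cvtsh_ss(h);

        printf("original %.7f, after a round trip through half: %.7f\n", f, f2);
        return 0;
    }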

If a CPU SIMD extension supported half-precision math directly, it could process twice as many elements per vector and thus roughly double the throughput of float for vectorizable tasks. As of 2020, though, such support is rare: x86 has none, and ARMv8.2-A offers it only as an optional half-precision arithmetic extension.

70 questions
49 votes, 9 answers

Why is there no 2-byte float and does an implementation already exist?

Assuming I am really pressed for memory and want a smaller range (similar to short vs int). Shader languages already support half for a floating-point type with half the precision (not just convert back and forth for the value to be between -1 and…
Samaursa
38 votes, 2 answers

Why is operating on Float64 faster than Float16?

I wonder why operating on Float64 values is faster than operating on Float16: julia> rnd64 = rand(Float64, 1000); julia> rnd16 = rand(Float16, 1000); julia> @benchmark rnd64.^2 BenchmarkTools.Trial: 10000 samples with 10 evaluations. Range (min ……
Shayan
19 votes, 2 answers

Half-precision floating-point arithmetic on Intel chips

Is it possible to perform half-precision floating-point arithmetic on Intel chips? I know how to load/store/convert half-precision floating-point numbers [1] but I do not know how to add/multiply them without converting to single-precision…
Kadir
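Since F16C provides only conversions, the usual answer to the question above is convert-compute-convert: widen eight halves to float, do the arithmetic in single precision, and narrow back. A sketch of that pattern (function and array names are illustrative, and the scalar tail loop is omitted):

    /* Compile with: gcc -O2 -mavx -mf16c half_add.c */
    #include <immintrin.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    /* Add two arrays of raw binary16 values, 8 elements at a time: widen to
     * float with F16C, add in single precision, narrow back to half. */
    static void half_add(uint16_t *dst, const uint16_t *a, const uint16_t *b, size_t n)
    {
        for (size_t i = 0; i + 8 <= n; i += 8) {
            __m256 va = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(a + i)));
            __m256 vb = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(b + i)));
            __m256 vs = _mm256_add_ps(va, vb);
            _mm_storeu_si128((__m128i *)(dst + i),
                             _mm256_cvtps_ph(vs, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC));
        }
        /* a scalar tail loop is omitted in this sketch */
    }

    int main(void)
    {
        uint16_t a[8], b[8], c[8];
        for (int i = 0; i < 8; i++) {
            a[i] = _cvtss_sh((float)i, _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
            b[i] = _cvtss_sh(1.0f,     _MM_FROUND_TO_NEAREST_INT | _MM_FROUND_NO_EXC);
        }
        half_add(c, a, b, 8);
        for (int i = 0; i < 8; i++)
            printf("%g ", _cvtsh_ss(c[i]));   /* prints 1 2 3 4 5 6 7 8 */
        printf("\n");
        return 0;
    }

The same pattern works for other operations; only the middle _mm256_add_ps changes.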
17 votes, 3 answers

How to enable __fp16 type on gcc for x86_64

The __fp16 floating point data-type is a well known extension to the C standard used notably on ARM processors. I would like to run the IEEE version of them on my x86_64 processor. While I know they typically do not have that, I would be fine with…
Nonyme
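As a side note on that question: GCC never gained __fp16 on x86-64, but GCC 12 and later (with SSE2 enabled) and recent Clang expose the same IEEE binary16 type there as _Float16. A minimal sketch, assuming one of those compilers:

    /* Compile with, e.g.: gcc-12 -O2 fp16_x86.c
     * Without native FP16 hardware the compiler does the arithmetic by
     * widening to float and narrowing the result back to binary16. */
    #include <stdio.h>

    int main(void)
    {
        _Float16 a = (_Float16)1.5;
        _Float16 b = (_Float16)0.25;
        _Float16 c = a * b;            /* IEEE binary16 operands and result */
        printf("%f\n", (double)c);     /* prints 0.375000 */
        return 0;
    }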
7 votes, 1 answer

How to select half precision (BFLOAT16 vs FLOAT16) for your trained model?

How do you decide which precision works best for your inference model? Both BF16 and F16 take two bytes, but they use different numbers of bits for the fraction and the exponent. The range will be different, but I am trying to understand why one would choose one over…
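Some context for that question: bfloat16 keeps binary32's 8 exponent bits (so roughly the same dynamic range) and drops the significand to 8 bits (7 stored), while binary16 spends only 5 bits on the exponent but 11 (10 stored) on the significand, capping its largest finite value at 65504. A minimal C sketch of the difference, using simple truncation for the bf16 conversion (frameworks typically round to nearest even instead), with illustrative helper names:

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    /* bfloat16 is simply the top 16 bits of a binary32: same 8 exponent bits,
     * only 7 stored significand bits. Truncation keeps the sketch simple. */
    static uint16_t float_to_bf16_truncate(float f)
    {
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);
        return (uint16_t)(bits >> 16);
    }

    static float bf16_to_float(uint16_t h)
    {
        uint32_t bits = (uint32_t)h << 16;
        float f;
        memcpy(&f, &bits, sizeof f);
        return f;
    }

    int main(void)
    {
        /* 1e9 survives a bfloat16 round trip (with only a few significant digits),
         * but it overflows binary16, whose largest finite value is 65504. */
        float big = 1.0e9f;
        printf("bf16 round trip of 1e9: %g\n", bf16_to_float(float_to_bf16_truncate(big)));
        printf("binary16 max finite value: %g\n", 65504.0f);
        return 0;
    }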
5 votes, 3 answers

How to correctly determine at compile time that _Float16 is supported?

I am trying to determine at compile time that _Float16 is supported: #define __STDC_WANT_IEC_60559_TYPES_EXT__ #include <float.h> #ifdef FLT16_MAX _Float16 f16; #endif Invocations: # gcc trunk on linux on x86_64 $ gcc -std=c11 -pedantic -Wall…
pmor
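The detection idiom that question builds on comes from ISO/IEC TS 18661-3 (now in C23): define __STDC_WANT_IEC_60559_TYPES_EXT__ before including <float.h> and test FLT16_MAX. A cleaned-up sketch of that idiom; whether it is reliable on every compiler/target combination is exactly what the question asks:

    #define __STDC_WANT_IEC_60559_TYPES_EXT__
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef FLT16_MAX
        /* FLT16_MAX is only defined when the implementation provides _Float16 */
        _Float16 f16 = FLT16_MAX;
        printf("_Float16 supported, FLT16_MAX = %g\n", (double)f16);
    #else
        printf("_Float16 not available on this compiler/target\n");
    #endif
        return 0;
    }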
4 votes, 1 answer

float.h-like definitions for IEEE 754 binary16 half floats

I'm using half floats as implemented in the SoftFloat library (read: 100% IEEE 754 compliant), and, for the sake of completeness, I wish to provide my code with definitions equivalent to those available in <float.h> for float, double, and long…
cesss
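For reference, the binary16 limits follow directly from the format (11 significand bits including the implicit one, exponent bias 15). A sketch of <float.h>-style definitions, using a HALF_ prefix here to avoid colliding with the FLT16_ macros an implementation may already provide:

    /* Binary16 analogues of the <float.h> limits. */
    #define HALF_MANT_DIG     11                       /* significand bits, incl. implicit bit    */
    #define HALF_DIG          3                        /* decimal digits preserved on round trip  */
    #define HALF_DECIMAL_DIG  5                        /* decimal digits to round-trip a binary16 */
    #define HALF_MIN_EXP      (-13)
    #define HALF_MAX_EXP      16
    #define HALF_MAX          65504.0f                 /* (2 - 2^-10) * 2^15        */
    #define HALF_MIN          6.103515625e-05f         /* 2^-14, smallest normal    */
    #define HALF_TRUE_MIN     5.9604644775390625e-08f  /* 2^-24, smallest subnormal */
    #define HALF_EPSILON      9.765625e-04f            /* 2^-10                     */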
4 votes, 1 answer

Why does bfloat16 have so many exponent bits?

It's clear why a 16-bit floating-point format has started seeing use for machine learning; it reduces the cost of storage and computation, and neural networks turn out to be surprisingly insensitive to numeric precision. What I find particularly…
4 votes, 1 answer

GCC: why cannot compile clean printf("%f\n", f16) under -std=c11 -Wall?

Sample code: #include <stdio.h> #define __STDC_WANT_IEC_60559_TYPES_EXT__ #include <float.h> #ifdef FLT16_MAX _Float16 f16; int main(void) { printf("%f\n", f16); return 0; } #endif Invocation: # gcc trunk on linux on x86_64 $ gcc t0.c…
pmor
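The warning in that question is the usual format mismatch: "%f" expects a double, and _Float16, unlike float, is not widened by the default argument promotions, so an explicit cast is the portable way to print it. A minimal sketch:

    #define __STDC_WANT_IEC_60559_TYPES_EXT__
    #include <float.h>
    #include <stdio.h>

    int main(void)
    {
    #ifdef FLT16_MAX
        _Float16 f16 = (_Float16)0.5;
        /* "%f" expects a double, and _Float16 (unlike float) is not widened by
         * the default argument promotions, so convert explicitly before printing. */
        printf("%f\n", (double)f16);
    #endif
        return 0;
    }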
4 votes, 1 answer

Why is half-precision complex float arithmetic not supported in Python and CUDA?

NumPy has complex64, corresponding to two float32s. It also has float16, but no complex32. How come? I have a signal-processing calculation involving FFTs where I think I'd be fine with complex32, but I don't see how to get there. In…
Lars Ericson
4 votes, 1 answer

Populating MTLBuffer with 16-bit Floats

I am populating an MTLBuffer with float2 vectors. The buffer is being created and populated like this: struct Particle { var position: float2 ... } let particleCount = 100000 let bufferSize = MemoryLayout.stride *…
Jeshua Lacock
4 votes, 0 answers

Tensorflow automatic mixed precision fp16 slower than fp32 on official resnet

I am trying to use the official ResNet model benchmarks from https://github.com/tensorflow/models/blob/master/official/resnet/estimator_benchmark.py#L191 to experiment with the AMP support included in tensorflow-gpu==1.14.0rc0. I'm running on a 2080…
Eli Stevens
3 votes, 0 answers

How to verify if the tensorflow code trains completely in FP16?

I'm trying to train a TensorFlow (version 2.11.0) code in float16. I checked that FP16 is supported on the RTX 3090 GPU. So, I followed the below link to train the whole code in reduced…
3 votes, 2 answers

Detecting support for __fp16

Since version 6, clang has supported a __fp16 type. I would like to use it, but I need to support other compilers (both clang-based and non-clang-based) as well as older versions of clang, so I need a reliable way to detect support. Unfortunately,…
nemequ
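There is no single feature-test macro for __fp16, so answers to that question tend to combine target macros and compiler-version checks. A heavily hedged sketch of that approach; the __ARM_FP16_FORMAT_* macros are ARM ACLE predefines, and the Clang >= 6 cutoff comes from the question itself:

    /* Treat this as a heuristic, not a guarantee. */
    #if defined(__ARM_FP16_FORMAT_IEEE) || defined(__ARM_FP16_FORMAT_ALTERNATIVE)
    #  define HAVE_FP16 1                 /* ARM target with a defined half-precision format */
    #elif defined(__clang__) && (__clang_major__ >= 6)
    #  define HAVE_FP16 1                 /* recent clang accepts __fp16 as a storage type   */
    #else
    #  define HAVE_FP16 0                 /* be conservative everywhere else                 */
    #endif

    #if HAVE_FP16
    typedef __fp16 storage_half;          /* storage/interchange only on most targets */
    #else
    typedef unsigned short storage_half;  /* fall back to raw binary16 bits */
    #endif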
3 votes, 1 answer

FLT_MAX for half floats

I am using CUDA with half floats, or __half as they are called in CUDA. What is the half-float equivalent of FLT_MAX? The cuda_fp16.h header does not seem to have a macro that resembles this. $ grep MAX…
Bram