
I intend to use half-precision floating-point numbers in my code, but I cannot figure out how to declare them. For example, I want to do something like the following:

fp16 a_fp16;
bfloat a_bfloat;

However, the compiler does not seem to know these types (fp16 and bfloat are just dummy type names, for demonstration purposes).

I remember reading that bfloat support was added in GCC 10, but I cannot find it in the manual. I am especially interested in bfloat16 floating-point numbers.

Additional questions:

  1. Does FP16 have hardware support on Intel/AMD as of today? I think native hardware support has been present since Ivy Bridge itself. (https://scicomp.stackexchange.com/questions/35187/is-half-precision-supported-by-modern-architecture)
  2. I wanted to confirm whether using FP16 will indeed increase FLOPs. I remember reading somewhere that all arithmetic operations on fp16 are internally converted to fp32 first, so fp16 only reduces cache footprint and bandwidth.
  3. Is there SIMD intrinsic support for half-precision floats, especially bfloat16? (I am aware of intrinsics like _mm256_mul_ph, but I am not sure how to pass the 16-bit FP datatype; I would really appreciate it if someone could highlight this too.)
  4. Have these types been added to the Intel compilers as well?

PS - Related post: Half-precision floating-point arithmetic on Intel chips, but it does not cover declaring half-precision floating-point numbers.

TIA

Atharva Dubey

1 Answer


Neither the C++ nor the C language has built-in arithmetic types for half floats.

The GCC compiler supports half floats as a language extension. Quote from the documentation:

On x86 targets with SSE2 enabled, GCC supports half-precision (16-bit) floating point via the _Float16 type. For C++, x86 provides a builtin type named _Float16 which contains same data format as C.

...

On x86 targets with SSE2 enabled, without -mavx512fp16, all operations will be emulated by software emulation and the float instructions. The default behavior for FLT_EVAL_METHOD is to keep the intermediate result of the operation as 32-bit precision. This may lead to inconsistent behavior between software emulation and AVX512-FP16 instructions. Using -fexcess-precision=16 will force round back after each operation.

Using -mavx512fp16 will generate AVX512-FP16 instructions instead of software emulation. The default behavior of FLT_EVAL_METHOD is to round after each operation. The same is true with -fexcess-precision=standard and -mfpmath=sse. If there is no -mfpmath=sse, -fexcess-precision=standard alone does the same thing as before, It is useful for code that does not have _Float16 and runs on the x87 FPU.

eerorika
  • Thanks for your answer. I suppose Intel does support bf16 (https://www.intel.com/content/dam/develop/external/us/en/documents/bf16-hardware-numerics-definition-white-paper.pdf, ) – Atharva Dubey Dec 22 '21 at 09:00
  • @AtharvaDubey Yeah. I was mistaken due to different naming (bf vs bfloat; I guess they're the same thing) – eerorika Dec 22 '21 at 09:01
  • @AtharvaDubey Intel does support half-precision FP in instructions [VCVTPS2PH — Convert Single-Precision FP value to 16-bit FP value](https://www.felixcloutier.com/x86/vcvtps2ph) and [VCVTPH2PS — Convert 16-bit FP values to Single-Precision FP values](https://www.felixcloutier.com/x86/vcvtph2ps). – vitsoft Dec 22 '21 at 09:19
  • @vitsoft, I wish to use fp16 because I want to increase the FLOPS, I want to use fp16 right from the start. Converting it into fp16 first is an additional step. – Atharva Dubey Dec 22 '21 at 09:31
  • 2
    @AtharvaDubey: Intel "supports" storage of half-precision, but you can't actually do anything with them (add them, multiply them, ...). The only thing this "support" is good for is packing more stuff into memory (at the added expense of converting them to/from single precision when you actually need to do anything). If you assume conversion from one floating point format to another is a "floating point operation"; then you could increase FLOPS by continually converting back and forth without getting any useful work done. – Brendan Dec 22 '21 at 09:38
  • 1
    @Brendan: That changed with Cooper Lake, which has HW support for BF16. Apparently it launched in mid 2020, with limited release. https://en.wikichip.org/wiki/intel/microarchitectures/cooper_lake. Although it seems Ice Lake Xeon doesn't support it: https://www.anandtech.com/show/15686/intel-updates-isa-manual-new-instructions-for-alder-lake-also-bf16-for-sapphire-rapids is from 2019, but newer posts like wikichip still don't list it. But it's coming back again in Sapphire Rapids Xeon (but probably not alder lake), so yeah, Intel's playing coy with it. – Peter Cordes Dec 22 '21 at 09:51
  • 1
    Also [Is half precision supported by modern architecture?](https://scicomp.stackexchange.com/q/35187) has links about Sapphire Rapids. Yeah, as that Anandtech article commented, ugh fragmentation. If you get your hands on a server or cluster with it, great, but there's no sign yet of it getting into mainstream HW yet. Re: the usefulness: if you're limited by memory bandwidth, which is a big deal on many-core servers, it can help. It trades some extra ALU work for higher computational intensity of real work (FLOP / Byte) – Peter Cordes Dec 22 '21 at 09:53
  • @PeterCordes: If a tree falls in the woods but nobody is around to hear it, does it make a sound? If Intel releases a CPU for the 4-socket and 8-socket market that no normal person will ever see, does Cooper Lake actually exist? ;-) – Brendan Dec 22 '21 at 10:35
  • @Brendan: Yeah, pretty much agreed; I hadn't realized *how* limited Cooper Lake was when I wrote the first of those two comments. Also, I was mixing up AVX-512 BF16 (brain-float, just convert and dot-product-into-float32 in HW) with FP16 (IEEE binary16); it seems Sapphire Rapids will have FP16 *and* BF16 (https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512). (https://www.phoronix.com/scan.php?page=news_item&px=AFX-512-FP16-GCC-Patches). And wikipedia implies Alder Lake P-cores will have FP16 and BF16, which may or may not be true, although I think they're on similar uarches. – Peter Cordes Dec 23 '21 at 08:13
  • Also, Wikipedia says Zen4 will have BF16 (but not FP16). – Peter Cordes Dec 23 '21 at 08:14
  • @PeterCordes: For Alder Lake P cores; I'd expect that BF16 support technically exists in silicon, but is disabled and unusable by software (the same as AVX-512) unless motherboard does "not officially supported by Intel" work-arounds to re-enable disabled instruction set extensions in P cores if/when E cores are disabled.. If this is the case; then it's neither supported nor unsupported (but for practical purposes it's closer to "unsupported" and safer to assume "unsupported"). – Brendan Dec 23 '21 at 08:57
  • @Brendan: Oh, I didn't realize that bios option was unofficial. :( That's unfortunate. https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity/2 confirms that when AVX-512 is enabled on an Alder Lake system, it does have BF16 and FP16, same as Sapphire Rapids. (Also with 2/clock 512-bit FMA, so that's a lot of die area to not officially support.) And has some details on how unsupported it is. From that, it sounds like even the P-core-only chips won't default to AVX-512 enabled out of the box. :( – Peter Cordes Dec 23 '21 at 09:07
  • @PeterCordes: As I understand it, Intel originally wanted AVX-512 to work (when E cores are disabled) but changed their mind; and some motherboard manufacturers took functionality (to enable it) from pre-release firmware blobs and patched it into later/released firmware blobs. I don't know why Intel changed their mind, or if Intel validated AVX-512 (or BF16, etc). so it's possible that Intel changed their mind because its full of errata. Early adopters say it works (but most software doesn't use it and often errata is like "under these very specific conditions..", so that doesn't mean much). – Brendan Dec 23 '21 at 09:55
  • @Brendan: Yeah, that was my impression from the Anandtech article. As for errata, it helps that it's the supposedly the same core microarchitecture as Sapphire Rapids. And it sounds like Intel management didn't pull the "official support" rug out from under this until fairly late in the design process, so there's good reason for optimism about it working, unless the problem conditions involve power delivery / CPU frequency changes or something, or some key changes after Sapphire Rapids forked away from Alder Lake or vice versa are relevant. – Peter Cordes Dec 23 '21 at 10:15