
It's clear that float16 can save memory bandwidth, but can float16 also save compute cycles when computing transcendental functions, like exp()?

  • If your hardware has full support for it, not just conversion to float32, then yes. Otherwise no. Although if you know you only need 16-bit precision, you could use a faster `exp` function with fewer polynomial terms for the mantissa, even though you're computing with float32. (exp and log are interesting for floating point because the format itself is built around an exponential representation. But normally you do want some FP operations, so I don't think you'd get anything useful with only 16-bit integer operations, without unpacking to float32). – Peter Cordes Dec 15 '22 at 04:06
  • Assuming there are ALUs that support fp16, I'm curious whether an ALU that supports fp16 is faster than one that only supports fp32. Generally, transcendental functions are implemented with Newton iteration or Taylor expansion; I'm not sure whether fp16 support in the ALU can make that faster or not. – Leonardo Physh Dec 15 '22 at 04:22
  • Oh, well yeah of course! If you can do two 64-byte vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs instead of 16 single-precision FMAs. – Peter Cordes Dec 15 '22 at 04:24
  • Yeah, that's right, that's achieved by feeding more data into a vector at the same time (wider SIMD). But digging deeper, from a single-data perspective, can an ALU that supports fp16 save compute cycles? – Leonardo Physh Dec 15 '22 at 04:37
  • What do you mean "from single data perspective"? Like in non-vectorizable code? Only via not needing as much precision, assuming that with fp32 you'd use log/exp functions that came close to full precision for its 24-bit mantissa. On CPUs with hardware support for FP16, a scalar FP16 add/mul/FMA instruction has the same performance as a scalar FP32 add/mul/FMA. (check `vmulsh` vs. `vmulss` or `vmulsd` on https://uops.info/. `vdivsh` has worse latency than `vdivss` on Alder Lake.) – Peter Cordes Dec 15 '22 at 04:42
  • Yes! I mean scalar FP16 vs. scalar FP32! – Leonardo Physh Dec 15 '22 at 06:17
  • Ok, but weird to tag this HPC, then; the default assumption in HPC is that you're going to find a way to vectorize your code, especially if you're mentioning memory bandwidth. x86 doesn't even have scalar conversion from f16 to f32, only packed, so you'd have to load 2 bytes into the bottom of a vector somehow (or load 4 bytes starting at your f16, if you're sure that can't extend into an unmapped page and fault.) That changes with AVX-512 FP16: https://en.wikipedia.org/wiki/AVX-512#FP16 `vmovw` and `vmovsh`, and scalar conversion. – Peter Cordes Dec 15 '22 at 06:22

1 Answer


If your hardware has full support for it, not just conversion to float32, then yes, definitely. e.g. on a GPU, or on Intel Alder Lake with AVX-512 enabled, or on Sapphire Rapids (see Half-precision floating-point arithmetic on Intel chips). Or apparently on Apple M2 CPUs.

If you can do two 64-byte SIMD vectors of FMAs per clock on a core, you go twice as fast if that's 32 half-precision FMAs per vector instead of 16 single-precision FMAs.
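
For example, the same FMA loop body processes twice as many elements per 64-byte vector in fp16. A minimal sketch, assuming AVX-512 FP16 hardware, a compiler with `_Float16` support (e.g. `-mavx512fp16`), and array lengths that are multiples of the vector width; the function names are just for illustration:

```c
// Same FMA loop at two element widths: 16 floats vs. 32 halves per 64-byte vector,
// so the fp16 version retires twice as many FMAs per instruction.
// Sketch only: assumes n is a multiple of the vector width, no cleanup loop.
#include <immintrin.h>
#include <stddef.h>

void fma_f32(float *a, const float *b, const float *c, size_t n) {
    for (size_t i = 0; i < n; i += 16) {                        // 16 fp32 lanes per __m512
        __m512 va = _mm512_loadu_ps(a + i);
        __m512 vb = _mm512_loadu_ps(b + i);
        __m512 vc = _mm512_loadu_ps(c + i);
        _mm512_storeu_ps(a + i, _mm512_fmadd_ps(va, vb, vc));   // a[i] = a[i]*b[i] + c[i]
    }
}

void fma_f16(_Float16 *a, const _Float16 *b, const _Float16 *c, size_t n) {
    for (size_t i = 0; i < n; i += 32) {                        // 32 fp16 lanes per __m512h
        __m512h va = _mm512_loadu_ph(a + i);
        __m512h vb = _mm512_loadu_ph(b + i);
        __m512h vc = _mm512_loadu_ph(c + i);
        _mm512_storeu_ph(a + i, _mm512_fmadd_ph(va, vb, vc));   // a[i] = a[i]*b[i] + c[i]
    }
}
```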


Speed vs. precision tradeoff: only enough precision for FP16 is needed

Without hardware ALU support for FP16, the only saving comes from not requiring as much precision, because you know you're eventually going to round to fp16. So you'd use polynomial approximations of lower degree, and thus fewer FMA operations, even though you're computing with float32.

BTW, exp and log are interesting for floating point because the format itself is built around an exponential representation. So you can do an exponential by converting fp->int and stuffing that integer into the exponent field of an FP bit pattern. Then with the fractional part of your FP number, you use a polynomial approximation to get the mantissa of the result. A log implementation is the reverse: extract the exponent field and use a polynomial approximation of log of the mantissa, over a range like 1.0 to 2.0.
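
A minimal scalar sketch of that exponent-stuffing idea for 2^x in float32 (hypothetical helper name, truncated-Taylor coefficients, no handling of overflow, NaN, or large |x|; a real implementation would use minimax coefficients sized for the target precision):

```c
// 2^x by splitting into integer and fractional parts: the integer part is
// stuffed into the exponent field, the fractional part feeds a short polynomial.
// Accuracy is only around fp16 level, which is the point: fewer terms are
// needed when you'll round to half precision anyway.
#include <math.h>
#include <stdint.h>
#include <string.h>

static float exp2_sketch(float x)
{
    float xi = roundf(x);            // integer part -> exponent field
    float f  = x - xi;               // fractional part in [-0.5, 0.5]

    // Degree-3 Taylor approximation of 2^f around 0 (coefficients are powers of ln2).
    float p = 1.0f + f * (0.69314718f + f * (0.24022651f + f * 0.05550411f));

    // Build 2^xi by placing (xi + 127) into the exponent field of a float32.
    uint32_t bits = (uint32_t)((int32_t)xi + 127) << 23;
    float scale;
    memcpy(&scale, &bits, sizeof scale);

    return scale * p;                // 2^x = 2^xi * 2^f
}
```

(For exp(x) rather than 2^x, scale the input by log2(e) ≈ 1.4427 first.)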


Normally you do want some FP operations, so I don't think it would be worth trying to use only 16-bit integer operations to avoid unpacking to float32 even for exp or log, which are somewhat special and intimately connected with floating point's significand * 2^exponent format, unlike sin/cos/tan or other transcendental functions.

So I think your best bet would normally still be to start by converting fp16 to fp32, if you don't have instructions like AVX-512 FP16 that can do actual FP math on it. But you can gain performance from not needing as much precision, since implementing these functions normally involves a speed vs. precision tradeoff.
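
A sketch of that convert / compute / convert-back pattern for a single element, assuming F16C for the packed conversions (the function name is just for illustration). In practice you'd swap expf() for a reduced-precision polynomial like the 2^x sketch above, and work on whole vectors rather than scalars:

```c
// Widen f16 -> f32 with F16C's packed conversion, do the math in float32,
// then narrow back to f16. Compile with e.g. -mf16c.
#include <immintrin.h>
#include <math.h>
#include <stdint.h>

static uint16_t exp_f16(uint16_t hx)
{
    __m128i vh = _mm_cvtsi32_si128(hx);              // half bits in the low vector element
    float   x  = _mm_cvtss_f32(_mm_cvtph_ps(vh));    // f16 -> f32 (packed conversion, low lane)
    float   r  = expf(x);                            // full precision here is overkill for f16
    __m128i hr = _mm_cvtps_ph(_mm_set_ss(r), _MM_FROUND_TO_NEAREST_INT);
    return (uint16_t)_mm_cvtsi128_si32(hr);          // f32 -> f16 bit pattern
}
```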

Peter Cordes