
I wonder why operating on Float64 values is faster than operating on Float16 values:

julia> rnd64 = rand(Float64, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @benchmark rnd64.^2
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.800 μs … 662.140 μs  ┊ GC (min … max):  0.00% … 99.37%
 Time  (median):     2.180 μs               ┊ GC (median):     0.00%
 Time  (mean ± σ):   3.457 μs ±  13.176 μs  ┊ GC (mean ± σ):  12.34% ±  3.89%

  ▁██▄▂▂▆▆▄▂▁ ▂▆▄▁                                     ▂▂▂▁   ▂
  ████████████████▇▇▆▆▇▆▅▇██▆▆▅▅▆▄▄▁▁▃▃▁▁▄▁▃▄▁▃▁▄▃▁▁▆▇██████▇ █
  1.8 μs       Histogram: log(frequency) by time      10.6 μs <

 Memory estimate: 8.02 KiB, allocs estimate: 5.

julia> @benchmark rnd16.^2
BenchmarkTools.Trial: 10000 samples with 6 evaluations.
 Range (min … max):  5.117 μs … 587.133 μs  ┊ GC (min … max): 0.00% … 98.61%
 Time  (median):     5.383 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   5.716 μs ±   9.987 μs  ┊ GC (mean ± σ):  3.01% ±  1.71%

    ▃▅█▇▅▄▄▆▇▅▄▁             ▁                                ▂
  ▄██████████████▇▆▇▆▆▇▆▇▅█▇████▇█▇▇▆▅▆▄▇▇▆█▇██▇█▇▇▇▆▇▇▆▆▆▆▄▄ █
  5.12 μs      Histogram: log(frequency) by time      7.48 μs <

 Memory estimate: 2.14 KiB, allocs estimate: 5.

You might ask why I expected the opposite: because Float16 values have less floating-point precision:

julia> rnd16[1]
Float16(0.627)

julia> rnd64[1]
0.4375452455597999
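
For reference, the precision gap is visible directly from the machine epsilon of each type:

julia> eps(Float16)
Float16(0.000977)

julia> eps(Float64)
2.220446049250313e-16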

Shouldn't calculations with lower precision be faster? If not, why would anyone use Float16? They might as well use Float128!

Shayan
  • There's hardware support for 32 & 64, but I think Float16 is converted before most operations: https://docs.julialang.org/en/v1/manual/integers-and-floating-point-numbers/#Floating-Point-Numbers . On ARM processors (like an M1 Mac) there is some support, e.g. `@btime $(similar(rnd16)) .= 2 .* $rnd16;` is faster than the Float64 version. This is quite recent; see e.g. https://github.com/JuliaLang/julia/issues/40216 – mcabbott Dec 06 '22 at 14:20
  • @mcabbott, I had somewhat guessed that conversion might be the reason. Thank you so much! – Shayan Dec 06 '22 at 15:08
  • What CPU do you have? If it's x86, does it have [AVX512-FP16](https://en.wikipedia.org/wiki/AVX-512#FP16) for direct support of fp16 without conversion, scalar and SIMD? (Sapphire Rapids and newer, and probably Alder Lake with unlocked AVX-512, unfortunately not Zen 4.) If not, most x86 CPUs for the last decade have instructions for packed conversion between fp16 and fp32, but that's it. [Half-precision floating-point arithmetic on Intel chips](https://stackoverflow.com/q/49995594). If your CPU doesn't even have F16C, it would take multiple instructions to convert. – Peter Cordes Dec 07 '22 at 03:06
  • Half precision floats are often used to save memory, not speed. – Olivier Jacot-Descombes Dec 12 '22 at 17:19

2 Answers


As you can see below, the effect you are expecting does show up for Float32:

julia> rnd64 = rand(Float64, 1000);

julia> rnd32 = rand(Float32, 1000);

julia> rnd16 = rand(Float16, 1000);

julia> @btime $rnd64.^2;
  616.495 ns (1 allocation: 7.94 KiB)

julia> @btime $rnd32.^2;
  330.769 ns (1 allocation: 4.06 KiB)  # faster!!

julia> @btime $rnd16.^2;
  2.067 μs (1 allocation: 2.06 KiB)  # slower!!

Float64 and Float32 have hardware support on most platforms, but Float16 does not, and must therefore be implemented in software.
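
If you want to check this on your own machine, one option (a rough check; the output depends on your CPU and Julia version) is to inspect the native code generated for a scalar Float16 operation. On typical x86 hardware you will likely see the value converted to Float32, multiplied, and converted back (e.g. vcvtph2ps/vcvtps2ph instructions) rather than a native half-precision multiply:

julia> sq(x) = x * x;   # sq is just an illustrative helper

julia> @code_native sq(Float16(1.5))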

Note also that you should use variable interpolation (`$`) when micro-benchmarking; without it, the benchmark also measures the overhead of working with a non-constant global, whose type the compiler cannot assume. The difference is significant here, not least in terms of allocations:

julia> @btime $rnd32.^2;
  336.187 ns (1 allocation: 4.06 KiB)

julia> @btime rnd32.^2;
  930.000 ns (5 allocations: 4.14 KiB)
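
If you would rather not interpolate, a rough alternative (sketched here with an illustrative name) is to bind the array to a const global, so that its type is known to the compiler:

julia> const crnd32 = rand(Float32, 1000);

julia> @btime crnd32.^2;   # should land close to the interpolated timing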
DNF
  • x86 since Ivy Bridge has had hardware support for conversion between FP16 and FP32, but `VCVTPH2PS YMM, XMM` or `VCVTPH2PS YMM, mem` is still 2 uops on Intel. And converting back with a memory or register destination is 4 or 3 uops on Haswell (which is what the OP's 2013 CPU might be, or it might be Ivy Bridge). The conversion uops also compete for limited back-end ports: port 1 in both directions on Ivy Bridge and Haswell, plus the shuffle port (port 5) except for the memory-source version. It's an AVX instruction; IDK if Julia would use it automatically. – Peter Cordes Dec 07 '22 at 03:28

The short answer is that you probably shouldn't use Float16 unless you are running on a GPU or an Apple CPU, because (as of 2022) other processors don't have hardware support for Float16 arithmetic.
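
If the reason you reached for Float16 is memory rather than speed (see the comments below), a minimal sketch of that pattern (variable names are just illustrative) is to store narrow and compute wide, assuming the arithmetic itself fits comfortably in Float32:

julia> data16 = rand(Float16, 1_000_000);   # compact storage: 2 bytes per element

julia> result = Float32.(data16) .^ 2;      # widen for hardware-supported arithmetic

julia> back16 = Float16.(result);           # narrow again if storage is the constraint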

Oscar Smith
  • @JUL: Support didn't exist 9 years ago either. – user2357112 Dec 07 '22 at 00:57
  • Not quite true that no other CPUs have support: Alder Lake with unlocked AVX-512 has AVX512-FP16, with scalar and packed-SIMD support for FP16 (not just BF16). Also Sapphire Rapids Xeon, although that hasn't officially launched yet. See https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 for a table of extensions by CPU. And [Half-precision floating-point arithmetic on Intel chips](https://stackoverflow.com/q/49995594). But yes, no mainstream x86 CPUs with a launch date before 2023 have officially supported FP16 on the CPU, only the iGPU. – Peter Cordes Dec 07 '22 at 03:16
  • I wouldn't say that you *shouldn't* use Float16 on other hardware. In a specialized circumstance where you're doing a bunch of number crunching, and don't require numbers bigger than 65504, don't require more than 3 decimal digits of precision, and don't require maximizing CPU speed, *but* you have *massive* arrays of these numbers and memory is at a premium, then using Float16 would be a useful optimization. OTOH, if you don't need a lot of memory but do need speed or accuracy, use Float64. – dan04 Dec 07 '22 at 23:24
  • Yeah, there are technically places where it can be useful, but there is usually some other way of reducing memory consumption that will be faster at that point. – Oscar Smith Dec 08 '22 at 05:15