I got access to an AMD Zen4 server and tested AVX-512 packed-double performance. I chose the harmonic series (the sum of 1/n over the positive integers) and compared the performance of standard doubles, AVX2 (4 packed doubles), and AVX-512 (8 packed doubles). The test code is here.
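
For reference, here is a minimal sketch of what the AVX-512 variant does, written with raw intrinsics. This is only an illustration of the loop structure described below (sum the reciprocals, bump the denominators), not the linked test code, and the function name is made up:

```cpp
#include <immintrin.h>

// Illustration only, not the linked benchmark: 8 packed-double divisions
// per iteration, denominators incremented by 8 each time. Assumes n_terms
// is a multiple of 8 for brevity. Compile with e.g. g++ -O3 -mavx512f.
double harmonic_avx512(long long n_terms) {
    __m512d sum = _mm512_setzero_pd();
    const __m512d ones  = _mm512_set1_pd(1.0);
    const __m512d eight = _mm512_set1_pd(8.0);
    // Lanes hold the denominators 1..8 (highest lane listed first).
    __m512d denom = _mm512_set_pd(8.0, 7.0, 6.0, 5.0, 4.0, 3.0, 2.0, 1.0);
    for (long long i = 0; i < n_terms / 8; i++) {
        sum   = _mm512_add_pd(sum, _mm512_div_pd(ones, denom));
        denom = _mm512_add_pd(denom, eight);
    }
    return _mm512_reduce_add_pd(sum);  // horizontal sum of the 8 lanes
}
```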

The AVX-256 version runs four times faster than the standard double version. I was expecting the AVX-512 version to run twice as fast as the AVX-256 version, but there was barely any improvement in the runtimes:

Method                          Runtime (minutes:seconds)
HarmonicSeriesPlain             0:41.33
HarmonicSeriesAVX256            0:10.32
HarmonicSeriesAVX512            0:09.82

I was scratching my head over the results, so I tested individual operations. See the full results. Here is the runtime for division:

Method                  Runtime (minutes:seconds)
div_plain               1:53.80
div_avx256f             0:28.47
div_avx512f             0:14.25

Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete. HarmonicSeriesAVX256 does more work than div_avx256f: it sums up the results and increments the denominators each time (the number of packed divisions is the same). The speed-up has to come from instruction pipelining.
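
To illustrate what I mean by pipelining, here is a hypothetical contrast. I am assuming (without having re-checked the generated assembly) that the division microbenchmark carries a dependency through the divider, while the harmonic-series divisions are independent of each other:

```cpp
#include <immintrin.h>

// Hypothetical illustration, not the actual benchmark code. In the first
// loop each vdivpd must wait for the previous result (latency-bound). In
// the second, the divisions are independent; only the cheap vaddpd chains
// (sum, denom) are loop-carried, so the divider can start a new division
// as soon as it can accept one (throughput-bound).
double latency_vs_throughput(long long iters) {
    const __m256d b    = _mm256_set1_pd(3.0);
    const __m256d ones = _mm256_set1_pd(1.0);
    const __m256d four = _mm256_set1_pd(4.0);

    __m256d x = _mm256_set1_pd(1.5);
    for (long long i = 0; i < iters; i++)
        x = _mm256_div_pd(b, x);                 // serial dependency chain

    __m256d sum   = _mm256_setzero_pd();
    __m256d denom = _mm256_set_pd(4.0, 3.0, 2.0, 1.0);
    for (long long i = 0; i < iters; i++) {
        sum   = _mm256_add_pd(sum, _mm256_div_pd(ones, denom));
        denom = _mm256_add_pd(denom, four);      // independent divisions
    }

    double s[4];
    _mm256_storeu_pd(s, _mm256_add_pd(sum, x));
    return s[0] + s[1] + s[2] + s[3];
}
```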

However, I need help finding out more details.

Analysis with llvm-mca (the LLVM Machine Code Analyzer) fails because it does not support Zen4 yet:

gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence. 
note: instruction:     vdivpd %zmm0, %zmm4, %zmm2

On an Intel platform, I would use `perf stat -M pipeline binary` to find more details, but this metric group is not available on Zen4. Any suggestions on how to analyze instruction pipelining on Zen4? I have tried these perf stat events:

cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,
fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3

and got the results here.

From this I can see that the workload is backend-bound. AMD's performance event fp_ret_sse_avx_ops.all (the number of retired SSE/AVX operations) helps, but I still want better insight into instruction pipelining on Zen4. Any tips?

  • https://agner.org/optimize/ has instruction timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). Zen4 has execution units 256-bit wide for the most part, so 512-bit instructions are single uop but take 2 cycles on most execution units. (Unlike Zen1, where they took 2 uops and thus hurt OoO exec.) And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions with 256-bit vector width, which is where a lot of the real value is. (Better shuffles, masking, vpternlogd, vector popcount, etc.) – Peter Cordes Nov 21 '22 at 08:33
  • Division isn't fully pipelined on any modern x86 CPU, and even on Intel CPUs 512-bit `vdivpd zmm` has about the same doubles-per-clock throughput as `vdivpd ymm` – Peter Cordes Nov 21 '22 at 08:38
  • Peter, thanks a lot for the link to the instruction timing tables! I did more measurements and compared Intel Icelake against AMD Zen4. AVX division on Zen4 is 2x faster than on Icelake, while other packed-double operations run at similar speed. I have concluded that Icelake has only two 256-bit wide units for division, while Zen4 has four 256-bit wide div units. Compare the results for [Zen4](https://github.com/jirka-h/AVX512/blob/main/results/AMD_EPYC_9654_96-Core_Processor/results.txt) and [Icelake](https://github.com/jirka-h/AVX512/blob/main/results/Intel_Platinum_8351N_CPU_2.40GHz/results.txt) – Jirka Nov 24 '22 at 23:20
  • Agner Fog measured one `vdivpd ymm` (4 doubles) per 5 clocks on Zen4, with performance counters measuring it dispatching to ports 0 or 1. Or 8 doubles per 9 clocks, a slight speedup with AVX-512 actually. Differences in throughput vs. Ice Lake are also in how heavily pipelined the divide unit is; e.g. Ice Lake's is one YMM per 8 clocks, on port 0 only. (But unlike Skylake, it doesn't compete with integer division.) And yeah, it's only 256-bit wide. – Peter Cordes Nov 25 '22 at 00:21

2 Answers


Zen 4 execution units are mostly 256-bit wide; a 512-bit uop occupies a unit for 2 cycles. So it's normal that 512-bit vectors don't give more raw throughput for math instructions in general on Zen 4. Using them does mean more work per uop, though, so out-of-order exec has an easier time.

In the case of division, the unit is occupied for even longer, since division isn't fully pipelined, as on all modern CPUs. Division is hard to implement.

On Intel Ice Lake for example, divpd throughput is 2 doubles per 4 clocks whether you're using 128-bit, 256-bit, or 512-bit vectors. 512-bit takes extra uops, so we can infer that the actual divider execution unit is 256-bit wide in Ice Lake, but that divpd xmm can use the two halves of it independently. (Unlike AMD).


https://agner.org/optimize/ has instruction timing tables (and his microarch PDF has details on how CPUs work that are essential to making sense of them). https://uops.info/ also has good automated microbenchmark results, free from typos and other human error except sometimes in choosing what to benchmark. (But the actual instruction sequences tested are available, so you can check what they actually tested.) Unfortunately they don't have Zen 4 results up yet, only up to Zen 3.

Zen4 has execution units 256-bit wide for the most part, so 512-bit instructions are single uop but take 2 cycles on most execution units. (Unlike Zen1, where they took 2 uops and thus hurt OoO exec.) And it has efficient 512-bit shuffles, and lets you use the power of new AVX-512 instructions with 256-bit vector width, which is where a lot of the real value is. (Better shuffles, masking, vpternlogd, vector popcount, etc.)

Division isn't fully pipelined on any modern x86 CPU. Even on Intel CPUs, 512-bit vdivpd zmm has about the same doubles-per-clock throughput as vdivpd ymm. (Floating point division vs floating point multiplication has some older data on the YMM vs. XMM situation, which is similar, although Zen4 apparently can't send different XMM vectors through the halves of its 256-bit-wide divide unit; vdivpd xmm has the same instruction throughput as vdivpd ymm.)


Fast-reciprocal + Newton iterations

For something that's almost entirely bottlenecked on division throughput (not front-end or other ports), you might consider approximate-reciprocal with a Newton-Raphson iteration or two to refine the accuracy to close to 1 ulp. (Not quite the 0.5 ulp you'd get from exact division).

AVX-512 has vrcp14pd, an approximate reciprocal for packed double that is accurate to 14 bits. So two rounds of Newton iteration should double the number of correct bits each time, to 28 and then 56 (more than the 53-bit mantissa of a double). Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision mostly talks about rsqrt, but it's a similar idea.
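
A sketch of how that could look with intrinsics (an illustration of the technique, not tuned or tested code):

```cpp
#include <immintrin.h>

// vrcp14pd (~14 correct bits) refined by two Newton-Raphson steps.
// Each step computes x = x + x*(1 - d*x) with FMAs, roughly doubling
// the number of correct bits: 14 -> 28 -> 56.
// Compile with e.g. g++ -O3 -mavx512f -mfma.
static inline __m512d fast_recip_pd(__m512d d) {
    const __m512d one = _mm512_set1_pd(1.0);
    __m512d x = _mm512_rcp14_pd(d);           // ~14-bit approximation of 1/d
    __m512d e = _mm512_fnmadd_pd(d, x, one);  // e = 1 - d*x
    x = _mm512_fmadd_pd(x, e, x);             // ~28 bits
    e = _mm512_fnmadd_pd(d, x, one);
    x = _mm512_fmadd_pd(x, e, x);             // ~56 bits, close to 1 ulp
    return x;
}
```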

SSE/AVX1 only had single-precision versions of the fast-reciprocal and rsqrt instructions (e.g. rcpps), with only 12-bit precision.

AVX-512ER has 28-bit precision versions, but only Xeon Phi ever had those; mainstream CPUs haven't included them. (Xeon Phi had very slow vdivps / vdivpd exact division, so it was much better to use the reciprocals.)

  • Hi Peter, thanks a lot for your answer! The approximate reciprocal was very useful. Here is my [implementation](https://github.com/jirka-h/harmonic_series/blob/main/harmonic_series.cpp#L101). The effect on performance varies **dramatically** between Intel Icelake and AMD Zen4: runtime for the sum of *9.6e11* terms of the harmonic series went down from *409* to *196* seconds on Icelake, and from *292* to *256* seconds on Zen4. Please note that Icelake has just two 256-bit wide AVX divider units, while Zen4 has 4. On Icelake, moving to approximate division means you increase the number of usable AVX units. – Jirka Dec 05 '22 at 00:06
  • @Jirka: Ice Lake has one 256-bit divide unit on port 0, not two. Zen4 seems to have two, on P0 and P1, per Agner Fog's testing. But yes, the throughput ratio between `vdivpd` and `vaddpd` or `vrsqrtps` or `vrsqrt14pd` differs by microarchitecture. The divider unit isn't fully pipelined on any CPU. – Peter Cordes Dec 05 '22 at 00:18
  • @Jirka: If you're going to use `float` `vrcpss`, you should maybe use a `Vec8f` of counter values in the first place. So you only need one conversion to double, of the result after one Newton iteration. Or two conversions in parallel, of the original and the first Newton iteration result, if you want to widen further. On Zen4, using 32-byte vectors for more of the work will gain throughput. You may only need the extra precision of `double` when adding the small reciprocals to the relatively large accumulators, and one Newton iteration only has about as much precision as a `float` anyway. (See the sketch after this comment thread.) – Peter Cordes Dec 05 '22 at 00:24
  • Especially if you're using standard `vrcpps` (12-bit precision), not AVX-512 `vrcp14ps` which is also available for `double` as `vrcp14pd` - that would get you to 28-bit precision, more than a single-precision float. – Peter Cordes Dec 05 '22 at 00:25
  • Interesting discovery: the approximation can have an **unexpected** effect and **increase** runtime significantly. Consider computing `a=b/a` in a loop. This computation completely breaks the pipelining, as the next iteration cannot be computed ahead of time. In this case, **runtime went up** from 19s for standard division to 33s with the approximation on Intel Icelake. On Zen4, the runtime was 14s for normal division and 38s for division using the approximation. Another aspect to notice is that `c/d` can be different from `c*(1/d)` even if the reciprocal is computed precisely. – Jirka Dec 05 '22 at 02:37
  • @Jirka: Is this with multiple Newton iterations? As my answer on [Fast vectorized rsqrt and reciprocal with SSE/AVX depending on precision](https://stackoverflow.com/q/31555260) mentioned, even `rsqrtps` plus one Newton iteration is worse latency on CPUs at that time than `divps`. And yes, `1/d` will necessarily have rounding error, unless the reciprocal is exactly representable (when it's a power of 2 like 1/8, so the value is a fraction with a power-of-2 denominator.) In general, 2 rounding steps can produce a different final result than one rounding. – Peter Cordes Dec 05 '22 at 02:46
  • @Jirka: Since multiply or dividing by a power of 2 is exact, perhaps for summing `1/n`, you could save time by computing `1/2n` and `1/4n` by multiply by 0.5 and 0.25 respectively (or FMA). Sum into separate accumulators to add smaller results to smaller totals. I guess this would entail breaking up the value-range into quadrants, and only doing the odd values in the 2nd half or something? I haven't thought of a great way to avoid double-dipping any reciprocals. – Peter Cordes Dec 05 '22 at 02:53
  • Using `Vec8f` for counter values makes sense. Thanks for the hint! As for the other-precision reciprocal variants - I would prefer to avoid intrinsics and use VCL whenever possible. – Jirka Dec 05 '22 at 03:00
  • "Ice Lake has one 256-bit divide unit on port 0, not two." @Peter - my understanding is that Icelake has one AVX FP divider as part of AVX-256 and another (256 bit wide) divider as part of AVX-512, but I might be wrong. The fact is that dividing two 256-bits packed double vectors on Intel is faster than dividing the one packed double 512-bits vector, suggesting that there are two independent 256-bits dividers. See the runtimes at the bottom of this page: https://github.com/jirka-h/AVX512 – Jirka Dec 05 '22 at 03:10
  • @Jirka: Agner Fog's and https://uops.info/'s testing of `vdivps` and `vdivpd`, including performance counters for uops per port, is pretty conclusive proof that port 0 is the only execution port with a divider unit, which is partially pipelined, same as all previous Intel CPUs. IDK what you mean by "as part of AVX-512", since the VEX encoding of `vdivpd ymm` has the same throughput as `vdivpd zmm`. There's plenty of scope for giving it more throughput by pipelining more or less heavily (latency vs. throughput), rather than replicating the unit, although apparently AMD does replicate dividers – Peter Cordes Dec 05 '22 at 03:21
  • The ZMM version of the instruction costs extra uops on Intel, from which we infer that the hardware is only 256 bits wide, and that 512-bit needs to shuffle to split or recombine the halves. That may be costing you throughput in your program where it could introduce a bottleneck on something else. The ZMM version also has nearly twice the latency, so perhaps that's what's hurting your benchmark, with OoO exec unable to hide that much latency. Or that running any 512-bit uops means the vector ALUs on port 1 are shut down, costing uop throughput if you're not *just* doing division. – Peter Cordes Dec 05 '22 at 03:21
  • @Jirka: Re: not using intrinsics. It appears https://github.com/vectorclass/version2/blob/master/vectorf512.h doesn't currently have a wrapper for `_mm512_rcp14_pd`, so if you want efficient high-precision reciprocals, you should just use an intrinsic on your Vec8d. VCL is designed to mix easily with intrinsics, with implicit conversion operators to/from `__m512d` and other vector types. Avoiding intrinsics when there's no other way to get optimal instructions makes no sense to me when you're micro-optimizing and micro-benchmarking. – Peter Cordes Dec 05 '22 at 03:26
  • @Jirka: Re: your claim that Zen4 has four vector ALU ports: that's true, but not all of them can run every instruction. Only two of them can run vmul / vfma, and a different two can run vaddps/pd. And you definitely would not expect all of them to have divide units!!! It's surprising that Zen4 can apparently run divide on more than one port. FP dividers are large and often rarely used. Only very recently with transistor budgets getting huge does it make sense to spend that many transistors providing them for the ideally-rare case of code that bottlenecks on their throughput. – Peter Cordes Dec 05 '22 at 03:30
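
A sketch of the `Vec8f` counter idea from the comment thread above, written with raw intrinsics rather than VCL (the function name and structure are illustrative, not from the thread, and use the packed `vrcpps` rather than the scalar `vrcpss`): take the 12-bit reciprocal approximation of 8 float counters, do one Newton step in float, then widen to double once per 8 terms for the accumulation.

```cpp
#include <immintrin.h>

// Illustration only: 12-bit _mm256_rcp_ps approximation, one float Newton
// step (~24 bits, about float precision), then a single float->double
// widening per 8 terms. Requires AVX-512F and FMA.
static inline __m512d recip_terms_via_float(__m256 n) {
    const __m256 one = _mm256_set1_ps(1.0f);
    __m256 x = _mm256_rcp_ps(n);              // ~12-bit 1/n
    __m256 e = _mm256_fnmadd_ps(n, x, one);   // e = 1 - n*x
    x = _mm256_fmadd_ps(x, e, x);             // ~float precision
    return _mm512_cvtps_pd(x);                // widen 8 floats to 8 doubles
}
```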

I got the answer to the question from the title (How to analyze instruction pipelining on Zen4?) directly from AMD:

For determining if a workload is backend-bound, the recommended method on Zen 4 is to use the pipeline utilization metrics. We are in the process of providing similar metrics and metric groups through the perf JSON event files for Zen 4 and they will be out very soon.

Read more details in this email thread

AMD has already posted the patches.

Before the patches land in your favorite Linux distribution, you can use the raw events on Zen4. Check this example.
