I got access to the AMD Zen4 server and tested AVX-512 packed double performance. I chose Harmonic Series Sum[1/n over positive integers] and compared the performance using standard doubles, AVX2 (4 packed doubles) and AVX-512 (8 packed doubles). The test code is here.
AVX-256 version runs four times faster than the standard double version. I was expecting the AVX-512 version to run two times faster than the AVX-256 version, but there was barely any improvement in runtimes:
Method Runtime (minutes:seconds)
HarmonicSeriesPlain 0:41.33
HarmonicSeriesAVX256 0:10.32
HarmonicSeriesAVX512 0:09.82
I was scratching my head over the results and tested individual operations. See full results. Here is runtime for the division:
Method Runtime (minutes:seconds)
div_plain 1:53.80
div_avx256f 0:28.47
div_avx512f 0:14.25
Interestingly, div_avx256f takes 28 seconds, while HarmonicSeriesAVX256 takes only 10 seconds to complete. HarmonicSeriesAVX256 is doing more operations than div_avx256f - summing up the results and increasing the denominator each time (the number of packed divisions is the same). The speed-up has to be due to the instructions pipelining.
However, I need help finding out more details.
The analysis with the llvm-mca
(LLVM Machine Code Analyzer) fails because it does not support Zen4 yet:
gcc -O3 -mavx512f -mfma -S "$file" -o - | llvm-mca -iterations 10000 -timeline -bottleneck-analysis -retire-stats
error: found an unsupported instruction in the input assembly sequence.
note: instruction: vdivpd %zmm0, %zmm4, %zmm2
On the Intel platform, I would use
perf stat -M pipeline binary
to find more details, but this metricgroup is not available on Zen4. Any more suggestions on how to analyze the instructions pipelining on Zen4? I have tried these perf stat events:
cycles,stalled-cycles-frontend,stalled-cycles-backend,cache-misses,sse_avx_stalls,fp_ret_sse_avx_ops.all,fp_ret_sse_avx_ops.div_flops,fpu_pipe_assignment.total,fpu_pipe_assignment.total0,
fpu_pipe_assignment.total1,fpu_pipe_assignment.total2,fpu_pipe_assignment.total3
and got the results here.
From this I can see, that the workload is backed bound. AMD's performance event fp_ret_sse_avx_ops.all
( the number of retired SSE/AVX operations) helps, but I still want to get better insights into instructions pipelining on Zen4. Any tips?