2

I noticed that some AVX instructions on Zen 3 have a ridiculously high μop cost compared to their Intel counterparts. According to the uops.info table:

VPGATHERDD                 Skylake   Zen3
Latency (clocks)           [0;22]    [0;28]
Reciprocal TP (measured)   5.00      8.00
μops (measured)            5         39

These numbers look like something that can affect gather performance. This question is similar to some old scalar-vs-gather questions, but those were more about Intel and didn't discuss the μop cost of gathers on Excavator/Zen at all. Maybe that's because AMD CPUs weren't popular at the time, but today it's more relevant. The only explanation I found for such a big difference is a stray comment claiming that gathers are microcoded on AMD CPUs. I didn't find any further explanation in either Agner Fog's guides or AMD's optimization guidance.

I made a benchmark* on Zen3, Skylake and Broadwell processors to see how scalar loads compare to gathers; the two loops being compared are sketched below.
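
A minimal sketch of the two loops (my illustration, consistent with the `scalarLoad` function name mentioned in the comments, but not the exact code behind the benchmark link):

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Scalar version: one indexed 32-bit load per element.
void scalarLoad(const int32_t* src, const int32_t* idx, int32_t* dst, size_t n) {
    for (size_t i = 0; i < n; ++i)
        dst[i] = src[idx[i]];
}

// Gather version: one vpgatherdd per 8 elements (compile with -mavx2).
void gatherLoad(const int32_t* src, const int32_t* idx, int32_t* dst, size_t n) {
    for (size_t i = 0; i < n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i*)(idx + i));
        __m256i v    = _mm256_i32gather_epi32(src, vidx, 4); // scale = 4 bytes
        _mm256_storeu_si256((__m256i*)(dst + i), v);
    }
}
```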

                         Broadwell   Skylake   Zen3
Scalar load (baseline)   1x          1x        1x
Gather speedup           1.5-2.1x    3.1-6x    1x

The difference in reciprocal throughput alone should account for about a 1.6x (8/5) difference in favor of Intel. How much of the rest can be attributed to the difference in μops?

Can a big μop cost hurt out-of-order execution when gathers are mixed with real code, or is this highly unlikely since Zen processors have a big μop cache? Is there a better benchmark for this?

*The initial benchmark was wrong; the link and the numbers in the table have been fixed.

Vladislav Kogan
  • I just loaded the updated Godbolt link and happened to get an AuthenticAMD instance. The times were 9472876 ns for the first, 8135083 for the second. So gathers are slightly slower as expected; assuming the scalar version ran about 1 element per cycle, the gather version was pretty close to Agner Fog's and InstLatX64's throughput measurement of 9 cycles per vector of 8 elements. (Assuming that CPU frequency had ramped to max before starting the vector loop, i.e. that the rand() calls were enough warm-up.) – Peter Cordes Apr 04 '23 at 03:02
  • I rechecked your fixed bench version with -O2 again and consistently get around 3.5M nanoseconds for both versions (with gathers being 5% slower on average) on a 4 GHz Zen3 CPU. It seems you are right. That would mean that gathers on Zen3 are about as useless as on Haswell. – Vladislav Kogan Apr 04 '23 at 03:22
  • Yes, except as part of a larger algorithm, if your indices are already in a vector as the result of a computation, and/or you need the result in a SIMD vector for further computation. Then you save on unpacking to scalar and shuffling back into vectors. – Peter Cordes Apr 04 '23 at 03:25
  • Unfortunately Zen 4 still doesn't have fast gathers or scatters. It has fast masked loads, but still very slow masked stores. (So that's a significant gap in its AVX-512 support.) Masking loads and other register writes is different (and easier) than masking stores, since masking has to be represented in the store buffer, and store-forwarding is a thing. Certainly possible that they'll get around to some of these things in Zen 5 or later, though. – Peter Cordes Apr 05 '23 at 22:53
  • Thanks, added as a summary. I hope it will be useful, since it's easy to be misled by AVX gain numbers measured on Intel and then not get them on AMD. – Vladislav Kogan Apr 07 '23 at 12:26
  • Correction: Zen 4 has fast AVX-512 masked stores like `vmovdqu32 (m256, k, ymm)`, but for some reason the AVX1/2 `vmaskmovps/pd` and `vpmaskmovd/q` are still tons of uops of microcode, not just comparing into an internal temporary mask and using the HW support. After seeing the numbers for `vmaskmovps`, I just assumed there must be no HW support for efficient masked stores. – Peter Cordes Apr 17 '23 at 07:06
  • Yes, you're right. Speaking of HW support: I also noticed that some "microcoded" instructions are microcoded very well and are in fact on par with Intel. In the official Zen4 instruction sheet `VPHADD/VPERM` are noted as microcoded without any uop counts given. In the uops table they cost `4/1` uops - just as on Intel's Golden Cove. Assuming Intel didn't microcode these instructions, that's a really good result - and quite counterintuitive to see. A ton of uops means microcode, but few uops doesn't guarantee no microcode. – Vladislav Kogan Apr 24 '23 at 00:25
  • Any instruction with 1 or 2 uops decodes as "DirectPath" (uops generated directly by the decoders), vs. "VectorPath" (decoder indirects to the microcode ROM). https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Instruction_Fetch_and_Decode_Unit I don't know why Zen 4's integer `vphaddd` is 4 uops, vs. 3 uops for `vhaddps`. On Intel, `hadd*` instructions are always 3 uops, except with a memory source. (Presumably 2 shuffles to feed a vertical add.) – Peter Cordes Apr 24 '23 at 00:33

2 Answers

2

The biggest effect of an instruction being many uops is on how well it can overlap with surrounding code (e.g. in a loop) that isn't the same instruction.

If a gather is nearly the only thing in a loop, you're mostly going to bottleneck on the throughput of the gather instruction itself, whichever part of the pipeline it is that limits gathers to that throughput.

But if the loop does a lot of other stuff, e.g. computing gather indices and/or using the gather result, or fully independent especially scalar integer work, it might run close to a front-end bottleneck (6 uops per clock cycle issue/rename on Zen 3), or a bottleneck on back-end ALU ports. (AMD has separate integer and FP back-end pipelines; Intel shares ports, although there are a few extra execution ports that only have scalar integer ALUs.) In that case, it would be the uops cost of the gather that contributes to the bottleneck.
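
For a concrete (hypothetical) example, consider a loop that computes its indices in a vector and does a little SIMD work on each gather result. The uop counts in the comments are rough Zen 3 numbers in the spirit of the uops.info measurements, not something measured for this exact loop:

```cpp
#include <immintrin.h>
#include <cstddef>

// Per 8 elements, roughly: vpgatherdd ~39 uops, vpaddd 1, vpmulld 1,
// vmovdqu store 1, loop overhead ~2  =>  ~44 uops / 6-wide issue
// ≈ 7.3 cycles of front-end work, already close to the ~8-cycle
// gather throughput. Only a handful of spare issue slots remain
// before the front end, not the gather unit, becomes the bottleneck.
void gatherMulStore(const int* src, int* dst, size_t n) {
    __m256i vidx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i step = _mm256_set1_epi32(8);
    const __m256i k    = _mm256_set1_epi32(3);
    for (size_t i = 0; i < n; i += 8) {
        __m256i v = _mm256_i32gather_epi32(src, vidx, 4);
        v = _mm256_mullo_epi32(v, k);                    // dependent SIMD ALU work
        _mm256_storeu_si256((__m256i*)(dst + i), v);
        vidx = _mm256_add_epi32(vidx, step);             // next indices, in-vector
    }
}
```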

Other than branch misses and cache misses, the 3 dimensions of performance are front-end uops, back-end ports it competes for, and latency as part of a critical path. Notice that none of these are the same as just running the same instruction back-to-back, the number you get from measuring "throughput" of a single instruction. That's useful to identify any other special bottlenecks for those uops.

Some uops may occupy a port for multiple cycles, e.g. some of Intel's gather loads are fewer uops than the total number of elements, so they might stop other loads from dispatching at some point, creating more back-end port pressure than you might expect from the number of uops for each port. FP divide/sqrt is like that, too. But since AMD's gathers are so many uops, I'd hope that they're all fully pipelined.

AMD's AVX1/2 masked stores are also a ton of uops; IDK how exactly they emulate that in microcode if they don't have efficient dedicated hardware for it, but it's not great for performance. Maybe by breaking it into multiple conditional scalar stores.

Bizarrely, Zen 4's AVX-512 masked stores like vmovdqu32 (m256, k, ymm) are efficient, single uop with 1/clock throughput (despite being able to run on either store port, according to https://uops.info/; Intel has 2/clock masked store throughput same as regular stores, since Ice Lake.) If the microcode for vpmaskmovd would just compare into a mask and use the same HW support as vmovdqu32, it would be way more efficient. I assume that's what Intel does, given the uop counts for vmaskmovps.
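
In intrinsics terms, the fast and the microcoded masked stores look like this (a sketch; the first function needs -mavx512vl, the second -mavx2):

```cpp
#include <immintrin.h>

// AVX-512VL masked store: vmovdqu32 (m256, k, ymm) - single uop on Zen 4
// per uops.info, using real hardware masking support.
void maskStoreAVX512(int* dst, __mmask8 k, __m256i v) {
    _mm256_mask_storeu_epi32(dst, k, v);
}

// AVX2 masked store: vpmaskmovd (m256, ymm, ymm) - microcoded into tons
// of uops on Zen, even though the same store hardware evidently could
// handle it if the microcode just compared the vector mask into a k-mask.
void maskStoreAVX2(int* dst, __m256i mask, __m256i v) {
    _mm256_maskstore_epi32(dst, mask, v);
}
```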



> highly unlikely since Zen processors have big μops cache?

It's not about caching the uops, it's about getting all those uops through the pipeline every time the instruction runs.

An instruction with more than 2(?) uops on AMD, or more than 4 on Intel, is considered "microcoded", and the uop cache just stores a pointer to the microcode sequencer, not all the uops themselves. This mechanism makes it possible to support instructions like `rep movsb` which run a variable number of uops depending on register values. On Intel at least, a microcoded instruction takes a whole line of the uop cache to itself (see https://agner.org/optimize/ - especially his microarchitecture guide).

Peter Cordes
  • I understand the latency/throughput topics, however uops are more complicated. Could you expand your idea about the gather bottleneck while overlapping with other instructions? vpgatherdd costs 39/8=4.8 uops per clock. What kind of instruction must be nearby to surpass the 6 uops/clock limit? Most other instructions have a uops/clock ratio closer to 1:1. What instructions should be nearby to trigger a front-end bottleneck? – Vladislav Kogan Mar 26 '23 at 01:12
  • @VladislavKogan: Some integer loop and pointer calculation overhead (`dec`/`jnz`, and maybe `add rsi, 32` or something), and some other SIMD uops like `vxorps` (0.25c throughput) or `vaddps` / `vmulps` (each 0.5c throughput, but for different ports, so an even mix is also 0.25c). Maybe also an additional store or something. With a mix of different uops that can run on different ports, you can run 6 uops per clock. See https://en.wikichip.org/wiki/amd/microarchitectures/zen_3 and the pipeline diagram for Zen 2 - https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram – Peter Cordes Mar 26 '23 at 01:22
  • Thank you. Is there an explanation for how vgatherps on Zen3 manages to be faster than scalar mov m/r32 at all? Theoretically it loses both in terms of uops (39 vs 1*8) and rTP (9 vs 0.5*8=4). Is there any other factor I forgot to take into account? – Vladislav Kogan Apr 04 '23 at 01:38
  • @VladislavKogan: If you're talking about the benchmark code you showed compiled with `clang -O2` (https://godbolt.org/z/q46zvndh7), notice that all the `std::vector`s in `main` were optimized away, not even allocated and no loads or stores. It calls `rand` 2x 8000 times discarding the result, then times empty loops, one of which increments by 8, the other by 1, so it's about 7.6x faster. `asm volatile ("" : : : "memory");` only acts on things whose address has been taken and thus could possibly be globally visible, or reachable via an input operand to the `asm`, but you gave it none. – Peter Cordes Apr 04 '23 at 02:01
  • @VladislavKogan: Adding an access to the vector like `volatile int sink = vDest[argc++]` stops clang from optimizing it away, and it does all the work instead of just the element that `argc` will index. So it makes the asm you want to benchmark, something that's very important to check on. https://godbolt.org/z/n9fG79jEb (with a CPUID dump to show whether it ran on an Intel or AMD CPU on Godbolt's AWS instances; sometimes we get a Zen 3.) TODO: dump the model string so we can see SKX vs. Ice Lake. – Peter Cordes Apr 04 '23 at 02:20
  • @VladislavKogan: If a correct benchmark still shows Zen3 running gathers faster than scalar, perhaps the scalar load of each index and the scalar store of each result are also competing for load/store ports, vs. with a gather load you have two `vmovdqu` (one load, one store) per 8 indices. So it's not just the throughput of `vpgatherdd` vs. 2/clock loads. Also, I notice the non-inline version of `scalarLoad` compiles to way more loads than when inlined into `main`, perhaps because it can't prove lack of aliasing. (With the `std::vector` control block? But it has pointer elements, not `int`. IDK) – Peter Cordes Apr 04 '23 at 02:23
  • No, I mean in general, even without optimization. I've run this code with -O0 on my Zen3 processor and still got a substantial speedup. I also ran your fixed version and am getting 24M vs 73M ns on a 5600H with -O0, and 3.1M vs 6.4M with -O2. So yes, a correct benchmark still shows gathers beating scalar on Zen3. Which is expected, since that's the whole point of implementing them. I just don't fully understand how they are faster. – Vladislav Kogan Apr 04 '23 at 02:47
  • @VladislavKogan: Benchmarking with `-O0` is meaningless. There's so much overhead from all the loads and stores of local vars, especially with extra indirection for `std::vector`. Normally SIMD with intrinsics is hurt more at `-O0` than scalar code, but probably C++ template overhead changes that. Check the scalar asm that's actually executing with your `-O2` benchmark to make sure it doesn't have a lot of extra overhead. – Peter Cordes Apr 04 '23 at 02:53
  • @VladislavKogan: It's also possible that https://uops.info/ numbers for Zen3 aren't really accurate, since they benchmark with an all-zero mask instead of all-ones. So they don't actually do any memory access, and perhaps suppressing things is slower in Zen3's microcode than letting the loads actually run. Agner Fog and http://users.atw.hu/instlatx64/AuthenticAMD/AuthenticAMD0A20F10_K19_Vermeer2_InstLatX64.txt also measured 9 or 9.83c throughput for `vpgatherdd ymm` on Zen 3, though, so maybe it's real. – Peter Cordes Apr 04 '23 at 02:58
0

The uop cache isn't the issue here. vpgatherdd ymm on AMD Zen3 has both a high μop cost and a high rTP. Consequently, gather instructions on Zen3 (and probably other Zens as well) show almost the same performance as scalar code. Hence there is little reason to use gathers on AMD processors (except when they are part of a larger SIMD algorithm).

This is similar to the early gather implementation on Intel Haswell, where gathers were about equal to scalar code until Broadwell and Skylake came out. However, this may change for the better in future Zen CPUs, as it has already improved considerably from generation to generation:

vpgatherdd ymm   Zen+   Zen2   Zen3   Zen4
μops             65     60     39     42
rTP (CPI)        20     16     8      8

A high (>=10x) μop difference also appears in some other instructions. To sum it up, the latest Zen4 (with added AVX-512 support) still doesn't have fast gathers, scatters or masked AVX/AVX2 stores. Masked loads and AVX-512 masked stores are fast, however. More details here.

These slow instructions are indeed microcoded. In the official AMD optimization guide their μop cost is noted as "ucode" without any exact number given. So they're basically emulated in microcode instead of having dedicated hardware.
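
A rough picture of what that emulation has to do for a gather (my guess at its shape, ignoring the per-element masking and fault-suppression semantics the real microcode must also implement - not AMD's actual uop sequence):

```cpp
#include <immintrin.h>

// Scalar-equivalent of vpgatherdd ymm: spill the indices, do 8 scalar
// loads, reassemble the vector. Roughly the work the microcode performs.
__m256i emulatedGather(const int* base, __m256i vindex) {
    alignas(32) int idx[8], out[8];
    _mm256_store_si256((__m256i*)idx, vindex);      // extract indices
    for (int i = 0; i < 8; ++i)
        out[i] = base[idx[i]];                      // 8 scalar loads
    return _mm256_load_si256((const __m256i*)out);  // rebuild the vector
}
```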

Vladislav Kogan