The biggest effect of a gather being many uops is on how well it can overlap with surrounding code (e.g. in a loop) that isn't the same instruction.
If a gather is nearly the only thing in a loop, you're mostly going to bottleneck on the throughput of the gather instruction itself, whichever part of the pipeline it is that limits gathers to that throughput.
But if the loop does a lot of other stuff, e.g. computing gather indices and/or using the gather result, or fully independent work (especially scalar integer work), it might run close to a front-end bottleneck (6 uops per clock cycle issue/rename on Zen 3), or a bottleneck on back-end ALU ports. (AMD has separate integer and FP back-end pipelines; Intel shares ports, although there are a few extra execution ports that only have scalar integer ALUs.) In that case, it's the uop cost of the gather that contributes to the bottleneck.
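For concreteness, here's a minimal sketch (C with AVX2 intrinsics; the function and variable names are my own, not from anything above) of the kind of loop meant: the gather itself, index computation feeding it, a use of its result, and independent scalar integer work, all competing for front-end issue/rename bandwidth and back-end ports:

```c
#include <immintrin.h>
#include <stddef.h>

// Hypothetical mixed loop: gather + index math + use of the result + scalar work.
float gather_mixed_loop(const float *table, const int *idx, size_t n,
                        unsigned *scalar_out) {
    __m256 acc = _mm256_setzero_ps();
    unsigned checksum = 0;                        // independent scalar integer work
    for (size_t i = 0; i + 8 <= n; i += 8) {
        __m256i vidx = _mm256_loadu_si256((const __m256i *)&idx[i]);
        vidx = _mm256_and_si256(vidx, _mm256_set1_epi32(0xFFFF)); // compute gather indices
        __m256 g = _mm256_i32gather_ps(table, vidx, 4);           // the 8-element gather
        acc = _mm256_add_ps(acc, g);                              // use the gather result
        checksum += (unsigned)idx[i] * 2654435761u;               // unrelated scalar work
    }
    *scalar_out = checksum;
    // horizontal sum of the vector accumulator
    __m128 lo = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    lo = _mm_add_ps(lo, _mm_movehl_ps(lo, lo));
    lo = _mm_add_ss(lo, _mm_shuffle_ps(lo, lo, 1));
    return _mm_cvtss_f32(lo);
}
```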
Other than branch misses and cache misses, the 3 dimensions of performance are front-end uops, the back-end ports it competes for, and latency as part of a critical path. Notice that none of these is the same thing as just running the same instruction back-to-back, which is the number you get from measuring the "throughput" of a single instruction. That number is still useful for identifying any other special bottlenecks for those uops.
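As a rough illustration of that distinction, a throughput-style measurement keeps successive gathers independent of each other, while a latency measurement puts the gather on a dependency chain. Something like the following sketch (made-up function names; assumes table[] has at least 8 floats; a real benchmark would also need to defeat the optimizer and time the loops):

```c
#include <immintrin.h>

// Throughput-style: indices advance independently of the gather results,
// so successive gathers can overlap in the pipeline.
__m256 gather_throughput(const float *table, long iters) {
    __m256 acc = _mm256_setzero_ps();
    __m256i idx = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    for (long i = 0; i < iters; i++) {
        acc = _mm256_add_ps(acc, _mm256_i32gather_ps(table, idx, 4));
        idx = _mm256_and_si256(_mm256_add_epi32(idx, _mm256_set1_epi32(1)),
                               _mm256_set1_epi32(7));   // not data-dependent on the gather
    }
    return acc;
}

// Latency-style: each gather's indices are derived from the previous
// gather's result, so the gathers form a serial chain (critical path).
__m256i gather_latency(const float *table, long iters) {
    __m256i idx = _mm256_setzero_si256();
    for (long i = 0; i < iters; i++) {
        __m256 g = _mm256_i32gather_ps(table, idx, 4);
        idx = _mm256_and_si256(_mm256_castps_si256(g),  // feed result back into indices
                               _mm256_set1_epi32(7));
    }
    return idx;
}
```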
Some uops may occupy a port for multiple cycles, e.g. some of Intel's gather loads are fewer uops than the total number of elements, so they might stop other loads from dispatching at some point, creating more back-end port pressure than you might expect from the number of uops for each port. FP divide/sqrt is like that, too. But since AMD's gathers are so many uops, I'd hope that they're all fully pipelined.
AMD's AVX1/2 masked stores are also a ton of uops; IDK how exactly they emulate that in microcode if they don't have efficient dedicated hardware for it, but it's not great for performance. Maybe by breaking it into multiple conditional scalar stores.
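As a purely conceptual C model of that speculation (not AMD's actual microcode, which we can't see), the store form of vpmaskmovd could in principle be emulated element by element like this:

```c
#include <immintrin.h>
#include <stdint.h>

// Conceptual model only: emulate vpmaskmovd's store form as one
// conditional scalar store per element, keyed off each mask element's sign bit.
static void maskmov_epi32_model(int *dst, __m256i mask, __m256i data) {
    int32_t m[8], d[8];
    _mm256_storeu_si256((__m256i *)m, mask);
    _mm256_storeu_si256((__m256i *)d, data);
    for (int i = 0; i < 8; i++) {
        if (m[i] < 0)          // vpmaskmovd uses the high bit of each dword
            dst[i] = d[i];     // conditional scalar store
    }
}
```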
Bizarrely, Zen 4's AVX-512 masked stores like vmovdqu32 (m256, k, ymm) are efficient: a single uop with 1/clock throughput (despite being able to run on either store port, according to https://uops.info/; Intel has had 2/clock masked-store throughput, same as regular stores, since Ice Lake). If the microcode for vpmaskmovd would just compare into a mask and use the same HW support as vmovdqu32, it would be way more efficient. I assume that's what Intel does, given the uop counts for vmaskmovps.
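In intrinsics terms, that idea would look something like the following sketch (requires AVX-512F/VL/DQ; the function name is made up, and the actual uop sequence inside microcode is of course not visible at this level):

```c
#include <immintrin.h>

// Turn an AVX2-style vector mask (sign bit per element) into a k register,
// then use the efficient AVX-512 masked-store hardware.
static void maskstore_via_kmask(int *dst, __m256i mask, __m256i data) {
    __mmask8 k = _mm256_movepi32_mask(mask);    // vpmovd2m: sign bits -> k mask
    _mm256_mask_storeu_epi32(dst, k, data);     // vmovdqu32 {k} masked store
}
```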
See also
"highly unlikely since Zen processors have big μops cache?"
It's not about caching the uops, it's about getting all those uops through the pipeline every time the instruction runs.
An instruction that decodes to more than 2(?) uops on AMD, or more than 4 on Intel, is considered "microcoded", and the uop cache just stores a pointer to the microcode sequencer, not all the uops themselves. This mechanism makes it possible to support instructions like rep movsb
which run a variable number of uops depending on register values. On Intel at least, a microcoded instruction takes a whole line of the uop cache to itself (See https://agner.org/optimize/ - especially his microarchitecture guide.)
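To illustrate the variable-uop-count case, here's rep movsb wrapped in GNU C inline asm; the number of uops the microcode sequencer feeds into the pipeline depends on the byte count in RCX at run time, which is exactly what a fixed uop-cache entry couldn't hold directly:

```c
#include <stddef.h>

// rep movsb copies RCX bytes from [RSI] to [RDI]; uop count depends on n.
static inline void copy_rep_movsb(void *dst, const void *src, size_t n) {
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
}
```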