
I spent quite a lot of time hand-optimizing low-level integer arithmetic, with some success. For instance, my subroutine for 6x6 multiplication takes 66 ticks, compared to 82 ticks for mpn_mul_basecase(6,6), on Skylake. My code is published on GitHub.

I am currently working on 8x8 multiplication for AMD Ryzen, using a Ryzen 7 3800X for benchmarking. I try hard to avoid latency bottlenecks. I've studied Agner Fog's "Instruction tables" and also Torbjörn Granlund's "Instruction latencies ...". Nothing suggests major problems with adox/adcx on Ryzen; there should be no big difference between Ryzen and Skylake as far as adox/adcx are concerned. I've benchmarked an 8x1 multiply subroutine using mulx together with one of adcq, adox or adcx; all three variants of the subroutine run fast on both Skylake and Ryzen (18–19 ticks).
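
To make the fast case concrete, the single-chain variant looks roughly like this (an illustrative 4-limb sketch in GAS/AT&T syntax, not the exact code from my repository; the register assignments are arbitrary):

```asm
# Sketch: dst[0..4] = a[0..3] * b, with a at (%rsi), dst at (%rdi),
# and b preloaded in %rdx (mulx's implicit source).
# mulx does not touch flags, so a single plain add/adc chain carries everything.
mulx    (%rsi), %r8, %r9        # r9:r8   = b * a[0]
mulx    8(%rsi), %rax, %r10     # r10:rax = b * a[1]
add     %rax, %r9
mulx    16(%rsi), %rax, %r11    # r11:rax = b * a[2]
adc     %rax, %r10
mulx    24(%rsi), %rax, %rcx    # rcx:rax = b * a[3]
adc     %rax, %r11
adc     $0, %rcx                # fold the final carry into the top limb
mov     %r8,  (%rdi)
mov     %r9,  8(%rdi)
mov     %r10, 16(%rdi)
mov     %r11, 24(%rdi)
mov     %rcx, 32(%rdi)
```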

However, when I attempt to mix adox and adcx together, my code runs awfully slowly on Ryzen. For instance, my 8x2 multiplication subroutine takes 34 ticks on a Skylake i7-6700 and 293 ticks on a Ryzen 7 3800X (an 8× difference).
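
The slow variant interleaves the two carry chains in the usual way, roughly like this (again an illustrative sketch rather than my exact 8x2 code; the accumulator lives in arbitrary registers and the top limb %r12 is assumed to start at zero):

```asm
# Sketch: acc[0..4] += b * a[0..3], with acc in r8..r12 (r12 initially 0),
# a at (%rsi) and b in %rdx. adcx propagates carries through CF, adox through OF,
# giving two independent chains -- this is the interleaving that runs slowly on Ryzen.
xor     %ecx, %ecx              # clear CF and OF; keep rcx = 0 for the tail
mulx    (%rsi), %rax, %rbx      # rbx:rax = b * a[0]
adcx    %rax, %r8               # CF chain: low halves
adox    %rbx, %r9               # OF chain: high halves
mulx    8(%rsi), %rax, %rbx
adcx    %rax, %r9
adox    %rbx, %r10
mulx    16(%rsi), %rax, %rbx
adcx    %rax, %r10
adox    %rbx, %r11
mulx    24(%rsi), %rax, %rbx
adcx    %rax, %r11
adox    %rbx, %r12
adcx    %rcx, %r12              # fold the leftover CF into the top limb
adox    %rcx, %r12              # fold the leftover OF as well
```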

Any suggestions as to why the mulx/adox/adcx code runs 8 times slower on Ryzen?

  • Does Zen have partial-flag merging stalls that make 2 independent dep chains (CF and OF) worse than the latency bottleneck of a single dep chain? I don't remember Agner Fog mentioning details for that in his microarch guide for Zen, but instruction tables usually test just the same instruction, not alternating adcx/adox. Does Zen have perf counters for any events that might be relevant? Like stalls, or back-end uops, or even merging uops specifically? – Peter Cordes Nov 29 '20 at 21:42
  • It seems like the whole point of adcx and adox is to use them as two independent chains. It'd be kind of a dirty trick on AMD's part to implement the instructions without the microarchitectural features that would actually give them acceptable performance. – Nate Eldredge Nov 30 '20 at 01:05
  • @NateEldredge: Agreed. There's some justification for AMD implementing BMI2 with slow `pdep` / `pext` - setting that CPUID feature bit lets code use several other useful instructions that are fast, notably `shlx` and so on. (Same speed as `shl` on AMD, but faster than `shl` on Intel). But the ADX feature bit *only* provides those 2 instructions, so if they inherently cause major stalls in almost all cases where you'd use them instead of normal `adc`, it's probably better not to provide them at all. They don't use a VEX encoding or anything which makes them more useful in single-dep-chain cases – Peter Cordes Nov 30 '20 at 05:18
  • Because of Haswell, BMI2 `mulx` without ADX is a useful code-path to have anyway for bigint multiply. Intel shows both ways in this whitepaper. https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/ia-large-integer-arithmetic-paper.pdf. AMD leaving out that feature bit would get code to use that version, if it has versions for all 3 cases. – Peter Cordes Nov 30 '20 at 05:20
  • (Of course, maybe it's possible that there's some beneficial way to use ADCX/OX on Zen. It's even plausible the version using both ADCX and ADOX causes some other kind of stall. e.g. one unlikely possibility is Zen2's interesting store-forwarding behaviour: if the same addressing mode / register is used it can be as fast as a register, but can predict wrong and cause a stall if 2 pointers alias.) – Peter Cordes Nov 30 '20 at 05:37
  • It looks like the code is using xmm registers as scratch space, and moving them to and from integer registers for calculations, which seems a little unusual. Any chance that's to blame? – Nate Eldredge Nov 30 '20 at 05:53
  • @NateEldredge [That mul8x2 subroutine](https://gist.github.com/krisk0/f10ed99bb8f81ad87776db6f3dfb7ece) uses xmm12 to store an intermediate value, and 6 other xmm registers to save/restore callee-saved registers. When using 56(%rdi) instead of xmm12, the code becomes even slower: 595 ticks. [mul8x1](https://gist.github.com/krisk0/839e18eb6d96654931cc3fb874061a20), which uses 6 xmm registers to save/restore rbp, rbx, r12, r13, r14 and r15, only costs 19 ticks. – Денис Крыськов Nov 30 '20 at 11:42

1 Answer


Getting rid of heavy xmm/ymm usage solved the problem.

The modified subroutine only costs 42 ticks.

Looks like Ryzen has no problems with adox/adcx. Ryzen clearly does have problems with vmovdqu loads from memory and/or vpextrq and/or vperm2i128.

The question was silly.

@NateEldredge Your hint was helpful. Thank you.
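
For anyone comparing the two spill strategies (see also the comments below), a rough illustrative fragment, assuming a System V target where the 128-byte red zone below %rsp is usable:

```asm
# XMM scratch: park a live value in a vector register and fetch it back later.
movq    %r9, %xmm12             # save r9 in xmm12
# ... other work that needs the integer register ...
movq    %xmm12, %r9             # restore it

# Store/reload alternative: spill to the red zone below %rsp instead.
# On Zen 2 the store can forward to the reload with effectively zero latency,
# and the reload can often fold into a memory operand such as adc -8(%rsp), %rax.
movq    %r9, -8(%rsp)           # spill
# ... other work ...
movq    -8(%rsp), %r9           # reload (or use -8(%rsp) directly as a source)
```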

  • You're still using XMM regs. Zen 2 has zero-latency store-forwarding so it probably makes sense to use store / reload instead of ALU `movq` to/from XMM regs, if you can't reorder your operations to not need as many values live at once in architectural registers. (Remember, all x86 CPUs have out-of-order execution, so it's fine to do multiple operations in one dep chain before starting the next, if that lets you finish with a value so you don't need to save it for later. Scheduling instructions to mix multiple dep chains, or do critical-path first, is good when you have enough regs though.) – Peter Cordes Nov 30 '20 at 20:46
  • Also, your function looks like a helper for code that already has values in R15, RAX, and so on. Put `vzeroupper` at the top of that function. Or don't use it at all: the standard way is for YMM-upper-dirtying functions to use `vzeroupper` before calling functions that might not be AVX-aware, and before returning. At the top of a normal function that doesn't take any YMM args, you can assume the CPU is in clean-upper-halves state. – Peter Cordes Nov 30 '20 at 20:51
  • @Peter Cordes I tried to store intermediate values in the target memory instead of scratch xmm registers. This sometimes results in a speed-up, sometimes in a slowdown. For instance, [this 106-tick subroutine](https://gist.github.com/krisk0/2658fd2fb8d8ce97f2640216768eb035#file-mul_8x8_ryzen-s) uses xmm scratch in the first half of the code (multiplying by v[0..3]) and memory scratch in the second half (multiplying by v[4..7]). Replacing the memory scratch with xmm scratch or vice versa results in a slowdown of this code. – Денис Крыськов Dec 01 '20 at 06:08
  • @Peter Cordes I only have **movq r64,xmm** and **movq xmm,r64** and no other vector instructions. I followed your advice and removed **vzeroupper**. This brought a slowdown (from 106 ticks to 110). – Денис Крыськов Dec 01 '20 at 06:12
  • Yes, I saw the only instruction you use is `movq`, but you have quite a few of them. One advantage to memory is that you may be able to use a memory source operand like `adc -16(%rsp), %rax` instead of a separate `movq` reload. And like I said, on your Zen 2 the store/reload can have effectively zero latency. Also, Zen2's front-end is wider than the integer back-end, so I hoped that some stores instead of ALU movq might leave more execution units free for useful work. On non-Windows where you can use the red-zone, you don't need any instructions to move RSP around. – Peter Cordes Dec 01 '20 at 06:21
  • Maybe there's nothing to be gained, e.g. if you're already bottlenecked on latency of ADC dep chains, but saving total instructions can let OoO exec see farther ahead. But it's strange that removing `vzeroupper` would give a slowdown. Are you sure you're measuring accurately? Note that AMD CPUs adjust their max turbo based on temperature so it can be hard to measure cycles if you're actually measuring time, not cycles directly using perf counters. – Peter Cordes Dec 01 '20 at 06:23
  • Or maybe using stack space would compete with loads / stores of the useful data; certainly possible. I guess you're already effectively using some memory scratch space by `mov`ing into the destination and then using memory-destination `adc`. On Intel CPUs that would be bad: memory destination adc decodes to more uops on Intel than `adc (mem), reg` + `mov reg,(mem)`. But if AMD doesn't have that problem, that can be good. – Peter Cordes Dec 01 '20 at 06:28
  • I left a comment on https://gist.github.com/krisk0/4f80a9ab2d04dd5d25eee21f2ae17fe4 to point out more optimizations, code size and saving instructions. – Peter Cordes Dec 01 '20 at 06:53
  • @Peter Cordes concerning measurement. I disabled turbo boost and fixed frequency with `echo 0 > /sys/devices/system/cpu/cpufreq/boost; cpupower frequency-set -d 3900000` – Денис Крыськов Dec 01 '20 at 07:34