I analyzed some code in Godbolt today with LLVM-MCA, and it models the loop at more than 4 instructions/cycle (IPC 4.91):

Dispatch Width:    6
uOps Per Cycle:    5.71
IPC:               4.91
Block RThroughput: 19.2

Unfortunately I can't share the code, but I'm confused about how this is possible, since I thought Skylake could only decode 4 instructions/cycle in the front end.

What is the theoretical maximum instructions/cycle for Skylake? Is there a small code sample which illustrates how to hit the theoretical limit?

Peter Cordes
Elliot Gorokhovsky
  • Note: I did run LLVM-MCA with `-mcpu=skylake` as a CLI param. – Elliot Gorokhovsky Oct 15 '22 at 19:22
  • I believe Skylake has a uop cache, so if your loop is very small, it only has to decode the instructions once. – Nate Eldredge Oct 15 '22 at 20:30
  • @NateEldredge: The narrowest point in Skylake's pipeline is the issue/rename stage, which is 4 **uops** wide (fused-domain, so micro-fused uops count as 1). With a tight loop ending with `dec/jnz`, 5 instructions decode to 4 uops. SKL and HSW can both run that at 5 IPC, and macro-fusion is the only way they can exceed 4 IPC. (The limit is 6 instructions with two macro-fused branches, at least one of which is not taken.) Related, re: unfused-domain throughput limits: https://www.agner.org/optimize/blog/read.php?i=415#852 shows 7 execution ports busy every clock. – Peter Cordes Oct 17 '22 at 07:04
  • @NateEldredge: And BTW, the uop cache is hundreds of "lines"; if packed perfectly, up to 1536 uops can be cached. Even largeish loops involving multiple function calls can run mostly from the uop cache. The loop buffer (LSD) is disabled by microcode updates; otherwise loops of 64 uops or fewer could run from it, which is small but not "very small". – Peter Cordes Oct 17 '22 at 07:18
  • Skylake's legacy decode can handle up to 6 instructions per cycle, if it can make two macro-fusions. There are only 4 actual decoders, which can produce a total of up to 5 uops. (The diagrams at https://en.wikichip.org/wiki/intel/microarchitectures/skylake_(client)#Fetch_.26_pre-decoding are wrong; there is no 5th simple decoder. You only get 5 uops decoded per cycle with patterns like 2-1-1-1 or 3-1-1, starting with a multi-uop instruction, which helps close up the bubble if a multi-uop instruction hit a simple decoder in the previous cycle so that it produced fewer than 4.) – Peter Cordes Oct 17 '22 at 07:22
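
The arithmetic behind the 5 IPC and 6 IPC figures in the comments above can be sketched in a few lines of Python. This is only an idealized back-of-the-envelope model, not a simulator: the function names are hypothetical, it assumes every instruction is a single fused-domain uop apart from macro-fused pairs, it assumes at most one loop iteration per cycle, and it ignores every bottleneck other than the 4-uop issue/rename width.

```python
# Idealized model of the issue/rename bottleneck described in the comments.
# Assumptions (not measured): all instructions are single-uop, a macro-fused
# pair (e.g. dec/jnz) issues as one uop, and a loop can complete at most one
# iteration per cycle because of its taken branch.

ISSUE_WIDTH_UOPS = 4  # fused-domain uops per cycle, per the comment above


def fused_domain_uops(instructions: int, macro_fused_pairs: int) -> int:
    """Count the uops issued per loop iteration under the assumptions above."""
    if instructions < 1 or macro_fused_pairs < 0:
        raise ValueError("need at least one instruction and no negative pair count")
    if 2 * macro_fused_pairs > instructions:
        raise ValueError("each macro-fused pair consumes two instructions")
    return instructions - macro_fused_pairs


def steady_state_ipc(instructions: int, macro_fused_pairs: int,
                     issue_width: int = ISSUE_WIDTH_UOPS) -> float:
    """Instructions per cycle when issue width is the only limit."""
    uops = fused_domain_uops(instructions, macro_fused_pairs)
    # Fractional cycles are an idealization; real hardware issues in groups.
    cycles_per_iteration = max(1.0, uops / issue_width)
    return instructions / cycles_per_iteration


if __name__ == "__main__":
    print(steady_state_ipc(5, 1))  # 4 plain instructions + dec/jnz -> 5.0
    print(steady_state_ipc(6, 2))  # two macro-fused branches -> 6.0
    print(steady_state_ipc(4, 0))  # no macro-fusion -> 4.0
```

Under this model the only way to exceed 4 IPC is to pack more than 4 instructions into 4 fused-domain uops, which matches the point made in the comments that macro-fusion is what lifts the ceiling.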
