I analyzed some code on Godbolt today with LLVM-MCA, and it models the loop at more than 4 instructions/cycle:
Dispatch Width: 6
uOps Per Cycle: 5.71
IPC: 4.91
Block RThroughput: 19.2
Unfortunately I can't share the code, but I'm confused about how this is possible, since I thought Skylake's frontend could only decode 4 instructions/cycle.
What is the theoretical maximum instructions/cycle for Skylake? Is there a small code sample which illustrates how to hit the theoretical limit?
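As a sanity check on the numbers above, here is a minimal Python sketch of how I read them. The variable names are my own, not LLVM-MCA's, and I'm assuming "uOps Per Cycle" and "IPC" are averaged over the same simulated cycle count, so that their ratio gives uops per instruction:

```python
# Figures copied from the LLVM-MCA summary above.
# Variable names are hypothetical, chosen only for this sketch.
dispatch_width = 6
uops_per_cycle = 5.71
ipc = 4.91

# Assumption: both rates use the same cycle count, so dividing them
# cancels the cycles and leaves average uops per instruction.
uops_per_instruction = uops_per_cycle / ipc

# Share of the modeled dispatch width that the loop is using.
dispatch_utilization = uops_per_cycle / dispatch_width

print(f"uops per instruction: {uops_per_instruction:.2f}")
print(f"dispatch utilization: {dispatch_utilization:.1%}")
```

If that reading is right, the model has the loop running close to its dispatch width of 6 rather than being capped at 4.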