Yes, but you'd have a write-back conflict if two uops would produce a result from the same port on the same cycle. e.g. if an add started on port0 three cycles after an addps (4c latency on Skylake), they'd both be ready to produce a result the cycle after that.
    cycle 1   addps starts
    cycle 2     |
    cycle 3     |
    cycle 4     |                     add starts
    cycle 5   ready (4c latency)  |  ready (1c latency)
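
To make the arithmetic concrete, here is a toy model of that conflict. It assumes a uop issued on cycle `issue` with latency `lat` needs the port's write-back slot on cycle `issue + lat`; that is a simplification for illustration, not a description of the real scheduler, and the function names are made up for this sketch.

```python
# Toy model (an assumption for illustration, not documented hardware behaviour):
# a uop that issues on cycle `issue` with latency `lat` wants its port's
# write-back slot on cycle issue + lat.

def writeback_cycle(issue, latency):
    return issue + latency

def find_conflicts(uops):
    """uops: list of (name, issue_cycle, latency), all on the same port.

    Returns {cycle: [names]} for every cycle where more than one result is due.
    """
    due = {}
    for name, issue, latency in uops:
        due.setdefault(writeback_cycle(issue, latency), []).append(name)
    return {cycle: names for cycle, names in due.items() if len(names) > 1}

# addps (4c latency) issues on cycle 1; add (1c latency) issues 3 cycles later.
same_port = [("addps", 1, 4), ("add", 4, 1)]
conflicts = find_conflicts(same_port)
print(conflicts)  # {5: ['addps', 'add']}
```

In this model two uops on one port collide exactly when their issue cycles differ by the difference in their latencies, so a 1c uop issued 3 cycles after a 4c uop is the problem case.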
I think the scheduler tries to avoid that, and/or something stalls if it happens. With sqrtsd latency being long and slightly variable (15-16 cycles), the scheduler can't be perfect, so I assume at least the div/sqrt unit needs a way to stall.
Intel's optimization manual may mention some of this; Andy (Krazy) Glew mentioned in an SO comment having written about some of the complexities in the first version of Intel's compiler-writer's guide for P6.
You can test this if you have a Skylake by running a mix of mostly add instructions with the occasional addps, and seeing how close you still get to 4 uops per clock.
Or maybe better, shift instructions (p06 only) and fmul (p0 only), so you aren't also bumping into the front-end bottleneck of 4 uops / clock. Or imul (p1) and bzhi (p15).
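
If you want to try the imul / bzhi mix, one way to set it up is to generate the unrolled loop body with a script so the ratio is easy to vary. The sketch below is only that: it assumes NASM syntax, a hypothetical choice of registers (esi and edi as read-only inputs, ecx as the loop counter, caller-saved registers as write-only destinations so the uops stay independent), and that you wrap the output in your own program and measure cycles and uops with performance counters. The default ratio is just a starting point to experiment with.

```python
def make_test_loop(n_imul=1, n_bzhi=3, groups=4):
    """Return NASM-syntax text for a loop mixing imul (p1) and bzhi (p15).

    Every instruction writes a destination it does not read, and only reads
    esi/edi, so there are no dependency chains between the generated uops.
    Register names and the loop skeleton are illustrative assumptions.
    """
    dests = ["eax", "edx", "r8d", "r9d", "r10d", "r11d"]
    lines = [".loop:"]
    i = 0
    for _ in range(groups):
        for _ in range(n_imul):
            lines.append(f"    imul {dests[i % len(dests)]}, esi, 3")
            i += 1
        for _ in range(n_bzhi):
            lines.append(f"    bzhi {dests[i % len(dests)]}, esi, edi")
            i += 1
    # Loop overhead: ecx must be initialised by the surrounding program.
    lines.append("    dec ecx")
    lines.append("    jnz .loop")
    return "\n".join(lines)

loop_text = make_test_loop()
print(loop_text)
```

Changing `n_imul` and `n_bzhi` lets you vary how much 1-cycle work has to share port 1 with the 3-cycle imul uops, and then compare the measured uops per clock against what the port assignments alone would predict.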
On Skylake, port 1 is the only port that can handle scalar integer uops with 3-cycle latency; the rest only handle 1-cycle integer uops. That's why imul, lzcnt, and slow-LEA are all on that port. (It's also the port whose vector ALUs are shut down when there are 512-bit uops in flight, since those presumably work by combining the 256-bit units on p0 and p1 into a 512-bit unit.)