According to the Intel Skylake architecture figure, one port can be connected to multiple execution units. Can these units work simultaneously?

For example, if an integer vector multiply instruction is dispatched through port 0, it uses the "Vect ALU" unit. According to Agner Fog's tables, this instruction has a latency of 5 cycles. In the next cycle, can port 0 dispatch another instruction to the "ALU" unit before the previous vector instruction finishes its execution stage? What about the ALU and FMA units?

[Figure: Intel Skylake architecture diagram showing the execution ports and the units behind each port]

Peter Cordes

2 Answers

The "port" only controls the dispatch of uops to the set of functional units behind that port, limiting that dispatch to 1 uop per cycle per port.

Pipelined multi-cycle uops will continue executing in their respective functional units while subsequent uops issued to the same port can go to the same or different functional units. Since only one uop per cycle can come through a port, no two uops can be executing in the same pipeline stage of a functional unit, but otherwise interleaving is common - including uops that belong to different thread contexts.

Non-pipelined multi-cycle uops will block the functional unit until they complete, but will not block the port (and therefore not block the other functional units behind the port).
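
A toy model of this behaviour, as a rough Python sketch rather than a description of Skylake's actual scheduler. It assumes an oldest-first pick among uops whose unit is free, and it ignores operand dependencies and write-back conflicts. The unit names and the 4-cycle non-pipelined latency are made-up values for illustration; the 5-cycle vector multiply comes from the question.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    latency: int            # cycles from dispatch to result
    pipelined: bool = True  # True: can accept a new uop every cycle


def simulate_port(units, uops):
    """Dispatch uops (a list of unit names, oldest first) through ONE port.

    Each cycle the port dispatches at most one uop: the oldest pending uop
    whose functional unit can accept it. Returns a list of
    (unit_name, dispatch_cycle, result_cycle) tuples.
    """
    pending = list(uops)
    free_at = {name: 0 for name in units}  # first cycle each unit is free
    schedule = []
    cycle = 0
    while pending:
        for i, name in enumerate(pending):
            if free_at[name] <= cycle:
                unit = units[name]
                done = cycle + unit.latency
                # A pipelined unit is free again next cycle; a non-pipelined
                # unit stays blocked until its current uop completes.
                free_at[name] = cycle + 1 if unit.pipelined else done
                schedule.append((name, cycle, done))
                del pending[i]
                break  # the port is used up for this cycle
        cycle += 1
    return schedule


# Hypothetical units behind one port.
units = {
    "vec_mul": Unit(latency=5),                  # pipelined, multi-cycle
    "alu": Unit(latency=1),                      # pipelined, single-cycle
    "div": Unit(latency=4, pipelined=False),     # non-pipelined (made up)
}
schedule = simulate_port(units, ["vec_mul", "alu", "div", "div", "alu"])
for name, start, done in schedule:
    print(f"{name:8s} dispatched in cycle {start}, result in cycle {done}")
```

In this model the `alu` uop goes out the cycle after `vec_mul` even though the multiply is still in flight, and the second `div` has to wait for the first one while the younger `alu` uop slips past it through the same port.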

John D McCalpin

Yes, but you'd have a write-back conflict if two uops would produce a result from the same port on the same cycle. e.g. if an add started on port0 three cycles after an addps, they'd both be ready to produce a result the cycle after that.

1                      addps    starts
2                         |
3                         |
4    add starts           v
5   ready (1c lat) |  ready (4c latency)
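
To make the timing arithmetic concrete, here is a minimal Python sketch that flags cycles in which more than one result is due on the same port. It assumes a result is due `latency` cycles after its uop starts (the convention the 1-cycle add in the diagram follows). The function name `writeback_conflicts` and the tuple layout are invented for illustration, and it says nothing about how real hardware resolves such a conflict.

```python
from collections import defaultdict


def writeback_conflicts(uops):
    """uops: iterable of (name, start_cycle, latency), all on the same port.

    Returns {result_cycle: [names]} for every cycle in which more than one
    uop would deliver its result.
    """
    due = defaultdict(list)
    for name, start, latency in uops:
        due[start + latency].append(name)
    return {cycle: names for cycle, names in due.items() if len(names) > 1}


# addps (4c latency) starting in cycle 1, add (1c latency) starting in cycle 4:
conflicts = writeback_conflicts([("addps", 1, 4), ("add", 4, 1)])

# Start the add one cycle after the addps instead and the results no longer collide.
no_conflicts = writeback_conflicts([("addps", 1, 4), ("add", 2, 1)])

print(conflicts)
print(no_conflicts)
```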

I think the scheduler tries to avoid that, and/or something stalls if it happens. With sqrtsd latency being slightly variable and long (15-16 cycles), the scheduler can't be perfect so I assume at least the div/sqrt unit needs a way to stall.

Intel's optimization manual may mention some of this; Andy (Krazy) Glew mentioned in an SO comment having written about some of the complexities in the first version of Intel's compiler-writer's guide for P6.


You can test this if you have a Skylake by running a mix of mostly add instructions with the occasional addps, and seeing how close you still get to 4 uops per clock.

Or maybe better, shift instructions (p06 only) and fmul (p0 only), so you aren't also bumping into the front-end bottleneck of 4 uops / clock. Or imul (p1) and bzhi (p15).
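
As a yardstick for such an experiment, you can work out the best case that port pressure alone allows and compare the measured cycles per iteration against it. The following is a rough Python sketch of that calculation, not a model of the real scheduler: `min_cycles` is a made-up helper, ports are written as digit strings like `"06"`, and the only inputs are the port assignments mentioned above plus the 4 uops/clock front-end limit. A measured result noticeably worse than this bound would hint at a cost such as write-back conflicts.

```python
from fractions import Fraction
from itertools import combinations


def min_cycles(uop_counts, issue_width=4):
    """Lower bound on cycles per loop iteration from port pressure.

    uop_counts maps a string of port digits (e.g. "06" for p06) to the
    number of uops per iteration that can run on exactly those ports.
    For every subset S of ports, the uops confined to S need at least
    count / len(S) cycles; the front end needs total / issue_width.
    """
    groups = [(frozenset(ports), n) for ports, n in uop_counts.items()]
    all_ports = sorted(set().union(*(ports for ports, _ in groups)))
    total = sum(n for _, n in groups)
    best = Fraction(total, issue_width)  # front-end bound
    for size in range(1, len(all_ports) + 1):
        for subset in combinations(all_ports, size):
            allowed = set(subset)
            confined = sum(n for ports, n in groups if ports <= allowed)
            best = max(best, Fraction(confined, size))
    return best


# 3 shifts (p06) + 1 fmul (p0) per iteration: 4 uops over two ports.
shift_fmul = min_cycles({"06": 3, "0": 1})

# 1 imul (p1) + 1 bzhi (p15) per iteration.
imul_bzhi = min_cycles({"1": 1, "15": 1})

print(shift_fmul, imul_bzhi)
```

With 3 shifts and 1 fmul per iteration, ports 0 and 6 together need at least 2 cycles, which is well under the front-end limit, so any shortfall shows up as a back-end effect.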


On Skylake, port 1 is the only port that can handle scalar integer uops with 3-cycle latency; the other ports only handle 1-cycle integer uops. That's why imul, lzcnt, and slow-LEA are all on that port. (It's also the port whose vector ALUs are shut down when there are 512-bit uops in flight, since those presumably work by combining the 256-bit units on p0 and p1 into a 512-bit unit.)

Peter Cordes