
I am trying to find information about the integer and floating-point functional units in AMD's Zen 3 processor architecture.

I am also looking for the issue time (minimum time between two operations) and latency of integer and floating-point (single and double precision) addition and multiplication.

I was using this link for the architecture; the Zen 3 details start at page 241.

And this link for the instruction information, which starts at page 113.

I have gone through uops.info to read up on more of the instruction latencies.

I have also read hack.md

I am not 100% sure the information I gathered is correct. The processor I have in mind is a Ryzen 7 5700X. Here is what I gathered:

===1===

4 Integer ALU FUs (multiply/divide only use 1 out of 4) & 2 branch units & 3 Address Generation units (Can execute 6 integer instructions per clock cycle on average as long as they are all different types)

6 Floating Point FUs (Including 2 multiply/addition & 2 further addition), 2 Address Generation units

===2===

Issue/latency of IADD : L1 I1

Issue/latency of IMUL : L3 I3

Issue/latency of FADD : L3 or L6 (from uops, not sure where to get issue time)

Issue/latency of FMUL : L3 or L6 (from uops, not sure where to get issue time)

I am unsure whether the floating-point data I gathered is for single or double precision.

===3===

Fused multiply-add has L4. Throughput of 2 FADD, 2 FMUL. Simple integer instructions have throughput 4.

I think my information in point 1 is correct. However, I am unable to confirm the latency in part 2, I am also unable to find the issue time for these instructions. I would like some help in verifying the information I gathered and how/where I can find the data I need for part 2.

I have tried reading through both the PDFs (Zen 3 section) as well as uops.info to gather data but I am not confident if what I understand is correct and would like to request assistance in clearing up my misunderstandings.

Donny
  • *6 Floating Point FUs* - 6 ports, but different functional units can share the same port. e.g. SIMD shuffles and bitwise booleans clearly need different hardware than FP math. And 2 of those 6 ports are dedicated (1 each) to FP store-data and FP->int transfers, so really only 4 FP execution ports (2 having MUL/FMA units, 2 having ADD units, same as Zen 2, but no competition from FP stores anymore.) https://en.wikichip.org/wiki/amd/microarchitectures/zen_3#Key_changes_from_Zen_2 / https://en.wikichip.org/wiki/amd/microarchitectures/zen_2#Block_Diagram shows the 4 SIMD/FP pipes in Zen 2 – Peter Cordes Feb 23 '23 at 11:02
  • @PeterCordes hello, thank you for the clarification, I was wondering where the missing two were, now I know. – Donny Feb 23 '23 at 11:58
  • What is Issue Time? Is it something AMD-specific? uiCA only simulates Intel CPUs and has a fixed 5c delay between issue and dispatch of any instruction. – Margaret Bloom Feb 23 '23 at 12:07
  • @MargaretBloom hello, I updated the post to include the meaning. I learnt of this term from a website that took from some book presumably. It was referring to it as "minimum time between two operations". – Donny Feb 23 '23 at 12:16
  • The minimum time (in cycles) between execution of *dependent* operations is called *latency*. https://uops.info/ measures that for every instruction, or at least an upper bound for instructions where the input and output are in different domains. (Like `movd xmm0, eax` or `vmovmskps eax, xmm0`, `ucomisd xmm0, xmm1`, or loads/stores.) In Intel / x86 terminology, sending a uop to an execution port is "dispatch", while "issue" is alloc/rename and move a uop from the front-end to the ROB+scheduler. Those two terms are swapped vs. many other computer-architecture texts. – Peter Cordes Feb 23 '23 at 12:25
  • BTW, `FADD` is legacy x87 (80-bit long double). For scalar float or double, and SIMD, look at `addss` (scalar single) / `addsd` (scalar double), and `addpd` (packed double) / `vaddpd ymm` (256-bit vectors of 4 doubles). Look at compiler output to see what instructions compilers actually use; [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) – Peter Cordes Feb 23 '23 at 20:41
  • *(Can execute 6 integer instructions per clock cycle on average as long as they are all different types)* - I think the front-end is a bottleneck for that, unless that's changed since Zen 1; issue/rename can handle up to 5 instructions per clock, or up to 6 uops. To get 6 uops through the front-end, at least one of the instructions has to be more than 1 uop. – Peter Cordes Feb 23 '23 at 20:49
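To make the latency vs. throughput distinction from the comments above concrete, here is a minimal C sketch. It is not a calibrated benchmark, and the function names (`chain_dependent`, `chain_independent`, `seconds_for`) are made up for illustration. The first loop is a single dependency chain, so it is limited by add latency. The second keeps four independent chains going, so it is limited more by how many adds the core can start per cycle. It assumes a build without `-ffast-math`, so the compiler is not allowed to reassociate the floating-point adds.

```c
#include <assert.h>
#include <time.h>

/* One long dependency chain: every add needs the previous result,
   so the loop is bound by the *latency* of the FP add. */
double chain_dependent(long n)
{
    volatile double one = 1.0;   /* volatile read hides the value from the optimizer */
    const double step = one;
    double x = 0.0;
    for (long i = 0; i < n; i++)
        x += step;
    return x;
}

/* Four independent chains: adds from different chains can overlap,
   so the loop is bound more by *throughput* than by latency.
   (A compiler may also vectorize this; that still measures throughput.) */
double chain_independent(long n)
{
    assert(n % 4 == 0);          /* keeps the sketch simple: no remainder loop */
    volatile double one = 1.0;
    const double step = one;
    double a = 0.0, b = 0.0, c = 0.0, d = 0.0;
    for (long i = 0; i < n; i += 4) {
        a += step;
        b += step;
        c += step;
        d += step;
    }
    return (a + b) + (c + d);
}

/* CPU time in seconds for one call; the result is stored so the work is not discarded. */
double seconds_for(double (*fn)(long), long n, double *result)
{
    clock_t t0 = clock();
    *result = fn(n);
    clock_t t1 = clock();
    return (double)(t1 - t0) / CLOCKS_PER_SEC;
}
```

Calling `seconds_for` on both functions with the same large `n` (a few hundred million, say) and comparing the two times gives a rough feel for the gap between latency-bound and throughput-bound code. Converting seconds to cycles needs the actual clock frequency during the run, which this sketch does not measure; uops.info does that properly with performance counters.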

1 Answer


Just wanted to close this question. I think this is it for Zen 3, though I'm not exactly sure: https://i.stack.imgur.com/Hfp7U.png

Donny
  • `fadd` is an x87 instruction so it's confusing to use it for talking about other FP addition instructions if that's what you're doing in the 2nd section, about xmm/ymm. The instructions that operate on xmm/ymm registers are `addss` / `addsd` and `vaddps/pd`. – Peter Cordes Feb 25 '23 at 09:48
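As a small illustration of the point about which instructions to look up, here is a sketch of C functions whose compiler output can be inspected. The function names are made up for this example. On x86-64, compilers typically turn the scalar `float` add into `addss` and the scalar `double` add into `addsd` (or the `v`-prefixed forms when AVX is enabled), not x87 `fadd`. The array loop may be auto-vectorized into packed adds such as `vaddpd`, depending on compiler, version, and flags.

```c
#include <assert.h>

/* Scalar single precision: look for addss / vaddss in the assembly. */
float add_single(float a, float b)
{
    return a + b;
}

/* Scalar double precision: look for addsd / vaddsd in the assembly. */
double add_double(double a, double b)
{
    return a + b;
}

/* Element-wise add over arrays: with optimization enabled, a compiler
   may vectorize this into packed adds (addpd, or vaddpd on ymm registers). */
void add_arrays(double *restrict dst, const double *restrict a,
                const double *restrict b, int n)
{
    assert(n >= 0);
    for (int i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling with something like `gcc -O3 -march=znver3 -S` and reading the generated `.s` file shows which mnemonics are actually used, and those are the ones to look up on uops.info.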