
AMD Zen 4 has only 256-bit wide operations on vector data. The following diagram from Chips and Cheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory):

[Diagram from Chips and Cheese's Zen 4 article: the FP unit's 6 pipelines (4 ALU, 2 memory)]

Each FMA does 1 multiplication and 1 addition, while an FADD does only 1 addition. So does this mean it can theoretically do a total of 2 multiplications and 4 additions = 6 operations of 256 bits each?

Assuming all 4 adds and 2 muls can be issued in the same cycle, does this mean 256 bits (i.e. 8 floats of 32-bit precision) × 6 = 48 elements are computed per cycle (or 48 GFLOP/s per GHz)?
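
Working that out explicitly (assuming the 4 FP ALU pipes split as 2 FMA + 2 FADD, each 256 bits wide, and counting an FMA as 2 FLOPs):

    2 FMA  pipes × 8 fp32 lanes × 2 FLOPs (mul + add) = 32 FLOP/cycle
    2 FADD pipes × 8 fp32 lanes × 1 FLOP              = 16 FLOP/cycle
                                                total = 48 FLOP/cycle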

Assuming all operands are in registers, there should be enough bandwidth to feed the FPU (the stated L1 read bandwidth of 2×256 bits per cycle would only be enough for 8 FLOPs per cycle if both inputs of every operation came from memory, but registers must be much faster), but the FPU throughput isn't clearly shown.

How does this compare to Intel 11th/12th/13th gen? For example, some workstation Xeons had 2 FPUs of 512 bits each, but no dedicated "add" units. Is it fair to compare FLOPS between CPUs with different ratios of muls to adds? It looks like AMD is better on:

d += a * b + c;   // 1 FMA + 1 add
// or
d += a * b;       // 1 FMA
e += c;           // 1 independent add

while intel is better on:

d = a * b + c;   // 1 FMA
// or
d += a * b;      // 1 FMA

per GFLOPS. Intel's FLOPS value looks better for matrix multiplication and blending. AMD's FLOPS value looks better for chained matrix add & multiply, and for loops with a float accumulator plus matrix multiplication.

So when doing matrix multiplication, is Zen 4 effectively 32 FLOPs per cycle?


1 Answer


Yes, 48 FLOP / cycle theoretical max throughput on Zen 4 if you have a use for adds and FMAs in the same loop.
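
As a rough illustration of what that looks like in code, here's a synthetic sketch using AVX2/FMA intrinsics (the function and array names are made up): each iteration does 2 FMAs that can go to the FMA pipes and 2 plain adds that can go to the FADD pipes, with only 2 loads, so L1d load bandwidth shouldn't be the limiter.

    #include <immintrin.h>
    #include <stddef.h>

    // acc += a[i] * coef   -> feeds the 2 FMA pipes
    // sum += a[i]          -> feeds the 2 FADD pipes
    void mixed_fma_add(const float *a, size_t n, float coef_scalar,
                       __m256 *acc_out, __m256 *sum_out)
    {
        __m256 coef = _mm256_set1_ps(coef_scalar);
        __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
        __m256 s0 = _mm256_setzero_ps(), s1 = _mm256_setzero_ps();
        for (size_t i = 0; i + 16 <= n; i += 16) {
            __m256 v0 = _mm256_loadu_ps(a + i);      // load pipe
            __m256 v1 = _mm256_loadu_ps(a + i + 8);  // load pipe
            acc0 = _mm256_fmadd_ps(v0, coef, acc0);  // FMA pipe
            acc1 = _mm256_fmadd_ps(v1, coef, acc1);  // FMA pipe
            s0 = _mm256_add_ps(s0, v0);              // FADD pipe
            s1 = _mm256_add_ps(s1, v1);              // FADD pipe
        }
        *acc_out = _mm256_add_ps(acc0, acc1);
        *sum_out = _mm256_add_ps(s0, s1);
    }

To actually sustain 2 FMAs/cycle you'd unroll with more accumulators (FMA latency is about 4 cycles on Zen 4, so you want around 8 independent FMA dependency chains); the sketch just shows the shape of a loop that has work for all four FP ALU pipes.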

I'd guess that usually this is most useful when you have many short-vector dot products that aren't matmuls, so each cleanup loop needs to do some shuffling and adding. Out-of-order exec can overlap that work with FMAs.
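
For example, the cleanup at the end of each dot product is typically a horizontal sum like the following standard sketch (AVX intrinsics; the helper name is my own). Its shuffle and add uops don't need the FMA pipes, so out-of-order exec can run them alongside the FMAs of neighbouring dot products:

    #include <immintrin.h>

    // Horizontal sum of 8 floats: only shuffles and adds, no FMAs.
    static inline float hsum256_ps(__m256 v)
    {
        __m128 lo = _mm256_castps256_ps128(v);    // low 128 bits
        __m128 hi = _mm256_extractf128_ps(v, 1);  // high 128 bits
        lo = _mm_add_ps(lo, hi);                  // 4 partial sums
        __m128 shuf = _mm_movehdup_ps(lo);        // [1,1,3,3]
        __m128 sums = _mm_add_ps(lo, shuf);       // [0+1, ., 2+3, .]
        shuf = _mm_movehl_ps(shuf, sums);         // move 2+3 to element 0
        sums = _mm_add_ss(sums, shuf);            // (0+1) + (2+3)
        return _mm_cvtss_f32(sums);
    }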

And in code not using FMAs, you still have 2 mul + 2 add per clock, which is potentially quite useful for less well optimized code. (A lot of real-life code is not well optimized. How many times have you seen people give advice to not worry about performance?)

Also, with a mix of shuffles and other non-FP-math vector work, code can run on a good mix of ports and still leave some room for FP adds and multiplies.


AFAIK, Zen 4 can keep both FMA units and both FP-add units busy at the same time, so yes, 2 vector FMAs and 2 vector vaddps every cycle. That's 6 × vector-width FLOPs. It doesn't make sense to call it "4 adds and 2 muls" being issued (and dispatched to execution units) in the same cycle, though, since the CPU sees them as 2 FMA and 2 ADD operations, not 6 separate uops.

So when doing matrix multiplication, is Zen 4 effectively 32 FLOPs per cycle?

Yes, standard matmul is all FMAs, with little to no use for extra FP-add throughput.
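
For reference, a minimal matmul micro-kernel sketch (AVX2/FMA intrinsics; names and layout are made up, and real kernels use wider tiles with more accumulators) shows why: every FP operation is an FMA, so the FADD pipes sit idle and the ceiling is 2 FMAs × 8 fp32 lanes × 2 FLOPs = 32 FLOP/cycle.

    #include <immintrin.h>
    #include <stddef.h>

    // C[0..15] += A[k] * B[k][0..15]: broadcast one element of A and
    // reuse it across a row of B, keeping loads per FMA low.
    void matmul_microkernel(const float *A, const float *B, float *C,
                            size_t K, size_t ldb)
    {
        __m256 c0 = _mm256_loadu_ps(C);
        __m256 c1 = _mm256_loadu_ps(C + 8);
        for (size_t k = 0; k < K; ++k) {
            __m256 a = _mm256_broadcast_ss(A + k);  // 1 scalar load, reused
            c0 = _mm256_fmadd_ps(a, _mm256_loadu_ps(B + k * ldb), c0);
            c1 = _mm256_fmadd_ps(a, _mm256_loadu_ps(B + k * ldb + 8), c1);
        }
        _mm256_storeu_ps(C, c0);
        _mm256_storeu_ps(C + 8, c1);
    }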

Maybe some large-matrix multiplies using Strassen's algorithm would result in a workload with more than 1 addition per multiply, if you can arrange it such that the adding work overlaps with multiplying.

Or possibly run another thread on the same physical core doing the adding work, if you can arrange that without making things worse by competing for L1d cache footprint and bandwidth. HPC workloads sometimes scale negatively with SMT / hyperthreading for that reason, but partly that's because a well tuned single thread can use all the FP throughput from a single core. But if that's not the case on Zen 4, there's some theoretical room for gains.

However, that would require your FMA code to need less than 1 load per FMA, otherwise load/store uops will be the bottleneck if a submatrix-add thread is trying to load+load+add+store at the same time as a submatrix-multiply thread is doing 2 loads + 2 FMAs per clock.


For example, some workstation Xeons had 2 FPUs of 512 bits each, but no dedicated "add" units?

And yes, Intel CPUs with a second 512-bit FMA unit (like some Xeon Scalable processors) can sustain 2x 512-bit FMAs per clock if you optimize your code well enough (e.g. not bottlenecking on loads+stores or FMA latency), so that gives you 2x 16 single-precision FMAs = 64 FLOP/cycle.

Alder Lake / Sapphire Rapids re-added separate execution units for FP-add, but they're on the same ports as the FMA units, so the benefit is lower latency for things that bottleneck on the latency of separate vaddps / vaddpd, like in Haswell. (But unlike Haswell, there are two of them, so the throughput is still 2/clock.)

  • Would having more physical registers than 192 help more than the extra FADD units? I don't know if FPU logic is comparable to register space in terms of die area. Or would boosting division/sqrt, which is only 128 bits wide, already help in simulations? Or can the compiler emulate some integer adds with FADD, up to some limit? – huseyin tugrul buyukisik May 07 '23 at 17:12
  • @huseyintugrulbuyukisik: More physical registers allow a larger out-of-order execution window. It totally depends on your workload how easy the ILP is to find. div/sqrt throughput and latency only matter if your code uses them. So again, depends on your code. As for integer add, compilers would just use `vpaddd` or `vpaddq` SIMD-integer instructions, which on Intel CPUs can run on more ports than FMAs. – Peter Cordes May 07 '23 at 17:15
  • Is it possible to put such an FPGA module into a core, without losing performance, to allow transforming between an FADD and an FMUL depending on the thread's incoming instructions? – huseyin tugrul buyukisik May 07 '23 at 17:16
  • Doing SIMD integer division for small integers with subnormal floats would work but is unlikely for compilers to do in practice. It might also be slow if it takes an FP assist like subnormals often do, and flush-to-zero would kill it. So compilers won't do that for you, but you could try it yourself. Integer FMA with subnormal floats (just your integer as the mantissa) would require extra instructions to stuff the exponent of `1.0` into one of them, or something like that, and probably be slow. – Peter Cordes May 07 '23 at 17:17
  • @huseyintugrulbuyukisik: Almost certainly not possible for an FPGA to provide extra execution ports for standard instructions; that logic is hard-coded into the custom silicon of the scheduler and issue/rename/allocate stage, and even then the scheduler is still pretty power-intensive. Building it out of reprogrammable logic would probably be impossible at the clock speeds CPUs run at. – Peter Cordes May 07 '23 at 17:21
  • Do you think the AIDA64 GPGPU benchmark uses matrix multiplication for the "GFLOPS" part? – huseyin tugrul buyukisik May 07 '23 at 17:26
  • @huseyintugrulbuyukisik: I have no idea. – Peter Cordes May 07 '23 at 17:27
  • @huseyintugrulbuyukisik: A use-case for FMAs *and* FADD in the same loop: your answer on [Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?](https://stackoverflow.com/a/72319428). Should run nicely on Zen 4 if you unroll the FP increment that feeds the Horner's Rule FMAs. Or the `vpaddd` / `vcvtdq2pd` version: int-to-FP conversion runs on ports FP23 (same as `vaddpd`) on Zen 3 and 4 (https://uops.info/). – Peter Cordes May 08 '23 at 05:35
  • Yeah, it looks good on AMD. When I get a Ryzen I will benchmark it (on Windows 11 and Ubuntu 20). – huseyin tugrul buyukisik May 08 '23 at 06:49
  • GCC 11 gets 0.5 cycles per element while GCC 12 gets 0.252 cycles per element, which is better than Godbolt's shared server core (same AVX-512 flags). I used the same code from the last Godbolt sample. 5 FLOPs per element × 4 elements per cycle × 5.4 gigacycles per second = 108 GFLOP/s per core. Still slower than peak FLOPS. – huseyin tugrul buyukisik May 13 '23 at 08:59
  • Theoretical L1 write throughput is 256 bits per cycle, so it does only 1 AVX-512 iteration per 4 cycles (index vector written + output written = 1024 bits) => 16 elements => 1/4 cycle per element, so it's very close to this in practice. – huseyin tugrul buyukisik May 13 '23 at 09:14