AMD Zen 4 executes vector operations at a width of only 256 bits, and the following diagram from chipsandcheese's Zen 4 article shows 6 FP pipelines (4 ALU and 2 memory):
Each FMA does 1 multiplication and 1 add, while FADD does only 1 add. So does this mean that, in theory, it can do a total of 2 multiplications and 4 adds per cycle, i.e. 6 operations of 256 bits each?
Assuming all 4 adds and 2 muls can be issued in the same cycle, does this mean that 8 lanes (256 bits of 32-bit floats) x 6 = 48 elements are computed per cycle (or 48 GFLOPS per GHz per core)?
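The arithmetic above can be written out as a small sketch. It assumes the pipe counts read off the diagram (2 FMA-capable pipes and 2 add-only pipes, each 256 bits wide) and that every pipe issues on every cycle. `peak_fp32_flops_per_cycle` and its parameters are hypothetical names used only for illustration.

```c
#include <assert.h>

/* Hypothetical helper: theoretical peak FP32 operations per cycle for one core.
 *   fma_pipes:   pipes issuing one fused multiply-add per cycle (2 flops per lane)
 *   add_pipes:   pipes issuing one add per cycle (1 flop per lane)
 *   vector_bits: width of each pipe in bits
 * Assumes all pipes issue every cycle, which real code rarely achieves. */
int peak_fp32_flops_per_cycle(int fma_pipes, int add_pipes, int vector_bits)
{
    assert(fma_pipes >= 0 && add_pipes >= 0);
    assert(vector_bits > 0 && vector_bits % 32 == 0);

    int lanes = vector_bits / 32;                 /* 32-bit floats per vector */
    return lanes * (2 * fma_pipes + add_pipes);   /* FMA = mul + add */
}
```

With `(2, 2, 256)` this gives 48. With the add pipes idle, `(2, 0, 256)`, it gives 32, and two 512-bit FMA pipes, `(2, 0, 512)`, give 64.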
Assuming all operands are in registers, there should be enough bandwidth to feed the FPU (the L1 read bandwidth is listed as 2x256 bits per cycle, which is only enough for 8 flops per cycle, but registers must be much faster), but the FPU throughput itself isn't clearly shown.
How does this compare to Intel's 11th/12th/13th gen? For example, some workstation Xeons had 2 FPUs of 512 bits each but no dedicated "add" pipes? Is it fair to compare CPUs with different ratios of muls to adds on a flops-to-flops basis? It looks like AMD is better on:
d += a * b + c;
// or
d += a * b;
e += c;
while Intel is better on:
d = a * b + c;
// or
d += a * b;
per GFLOPS. Intel's FLOPS figure looks better for matrix multiplication and blending. AMD's FLOPS figure looks better for chained matrix add and multiplication, and for loops that combine a float accumulator with matrix multiplication.
So when doing matrix multiplication, is Zen 4 effectively 32 flops per cycle?
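To make the matrix-multiplication case concrete, here is a minimal sketch of the inner loop in question. `dot_fp32` is a hypothetical name, and the four scalar accumulators only stand in for vector registers. The point is that every iteration is a multiply feeding an add into an accumulator, which a compiler may fuse into one FMA, so the loop body has no standalone add for a separate add pipe to pick up. If that reading is right, the bound would come from the FMA pipes alone.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical kernel: one row-times-column dot product from a matrix multiply.
 * Four independent accumulators keep several multiply-accumulate chains in
 * flight; each "acc += a * b" may be contracted into a single FMA. */
float dot_fp32(const float *a, const float *b, size_t n)
{
    assert(a != NULL && b != NULL);

    float acc[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    size_t i = 0;

    /* Main loop: only multiply-accumulate work, no independent adds. */
    for (; i + 4 <= n; i += 4) {
        for (size_t k = 0; k < 4; ++k)
            acc[k] += a[i + k] * b[i + k];
    }

    /* Combine the partial sums once, outside the hot loop. */
    float sum = (acc[0] + acc[1]) + (acc[2] + acc[3]);

    /* Handle the leftover elements when n is not a multiple of 4. */
    for (; i < n; ++i)
        sum += a[i] * b[i];

    return sum;
}
```

Under these assumptions, 2 FMA pipes x 8 lanes x 2 flops = 32 flops per cycle would be the ceiling for this kind of loop, with the extra add pipes left unused.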