Is there a way to determine the number of AVX-512 FMA units at runtime using C++?
I already have code to determine whether a CPU is capable of AVX-512, but I cannot determine the number of FMA units.
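For reference, a minimal sketch of the capability-check part (not the asker's code; assumes GCC/Clang, where `__builtin_cpu_supports` exists — MSVC would need `__cpuidex` on leaf 7 instead):

```cpp
#include <cstdio>

int main() {
    // CPUID.(EAX=7,ECX=0):EBX bit 16 = AVX512F (the foundation instructions)
    if (__builtin_cpu_supports("avx512f"))
        std::printf("AVX-512F supported\n");
    else
        std::printf("AVX-512F not supported\n");
}
```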

- Why do you want to know that number? – gerum May 26 '22 at 14:43
- Number of units per physical core? Per NUMA node? Per socket? Per system? – Daniel Langr May 26 '22 at 14:49
- @gerum When I have a CPU that supports AVX2 and AVX-512, but has only one FMA unit, it does not make sense for my code to use the AVX-512 branch. In that case the AVX-512 branch would ideally be as fast as the AVX2 branch. – vydesaster May 26 '22 at 15:10
- @DanielLangr Per physical core. So the number would be 1 or 2 for current Intel Xeon CPUs... just an example. – vydesaster May 26 '22 at 15:11
- Have a list of CPUs and their number of AVX-512 FMA units, or run a benchmark at runtime. Hopefully there are better solutions... – Sebastian May 26 '22 at 18:06
- @vydesaster Sorry, but I don't follow. Why wouldn't it make sense to use the AVX-512 branch with one FMA unit, if it can operate on zmm registers? – Daniel Langr May 26 '22 at 20:38
- @DanielLangr: [Lowered CPU frequency](//stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency) might be one reason on some systems, and also shutting down the vector ALU on port 1 if there's other work (booleans, not just FMAs). OTOH, getting twice as much work done per instruction should still compensate. But if some of the problem didn't scale perfectly to wider vectors, it might need more shuffling in some steps. Also, 512-bit vectors are more sensitive to 64-byte alignment, vs. AVX2 performing well even without 32-byte alignment if you're bottlenecked on L2/L3. – Peter Cordes May 26 '22 at 20:45
- @DanielLangr: I find it plausible that on some real CPUs, they've tested and found better performance with the AVX2 version of their code. Especially if the compiler didn't do a perfect job with the AVX-512 code. But equally plausible to still get some speedup, from wider vectors letting OoO exec see farther ahead. (The same amount of work takes half the number of entries in the ROB, RS, store-buffer, and load-buffer.) If the penalty is mostly from lower CPU freq, that's a smaller factor on Ice Lake client (https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html) vs. Skylake-X. – Peter Cordes May 26 '22 at 20:56
- BTW, you can probably alternate benchmarking short loops with `rdtsc`/`lfence`, like 100 iterations of `times 3 vmulps zmm0, zmm1, zmm1` / `dec ecx/jnz`, then the same with YMM, until the times settle down to be the same as the last interval, and a factor of 2x or 1x. There are various warm-up effects, though, like ZMM throughput being throttled if the CPU is currently above the "L2 license", vs. a hard transition if above the L1 license. With `lfence`/`rdtsc` you're stopping OoO exec so you can use short timed regions. – Peter Cordes May 26 '22 at 21:40
2 Answers
The Intel® 64 and IA-32 Architectures Optimization Reference Manual, February 2022, Chapter 18.21 titled: Servers with a Single FMA Unit contains assembly language source code that identifies the number of AVX-512 FMA Units per core in an AVX-512 capable processor. See Example 18-25. This works by comparing the timing of two functions: one with FMA instructions and another with both FMA and shuffle instructions.
Intel's optimization manual can be downloaded from: https://www.intel.com/content/www/us/en/developer/articles/technical/intel-sdm.html#inpage-nav-8.
The source code from this manual is available at: https://github.com/intel/optimization-manual
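Not Intel's Example 18-25, but a much smaller C++ sketch of the comparison it describes (GCC/Clang inline asm, x86-64, compile with `-O2 -mavx512f`; untested here, and the iteration count, 6-chain layout, and 1.25 cutoff are illustrative choices, not from the manual):

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)

static inline uint64_t fenced_rdtsc() {
    _mm_lfence();              // order the TSC read against surrounding work
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

// 6 independent FMA dependency chains (zmm1..zmm6 += zmm0*zmm0, 4-cycle
// latency): enough to exceed 1 FMA/cycle if two 512-bit units exist.
static uint64_t fma_only(uint64_t n) {
    uint64_t t0 = fenced_rdtsc();
    asm volatile("1:\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm1\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm2\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm3\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm4\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm5\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm6\n\t"
        "dec %0\n\tjnz 1b"
        : "+r"(n) :: "xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","cc");
    return fenced_rdtsc() - t0;
}

// The same 6 FMAs plus 6 port-5-only shuffles (same source, write-only
// destinations, so a pure throughput load on port 5).
static uint64_t fma_plus_shuffle(uint64_t n) {
    uint64_t t0 = fenced_rdtsc();
    asm volatile("1:\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm1\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm7\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm2\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm8\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm3\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm7\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm4\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm8\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm5\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm7\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm6\n\t"
        "vpunpckhpd  %%zmm0, %%zmm0, %%zmm8\n\t"
        "dec %0\n\tjnz 1b"
        : "+r"(n)
        :: "xmm1","xmm2","xmm3","xmm4","xmm5","xmm6","xmm7","xmm8","cc");
    return fenced_rdtsc() - t0;
}

int main() {
    // Zero the source and accumulators with real 512-bit uops (0*0+0 = 0,
    // so the FMAs never produce denormals).
    asm volatile("vxorps %%xmm0, %%xmm0, %%xmm0\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm1\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm2\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm3\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm4\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm5\n\t"
        "vaddps %%zmm0, %%zmm0, %%zmm6"
        ::: "xmm0","xmm1","xmm2","xmm3","xmm4","xmm5","xmm6");
    const uint64_t n = 100000;
    fma_only(n); fma_plus_shuffle(n);   // crude warm-up pass
    double r = (double)fma_plus_shuffle(n) / (double)fma_only(n);
    // One FMA unit: shuffles run on an otherwise-idle port 5, ratio ~1.
    // Two FMA units: shuffles steal port 5 from the FMAs, ratio ~1.5 with
    // these 6 chains (6 vs. 4 cycles per iteration).
    std::printf("mixed/fma-only ratio = %.2f -> guess: %d FMA unit(s)\n",
                r, r > 1.25 ? 2 : 1);
}
```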


There isn't a CPUID feature bit for this. Your options include a microbenchmark at startup, or checking the CPUID vendor string against a table. (If building the table as a cache of microbenchmark results, make sure the microbenchmark is careful to avoid false negatives and false positives, more so than you'd need to be for one run at startup.)
If you have access to HW perf counters, `perf stat --all-user -e uops_dispatched_port.port_0,uops_dispatched_port.port_5` on a loop that does mostly FMA instructions could work: existing CPUs with a second 512-bit FMA unit have it on port 5, so if you see counts for that port instead of all on port 0, you have two FMA units. You might use a static executable that just contains a `vfma...` / `dec`/`jne` loop for 1000 iterations: only your instructions in user-space. (Making it easy to use `perf stat`.)
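A hedged sketch of such a test binary (GCC/Clang; the file name, larger iteration count, and the four independent FMA dependency chains are my choices, not from the answer — a hand-written minimal static binary could get away with far fewer iterations):

```cpp
// Build and run, e.g.:
//   g++ -O2 -static fma_probe.cpp -o fma_probe
//   perf stat --all-user -e uops_dispatched_port.port_0,uops_dispatched_port.port_5 ./fma_probe
#include <cstdint>

int main() {
    uint64_t n = 100'000'000;   // enough uops to swamp CRT startup noise
    asm volatile(
        "vxorps %%xmm0, %%xmm0, %%xmm0\n\t"  // zero source + accumulators:
        "vxorps %%xmm1, %%xmm1, %%xmm1\n\t"  // 0*0+0 never makes denormals
        "vxorps %%xmm2, %%xmm2, %%xmm2\n\t"
        "vxorps %%xmm3, %%xmm3, %%xmm3\n\t"
        "vxorps %%xmm4, %%xmm4, %%xmm4\n\t"
        "1:\n\t"                             // 4 independent FMA dep chains
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm1\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm2\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm3\n\t"
        "vfmadd231ps %%zmm0, %%zmm0, %%zmm4\n\t"
        "dec %0\n\tjnz 1b"
        : "+r"(n) :: "xmm0", "xmm1", "xmm2", "xmm3", "xmm4", "cc");
    // Two FMA units: port_5 counts comparable to port_0.
    // One FMA unit: nearly all FMA uops on port_0, port_5 near zero.
}
```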
Intel's version seems like overkill, with some clunky choices
I think you can microbenchmark it without wasting so many cycles waiting for warm-up, by alternating two benchmark loops, YMM and ZMM, if you're careful about it. Intel's version ([github source](https://github.com/intel/optimization-manual) from their optimization manual) seems like huge overkill, with so many registers and a bunch of useless constants when they could just use FMA on 0.0, and a shuffle with no control vector, or `vpand` or whatever.
It also runs a long warm-up loop, maybe taking multiple milliseconds when you hopefully only need microseconds. I don't have hardware to test on, so I haven't fleshed out the code examples in my suggestion.
Even if you want to use Intel's suggestion more or less unchanged, you can still make it waste less space in your binary by not using so much constant data.
Shuffles like `vmovhlps xmm0, xmm0, xmm0` or `vpunpckhpd x,x,x` run on port 5 only, even on Ice Lake and later. ICL/ICX can run some shuffles like `pshufd` or `unpckhqdq` on port 1 as well, but not the ZMM versions.
Picking a 1-cycle-latency shuffle is good (so something in-lane, not lane-crossing like `vpermd`), although you don't even want to create a loop-carried dependency with it, just throughput, i.e. shuffle the same source into multiple destination regs.
Picking something that definitely can't compete with the FMA unit on port 0 is good, so a shuffle is better than `vpand`. Probably more future-proof to pick one that can't run on port 1. On current CPUs, all the vector ALUs are shut down when any 512-bit uops are in flight (at least that's the case on Skylake-X). But one could imagine some future CPU where `vpshufd xmm` or `ymm` runs on port 1 in the same cycle as `vfma...ps zmm` instructions run on ports 0 and 5. But it's unlikely that the extra shuffle unit on port 1 will get widened to 512-bit soon, so perhaps `vpunpckhpd zmm30, zmm0, zmm0` is a good choice.
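Fleshing out that loop body (my untested sketch, as cautioned above; GCC/Clang, compile with `-O2 -mavx512f` so the high register numbers are accepted): independent `vmulps` stands in for the FMA-unit load — the Test methods section below notes its throughput is correlated with `vfma...` on CPUs so far — with an all-zero input, no constant data, and `vpunpckhpd` shuffling the same source into different write-only destinations:

```cpp
#include <cstdint>

int main() {
    uint64_t n = 1000;
    asm volatile(
        "vxorps %%xmm1, %%xmm1, %%xmm1\n\t"      // zeroed source: MUL/FMA on
        "1:\n\t"                                 // 0.0, no constants needed
        "vmulps %%zmm1, %%zmm1, %%zmm0\n\t"      // 3 independent multiplies
        "vmulps %%zmm1, %%zmm1, %%zmm2\n\t"      // to load the FMA port(s)
        "vmulps %%zmm1, %%zmm1, %%zmm3\n\t"
        "vpunpckhpd %%zmm1, %%zmm1, %%zmm30\n\t" // port-5-only shuffles: same
        "vpunpckhpd %%zmm1, %%zmm1, %%zmm31\n\t" // source, different dests,
        "vpunpckhpd %%zmm1, %%zmm1, %%zmm29\n\t" // no loop-carried dependency
        "dec %0\n\tjnz 1b"
        : "+r"(n)
        :: "xmm0", "xmm1", "xmm2", "xmm3", "xmm29", "xmm30", "xmm31", "cc");
    // Timed against the same loop without the shuffles: roughly equal times
    // hint at one FMA unit (the shuffles fit on an otherwise-idle port 5);
    // about 2x slower hints at two (the shuffles steal port 5 cycles).
}
```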
With better design, you can hopefully avoid false results even without long warm-up
Confounding factors include the soft throttling of "heavy" instructions when the current clock speed or voltage is outside the requirements for running them at high throughput. (See also [SIMD instructions lowering CPU frequency](https://stackoverflow.com/questions/56852812/simd-instructions-lowering-cpu-frequency).)
But waiting for alternating benchmarks to settle to nearly a 1:1 or 2:1 ratio should work, if you're careful not to be thrown off by clock-speed changes in the middle of one (e.g. check each run against the previous run of the same test, as well as checking the ratio against the other test).
Ideally you could run this early enough in program startup that this core might still be at idle clock speed, although depending on what started the process, it might be at max turbo, above what it's willing to run 512-bit instructions with.
Intel's version runs all of one test, then all of the other, just assuming that the warm-up is sufficient and that scheduling competition from other loads didn't distort either run.
Test methods
You could do a quick throughput test on startup, timing with `rdtsc`. `vmulps` is easy to make independent since it only has 2 inputs, and is correlated with `vfma...` throughput on all CPUs so far. (Unlike `vaddps zmm`, which is 0.5c throughput on Alder Lake P-cores (with AVX-512-enabled microcode) even though they only have 1c mul/fma: https://uops.info/. Presumably Sapphire Rapids will be the same for versions with 1x 512-bit FMA unit.)
It might be sufficient to do these steps in order, timing each step with `lfence`/`rdtsc`/`lfence` so you can use short benchmark intervals without having out-of-order exec read the TSC while there are still un-executed parts. A sketch of these steps follows the list.

1. `vaddps zmm1, zmm1, zmm1` to make sure ZMM1 was written with a uop of the appropriate type, to avoid weird latency effects.
2. `times 3 vmulps zmm0, zmm1, zmm1` in a loop for maybe 100 iterations (thus a 4-uop loop, since `dec ecx`/`jnz` will macro-fuse: no front-end bottleneck on Skylake-X). If you want, you could write 3 different ZMM registers, but writing ZMM0 3 times is fine.
3. `times 3 vmulps ymm0, ymm1, ymm1` in a loop for maybe 100 iterations.
4. `times 3 vmulps zmm0, zmm1, zmm1` in a loop again.
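A minimal sketch of these steps (untested, as noted above; GCC/Clang inline asm, compile with `-O2`; the 10% settle check and 1.5 cutoff are illustrative numbers):

```cpp
#include <cstdint>
#include <cstdio>
#include <x86intrin.h>  // __rdtsc, _mm_lfence (GCC/Clang)

static inline uint64_t fenced_rdtsc() {
    _mm_lfence();               // wait for earlier instructions to execute
    uint64_t t = __rdtsc();
    _mm_lfence();               // don't let later work start before the read
    return t;
}

// times 3 vmulps zmm0, zmm1, zmm1 / dec/jnz: 4 uops per iteration with
// macro-fusion, so no front-end bottleneck on Skylake-X.
static uint64_t zmm_loop(uint64_t n) {
    // Step 1, outside the timed region: write ZMM1 with a real 512-bit uop,
    // zeroed so the multiplies never produce denormals.
    asm volatile("vxorps %%xmm1, %%xmm1, %%xmm1\n\t"
                 "vaddps %%zmm1, %%zmm1, %%zmm1" ::: "xmm1");
    uint64_t t0 = fenced_rdtsc();
    asm volatile("1:\n\t"
                 "vmulps %%zmm1, %%zmm1, %%zmm0\n\t"
                 "vmulps %%zmm1, %%zmm1, %%zmm0\n\t"
                 "vmulps %%zmm1, %%zmm1, %%zmm0\n\t"
                 "dec %0\n\tjnz 1b"
                 : "+r"(n) :: "xmm0", "cc");
    return fenced_rdtsc() - t0;
}

static uint64_t ymm_loop(uint64_t n) {   // step 3: the YMM version
    asm volatile("vxorps %%xmm1, %%xmm1, %%xmm1\n\t"
                 "vaddps %%ymm1, %%ymm1, %%ymm1" ::: "xmm1");
    uint64_t t0 = fenced_rdtsc();
    asm volatile("1:\n\t"
                 "vmulps %%ymm1, %%ymm1, %%ymm0\n\t"
                 "vmulps %%ymm1, %%ymm1, %%ymm0\n\t"
                 "vmulps %%ymm1, %%ymm1, %%ymm0\n\t"
                 "dec %0\n\tjnz 1b"
                 : "+r"(n) :: "xmm0", "cc");
    return fenced_rdtsc() - t0;
}

int main() {
    const uint64_t n = 100;                  // "maybe 100 iterations"
    uint64_t z1 = zmm_loop(n);               // step 2
    uint64_t y  = ymm_loop(n);               // step 3
    uint64_t z2 = zmm_loop(n);               // step 4
    // Re-run until consecutive ZMM timings agree within ~10% (warm-up,
    // licence transitions, or a clock-speed change mid-test).
    for (int i = 0; i < 10 && (z1 > z2 + z2 / 10 || z2 > z1 + z1 / 10); ++i) {
        z1 = z2;
        y  = ymm_loop(n);
        z2 = zmm_loop(n);
    }
    // Compare the ZMM:YMM *ratio*, not absolute TSC counts: the TSC ticks at
    // a reference frequency unrelated to the current core clock.
    double ratio = (double)z2 / (double)y;
    std::printf("zmm/ymm time ratio = %.2f -> guess: %d FMA unit(s)\n",
                ratio, ratio < 1.5 ? 2 : 1);
}
```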
If the first and last ZMM times match within maybe 10%, you're done, and can assume that the CPU frequency was warmed up before the first run, but only to the AVX-512 "heavy" turbo limit or lower.
But that likely won't be the case unless you were able to do some useful startup work before this using "heavy" AVX-512 instructions. That would be the ideal case, taking at worst a small penalty during work your program already needs to do, before the benchmark runs.
The reference frequency might be significantly different from the actual core clock frequency the CPU can sustain, so unfortunately you can't just repeat this until you see close to 1 or 2 MULs per RDTSC count. E.g. an i5-1035 Ice Lake client chip: TSC = 1.5 GHz, base = 1.1 GHz as reported by BeeOnRope (max turbo 3.7 GHz). His results are 0.1 GHz higher than what Intel says is the "base" and max turbo, but I assume the point still stands that AVX-512 heavy instructions don't tend to make it run anywhere near the TSC frequency. In a VM environment after migration from different hardware, it's also possible for RDTSC to be transparently scaled and offset (with HW support).
No "client" CPUs have 2x 512-bit FMA units (yet)
In "client" CPUs, so far only some Skylake-X CPUs have 2 FMA units. (At least the "client" Ice Lake, Rocket Lake, and Alder Lake CPUs tested by https://uops.info/ only have 1c throughput FMA for 512-bit ZMM.)
But (some?) Ice Lake server CPUs have 0.5c FMA ZMM throughput, so Intel hasn't given up on it. Including for example the Xeon Gold 6330 (IceLake-SP) that instlatx64 tested with 0.5c `VFMADD132PS zmm, zmm, zmm` throughput, same as xmm/ymm.
