What is the maximum theoretical peak of GFLOPS in single and double precision for a Xeon Silver 4210 with 40 CPU cores?

Question

I have an Intel Xeon Silver 4210 @ 2.20 GHz with 40 cores spread across 2 NUMA nodes. I need to know the maximum theoretical GFLOPS of this architecture for single- and double-precision arithmetic.

The values I have found around the web differ widely, so I don't know which one to trust. The formulas I have found also differ and lead to different results: some give 1760 GFLOPS for single precision and 352 for double precision, others 2816 GFLOPS for double precision.

Moreover, Intel reports a value of 153.6 GFLOPS in this document: https://www.intel.com/content/dam/support/us/en/documents/processors/APP-for-Intel-Xeon-Processors.pdf

What should I expect the correct value to be?

It doesn't have an integrated GPU, just x86-64 cores, so double-precision FLOPS are half single-precision, because it's some number of FMA instructions per clock cycle on 64-byte vectors. And those will have the same per-instruction throughput for vectors of 16 singles or 8 doubles. Probably 1 vector per core clock since I don't think Xeon Silver CPUs have a 2nd 512-bit FMA unit on port 5. — Peter Cordes, Apr 26 '23 at 11:23
https://ark.intel.com/content/www/us/en/ark/products/193384/intel-xeon-silver-4210-processor-13-75m-cache-2-20-ghz.html says "# of AVX-512 FMA Units: 1". So same throughput for AVX-512 as for 256-bit vectors (FMA3). 16 single-precision FMAs per clock per core, thus 32 FLOPs per cycle per core, at whatever sustained all-cores turbo it can manage. (Probably just 2.2GHz since all cores running FMAs at max throughput is the max power workload!) Note that it's physical cores that matter for peak throughput, not logical; the execution units are competitively shared between threads. — Peter Cordes, Apr 26 '23 at 11:30
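The per-core figure in the comment above can be reproduced with a short calculation. This is only a minimal sketch: the function name `flops_per_cycle` and its parameters are made up for illustration, and it assumes each FMA unit issues one full-width vector FMA every cycle, which is an idealized upper bound. The comparison with two 256-bit FMA units is likewise an assumption about how the 256-bit case is counted.

```python
def flops_per_cycle(vector_bits: int, element_bits: int, fma_units: int) -> int:
    """Peak FLOPs per cycle per core, assuming every FMA unit issues
    one full-width vector FMA each cycle (idealized upper bound)."""
    lanes = vector_bits // element_bits  # elements per vector register
    return lanes * 2 * fma_units         # one FMA counts as 2 FLOPs (multiply + add)

# One 512-bit FMA unit, as the ARK page quoted above lists for this CPU.
sp = flops_per_cycle(512, 32, 1)      # single precision: 16 lanes
dp = flops_per_cycle(512, 64, 1)      # double precision: 8 lanes

# Assumption: two 256-bit FMA units, for comparison with the 256-bit case.
sp_256 = flops_per_cycle(256, 32, 2)

print(f"SP: {sp} FLOPs/cycle/core, DP: {dp} FLOPs/cycle/core, SP via 256-bit: {sp_256}")
```

Double precision comes out at half of single precision simply because a 64-byte vector holds half as many 64-bit elements.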
The Intel thing is for double precision: https://en.wikipedia.org/wiki/Adjusted_Peak_Performance . If you remove the 0.3 weight, it would be 512. Anyway theoretical max GFLOPS don't seem like a very useful number. Why do you need to know? And do you need a low number for export (take Intel's number), high number for bragging (take the high one) or an accurate number (measure it yourself)? — teapot418, Apr 26 '23 at 11:30
In this case, 2816 single-precision GFLOP/s is correct for a quad-socket machine with 40 physical cores. 1408 DP GFLOP/s. Half those numbers if you actually only have 2x Xeon 4210 for 20 physical cores. If you don't care about computing anything useful, it's fairly straightforward to write loops in assembly language that sustain this throughput with *just* FMAs, no loads, stores, or other instructions except a loop branch. As @teapot418 says, if you care about performance on any real problem, compile with `clang -O3 -march=native` and benchmark. — Peter Cordes, Apr 26 '23 at 11:37
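The numbers in the comment above follow from multiplying physical cores by clock frequency by FLOPs per cycle per core. A rough sketch of that arithmetic, assuming the 2.2 GHz base clock is what all cores sustain under full FMA load and using 32 (single) and 16 (double) FLOPs per cycle per core; `peak_gflops` is a hypothetical helper name, not part of any tool:

```python
def peak_gflops(physical_cores: int, clock_ghz: float, flops_per_cycle: int) -> float:
    """Theoretical peak in GFLOP/s: cores x GHz x FLOPs per cycle per core."""
    return physical_cores * clock_ghz * flops_per_cycle

CLOCK_GHZ = 2.2       # assumption: base clock sustained on all cores
SP_PER_CYCLE = 32     # 16 single-precision lanes x 2 FLOPs per FMA
DP_PER_CYCLE = 16     # 8 double-precision lanes x 2 FLOPs per FMA

# 20 physical cores (2 sockets) versus 40 physical cores (4 sockets).
for cores in (20, 40):
    sp = peak_gflops(cores, CLOCK_GHZ, SP_PER_CYCLE)
    dp = peak_gflops(cores, CLOCK_GHZ, DP_PER_CYCLE)
    print(f"{cores} physical cores: {sp:.0f} SP GFLOP/s, {dp:.0f} DP GFLOP/s")
```

Logical CPUs from hyper-threading are deliberately left out of the core count, since both threads of a core share the same FMA unit.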
@teapot418 I'm working on a project for a University course and I wanted to compare the GFLOPS I have obtained with the theoretical one for completeness. Just a quirk. — lilith, Apr 26 '23 at 13:57
@PeterCordes yeah, sorry, 20 physical cores - 40 on-line CPUs. Thanks for your indications. — lilith, Apr 26 '23 at 14:10
