FLOPS per cycle for sandy-bridge and haswell SSE2/AVX/AVX2

Question

I'm confused about how many FLOPs per cycle per core Sandy Bridge and Haswell can do. As I understand it, it should be 4 FLOPs per cycle per core for SSE and 8 FLOPs per cycle per core for AVX/AVX2.

This seems to be verified here, How do I achieve the theoretical maximum of 4 FLOPs per cycle?, and here, Sandy-Bridge CPU specification.

However the link below seems to indicate that Sandy-bridge can do 16 flops per cycle per core and Haswell 32 flops per cycle per core http://www.extremetech.com/computing/136219-intels-haswell-is-an-unprecedented-threat-to-nvidia-amd.

Can someone explain this to me?

Edit: I understand now why I was confused. I thought the term FLOP only referred to single-precision floating point (SP). I see now that the test at How do I achieve the theoretical maximum of 4 FLOPs per cycle? is actually on double-precision floating point (DP), so it achieves 4 DP FLOPs/cycle for SSE and 8 DP FLOPs/cycle for AVX. It would be interesting to redo these tests on SP.

In response to your edit: The numbers would be exactly double the DP numbers. That's because the latencies and throughputs are identical for the SP and DP versions of the SIMD instructions. (In some cases, the SP ones have even lower latency.) — Mysticial, Mar 27 '13 at 13:29
I have converted the code to use SP as best as I understand and compiled it with Visual Studio 2012. However, I don't see a difference in speed and the sum reports an error so likely I need to change some more code. I'll have to get back to this. — , Mar 27 '13 at 14:25
You need to double the numbers since the counter is assuming DP. (Change: `48 * 1000 * iterations * tds * 2` to `48 * 1000 * iterations * tds * 4`) Furthermore, you need to change the renormalization mask to work on SP: `uint64 iMASK = 0x800fffffffffffffull;` — Mysticial, Mar 27 '13 at 14:31
4 due to four SP floats per SSE register. Thanks again. I also changed the renormalization mask to unsigned int iMASK = 0x80fffffu. Now it works and I get twice like you said. — , Mar 27 '13 at 15:08

score 123 · Accepted Answer · edited Jun 06 '19 at 03:49

Here are theoretical max FLOPs counts (per core) for a number of recent processor microarchitectures, with an explanation of how to achieve them.

In general, to calculate this, look up the throughput of the FMA instruction(s) (e.g. on https://agner.org/optimize/ or in any other microbenchmark result) and multiply:
(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA).
Note that achieving this in real code requires very careful tuning (like loop unrolling), and near-zero cache misses, and no bottlenecks on anything else. Modern CPUs have such high FMA throughput that there isn't much room for other instructions to store the results, or to feed them with input. e.g. 2 SIMD loads per clock is also the limit for most x86 CPUs, so a dot product will bottleneck on 2 loads per 1 FMA. A carefully-tuned dense matrix multiply can come close to achieving these numbers, though.
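
As a rough sketch of the formula above, the multiplication can be written out as a small helper. The function name and parameters are illustrative, and the unit counts and vector widths passed in are the ones listed in this answer, not something measured here:

```python
def peak_flops_per_cycle(fmas_per_clock, vector_bits, element_bits):
    """(FMAs per clock) * (vector elements / instruction) * 2 (FLOPs / FMA)."""
    elements_per_vector = vector_bits // element_bits
    # One FMA counts as two FLOPs: a multiply and an add.
    return fmas_per_clock * elements_per_vector * 2


# Haswell-style core: two 256-bit FMA units.
haswell_sp = peak_flops_per_cycle(2, 256, 32)  # 32-bit floats
haswell_dp = peak_flops_per_cycle(2, 256, 64)  # 64-bit doubles

# Skylake-X-style core with two 512-bit FMA units.
skx_sp = peak_flops_per_cycle(2, 512, 32)

print(haswell_sp, haswell_dp, skx_sp)
```

This only reproduces the theoretical ceiling; it says nothing about whether loads, stores, or front-end limits let real code reach it.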

If your workload includes any ADD/SUB or MUL that can't be contracted into FMAs, the theoretical max numbers aren't an appropriate goal for your workload. Haswell/Broadwell have 2-per-clock SIMD FP multiply (on the FMA units), but only 1 per clock SIMD FP add (on a separate vector FP add unit with lower latency). Skylake dropped the separate SIMD FP adder, running add/mul/fma the same at 4c latency, 2-per-clock throughput, for any vector width.

Intel

Note that Celeron/Pentium versions of recent microarchitectures don't support AVX or FMA instructions, only SSE4.2.

Intel Core 2 and Nehalem (SSE/SSE2):

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

Intel Sandy Bridge/Ivy Bridge (AVX1):

8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication

Intel Haswell/Broadwell/Skylake/Kaby Lake/Coffee Lake/... (AVX+FMA3):

16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
(Using 256-bit vector instructions can reduce max turbo clock speed on some CPUs.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 1 FMA unit: some Xeon Bronze/Silver

16 DP FLOPs/cycle: one 8-wide FMA (fused multiply-add) instruction
32 SP FLOPs/cycle: one 16-wide FMA (fused multiply-add) instruction
Same computation throughput as with narrower 256-bit instructions, but speedups can still be possible with AVX512 for wider loads/stores, a few vector operations that don't run on the FMA units like bitwise operations, and wider shuffles.
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed, so "cycles" isn't a constant in your performance calculations.)

Intel Skylake-X/Skylake-EP/Cascade Lake/etc (AVX512F) with 2 FMA units: Xeon Gold/Platinum, and i7/i9 high-end desktop (HEDT) chips.

32 DP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
64 SP FLOPs/cycle: two 16-wide FMA (fused multiply-add) instructions
(Having 512-bit vector instructions in flight shuts down the vector ALU on port 1. Also reduces the max turbo clock speed.)

Future: Intel Cooper Lake (successor to Cascade Lake) is expected to introduce Brain Float, a float16 format for neural-network workloads, with support for actual SIMD computation on it, unlike the current F16C extension that only has support for load/store with conversion to float32. This should double the FLOP/cycle throughput vs. single-precision on the same hardware.

Current Intel chips only have actual computation directly on standard float16 in the iGPU.

AMD

AMD K10:

4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication

AMD Bulldozer/Piledriver/Steamroller/Excavator, per module (two cores):

8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA

AMD Ryzen

8 DP FLOPs/cycle: 4-wide FMA
16 SP FLOPs/cycle: 8-wide FMA

x86 low power

Intel Atom (Bonnell/45nm, Saltwell/32nm, Silvermont/22nm):

1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
6 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication every other cycle

AMD Bobcat:

1.5 DP FLOPs/cycle: scalar SSE2 addition + scalar SSE2 multiplication every other cycle
4 SP FLOPs/cycle: 4-wide SSE addition every other cycle + 4-wide SSE multiplication every other cycle

AMD Jaguar:

3 DP FLOPs/cycle: 4-wide AVX addition every other cycle + 4-wide AVX multiplication in four cycles
8 SP FLOPs/cycle: 8-wide AVX addition every other cycle + 8-wide AVX multiplication every other cycle
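
The fractional figures in this section can be checked by summing (vector width) * (instructions issued per cycle) over the add and multiply pipes. A minimal sketch, assuming an entry like "addition + multiplication every other cycle" means the add issues every cycle and only the multiply issues every other cycle (the helper name is made up for illustration):

```python
from fractions import Fraction


def flops_per_cycle(units):
    """Sum width * issue rate over (width, instructions_per_cycle) pairs."""
    return sum(width * Fraction(rate) for width, rate in units)


half = Fraction(1, 2)
quarter = Fraction(1, 4)

# Atom DP: scalar add every cycle + scalar multiply every other cycle.
atom_dp = flops_per_cycle([(1, 1), (1, half)])
# Atom SP: 4-wide add every cycle + 4-wide multiply every other cycle.
atom_sp = flops_per_cycle([(4, 1), (4, half)])
# Jaguar DP: 4-wide add every other cycle + 4-wide multiply every four cycles.
jaguar_dp = flops_per_cycle([(4, half), (4, quarter)])

print(float(atom_dp), float(atom_sp), float(jaguar_dp))
```

Under that reading the sums come out to the 1.5, 6, and 3 FLOPs/cycle quoted in the entries.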

ARM

ARM Cortex-A9:

1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle
4 SP FLOPs/cycle: 4-wide NEON addition every other cycle + 4-wide NEON multiplication every other cycle

ARM Cortex-A15:

2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

Qualcomm Krait:

2 DP FLOPs/cycle: scalar FMA or scalar multiply-add
8 SP FLOPs/cycle: 4-wide NEONv2 FMA or 4-wide NEON multiply-add

IBM POWER

IBM PowerPC A2 (Blue Gene/Q), per core:

8 DP FLOPs/cycle: 4-wide QPX FMA every cycle
SP elements are extended to DP and processed on the same units

IBM PowerPC A2 (Blue Gene/Q), per thread:

4 DP FLOPs/cycle: 4-wide QPX FMA every other cycle
SP elements are extended to DP and processed on the same units

Intel MIC / Xeon Phi

Intel Xeon Phi (Knights Corner), per core:

16 DP FLOPs/cycle: 8-wide FMA every cycle
32 SP FLOPs/cycle: 16-wide FMA every cycle

Intel Xeon Phi (Knights Corner), per thread:

8 DP FLOPs/cycle: 8-wide FMA every other cycle
16 SP FLOPs/cycle: 16-wide FMA every other cycle

Intel Xeon Phi (Knights Landing), per core:

32 DP FLOPs/cycle: two 8-wide FMA every cycle
64 SP FLOPs/cycle: two 16-wide FMA every cycle

The reason why there are per-thread and per-core data for IBM Blue Gene/Q and Intel Xeon Phi (Knights Corner) is that these cores have a higher instruction issue rate when running more than one thread per core.
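
Since all of the figures above are per cycle, comparing two chips also needs the clock speed. A minimal sketch of that conversion follows; the function name is illustrative and the clock speeds are example values only, not sustained frequencies measured under AVX load (which may be lower than the nominal clock, as noted above):

```python
def peak_gflops(flops_per_cycle, clock_ghz, cores=1):
    """Theoretical peak GFLOP/s = FLOPs/cycle * GHz * number of cores."""
    return flops_per_cycle * clock_ghz * cores


# Per-core SP peak: 16 FLOPs/cycle (Sandy Bridge/Ivy Bridge) at an example 3.5 GHz
# versus 32 FLOPs/cycle (Haswell) at an example 2.6 GHz.
older_core = peak_gflops(16, 3.5)
newer_core = peak_gflops(32, 2.6)

print(older_core, newer_core)
```

With these example clocks the FMA-capable core has the higher per-core ceiling despite the lower frequency. Total chip throughput also depends on the core count, which is what the `cores` argument is for.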

edited Jun 06 '19 at 03:49

Peter Cordes

answered Mar 27 '13 at 11:31

Marat Dukhan

Thanks! I see now that the link http://stackoverflow.com/questions/8389648/how-to-achieve-4-flops-per-cycle is testing DP FLOPs/cycle and not SP FLOPs/cycle. I wonder, if I changed the code to be SP (_ps instead of _pd), whether I would get 16 SP FLOPs/cycle on my Sandy Bridge system? For Nvidia Fermi I read at http://en.wikipedia.org/wiki/GeForce_500_Series that "Each SP can fulfil up to two single precision operations FMA per clock". I guess that's similar to Haswell, which can do 2 FMA instructions/cycle. – Mar 27 '13 at 13:22
If you change `_pd` to `_ps` you will double the performance. Whether you will get 16 SP FLOPs/cycle depends on the other parts of your code (e.g. how many memory loads it performs). – Marat Dukhan Mar 27 '13 at 14:03
Is there a reason you wrote SSE2 for DP and only SSE for SP? I thought SSE2 and SSE were the same for floating point and the main difference was that SSE2 added integer support. – Mar 27 '13 at 15:27
DP support was added in SSE2 as well – Marat Dukhan Mar 27 '13 at 15:30
What about AVX2 in Intel MIC (Xeon PHI)? – osgx Oct 20 '13 at 03:25
@osgx Added Xeon Phi. However, it does not support AVX2. – Marat Dukhan Oct 20 '13 at 03:57
@MaratDukhan: Excellent list, thank you. Could you add Cortex-A8? (and M0/M3/M4?) – Alex I Nov 24 '13 at 10:20
@Alex I do not have details for these Cortex processors – Marat Dukhan Nov 26 '13 at 19:04
Cortex-M0 and M3 don’t even have FPUs, so they do zero FLOPs/cycle. Even on M4 the FPU is optional. Cortex-A8 can do 2 SP FLOPs/cycle with NEON. Double-precision … well, VFP *isn't pipelined* on A8, so it’s about 1/8 DP FLOPs/cycle. – Stephen Canon Dec 05 '13 at 20:53
It's worth noting that the AMD Bulldozer/Piledriver/Steamroller processors use a shared FP unit (two cores per FP unit). Thus, the Intel CPUs offer twice the performance of the AMD CPUs, because each Intel core has its own FP unit. – i_grok Apr 28 '14 at 15:34
Are the Bulldozer/Piledriver/Steamroller numbers for one core or for one module? – netvope May 02 '14 at 00:19
@netvope They are per-module – Marat Dukhan May 03 '14 at 02:23
Have you got a reference for this data or did you produce it yourself? – fommil Nov 22 '14 at 12:24
Data is from my tests – Marat Dukhan Nov 25 '14 at 00:20
For BGQ, you should add the "per core" caveat just like Xeon Phi. A single hardware thread cannot issue FMA on consecutive cycles; therefore 2+ threads per core are required to achieve the peak flop rate of 8 per cycle. – Jeff Hammond Apr 07 '15 at 04:48
How does CortexA7 compare with CortexA9? I'm interested in the raspberry pi2. – Z boson Jun 17 '15 at 12:56
For Cortex-A9 you write "1.5 DP FLOPs/cycle: scalar addition + scalar multiplication every other cycle" How does "scalar addition + scalar multiplication every other cycle" equal 1.5? Shouldn't it be 1.0? – Z boson Jun 24 '15 at 15:47
Do you mean one scalar mult every other cycle and one addition every cycle? That would be 1.5 DP FLOPs/cycle. – Z boson Jun 24 '15 at 15:58
@Zboson In 2 cycles Cortex-A9 can do one FMLA (2 FLOPs) + one FADD (1 FLOP) – Marat Dukhan Jun 25 '15 at 05:15
@DylRicho I don't have access to those platforms – Marat Dukhan Sep 17 '15 at 13:07
@MaratDukhan Okay, thank you anyway. May I ask what AMD FX processor (and/or which APU) you tested to get those figures? – DylRicho Sep 17 '15 at 14:13
AMD FX-6300, AMD A10-7850K, and some Bulldozer-based Opteron (don't remember the model and don't have access to it anymore) – Marat Dukhan Sep 17 '15 at 19:27
The last entry (Intel MIC (Xeon Phi), per thread) is odd, since it leads to ~2TFlop/s for a 5011P, which is twice Intel's [advertised value](http://www.intel.com/content/www/us/en/high-performance-computing/high-performance-xeon-phi-coprocessor-brief.html). Perhaps it needs the caveat "with up to two threads per core active"? – ex-bart Nov 23 '15 at 01:38
@ex-bart You interpret it incorrectly. The right interpretation is "what performance I can get if I run single-threaded code on Xeon Phi". If you run more than 1 thread, you are limited by per-core performance. – Marat Dukhan Nov 23 '15 at 05:20
@MaratDukhan Ah, I see, thanks. And the Sandy Bridge/Ivy Bridge don't state it explicitly but they are both "per core" and "per thread", right? I.e. you could keep one core's floating point units busy with just one thread, given the right benchmark? – ex-bart Nov 23 '15 at 15:01
@ex-bart Yes, other processors (except PowerPC A2) have the same per-core and per-thread peak – Marat Dukhan Nov 24 '15 at 20:46
It would be helpful with some references or explanation of how to obtain this information. – May 07 '16 at 06:23
Does this mean an Intel i7-3770K CPU @ 3.50GHz (Intel Sandy Bridge) is worse than an Intel i5-4210M CPU @ 2.60GHz (Haswell architecture) in terms of FLOPs? It does not seem right. – wittyurchin Aug 10 '17 at 05:39
Skylake-X comes in configurations with either 1 or 2 AVX512 FMA units... https://software.intel.com/en-us/forums/intel-isa-extensions/topic/737959 – michaf Oct 16 '17 at 07:25
@michaf As far as I know (#IamIntel), all of the Xeon W and Skylake-X SKUs have 2 FMA units. I aggregated all of the public information here: https://github.com/jeffhammond/vpu-count. – Jeff Hammond Jul 02 '18 at 23:53
@MaratDukhan Since this is the most popular source of information about this topic on the internet :-), you should add Cavium ThunderX2. WikiChip or other sources should provide the necessary info. – Jeff Hammond Jul 02 '18 at 23:55

score 21 · Answer 2 · answered Jul 24 '13 at 13:35

The throughput for Haswell is lower for addition than for multiplication and FMA. There are two multiplication/FMA units, but only one f.p. add unit. If your code contains mainly additions then you have to replace the additions by FMA instructions with a multiplier of 1.0 to get the maximum throughput.

The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers.

This is possible indeed, but who would make such a weird optimization for one specific processor?
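
The ten-accumulator figure above is just latency times throughput. Below is a small sketch of that arithmetic, together with the shape of a sum split across independent accumulators. The function names are made up for illustration, and plain Python will not show any speedup; it only illustrates how the dependency chains are separated, which in real code would be done with vector registers in C/C++ or assembly:

```python
def accumulators_needed(latency_cycles, throughput_per_cycle):
    """Independent operations that must be in flight to keep the units busy."""
    return latency_cycles * throughput_per_cycle


def sum_with_accumulators(values, n_acc):
    """Sum values using n_acc independent partial sums (dependency chains)."""
    acc = [0.0] * n_acc
    for i, v in enumerate(values):
        # Element i only depends on the previous value of accumulator i % n_acc.
        acc[i % n_acc] += v
    # Combine the partial sums once at the end.
    return sum(acc)


n = accumulators_needed(5, 2)  # Haswell FMA: latency 5, throughput 2 per clock
total = sum_with_accumulators([float(i) for i in range(100)], n)
print(n, total)
```

Note that splitting a floating-point sum this way changes the order of the additions, so the rounding of the result can differ slightly from a strictly sequential sum.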

answered Jul 24 '13 at 13:35

A Fog

You don't need to manually break the loop, a little bit of compiler unrolling and out-of-order HW (assuming you don't have dependencies) can let you reach a considerable throughput bottleneck. Add to that hyperthreading and 2 operations per clock become quite necessary. – Leeor Nov 23 '13 at 15:15
@Leeor, maybe you could post some code to show this? Unrolling 10 times with FMA gives me the best result. See my answer at http://stackoverflow.com/questions/21090873/loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell/21600232#21600232 – Z boson Feb 06 '14 at 19:50
Most HPC codes that are compute-bound (i.e. flop-bound) do a lot of FMA. In my experience, the places where one does a lot of add are bandwidth-bound such that more add throughput won't help. – Jeff Hammond Jan 15 '16 at 14:49
The newest Intel generation has a more balanced throughput. Floating point addition, multiplication and FMA all have a throughput of 2 instructions per clock cycle and a latency of 4. – A Fog Jan 16 '16 at 16:06
