How to compute the single-precision data and double-precision data peak performance for Intel(R) Core(TM) i7-3770 CPU

Question

How do I compute the single-precision and double-precision peak performance of the Intel(R) Core(TM) i7-3770 CPU? The output of "cat /proc/cpuinfo" on Linux is below (only the entry for the last logical processor is shown):

processor   : 7
vendor_id   : GenuineIntel
cpu family  : 6
model       : 58
model name  : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
stepping    : 9
microcode   : 0x10
cpu MHz     : 1600.000
cache size  : 8192 KB
physical id : 0
siblings    : 8
core id     : 3
cpu cores   : 4
apicid      : 7
initial apicid  : 7
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms
bogomips    : 6784.16
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

There is a similar question, How to compute the theoretical peak performance of CPU. Its answer gives the formula for computing peak performance, but it only works out the double-precision figure. How do I compute the single-precision figure? Could someone give two formulas, one for single-precision data and one for double-precision data?

Floating-point arithmetic is handled by the SSE unit; the i7-3770 supports SSE4.1/4.2 and AVX. So my other question is: do different versions of SSE provide a different number of instructions per cycle for single-precision and double-precision data? Where can I find detailed documentation?

For recent x86 with SIMD, single precision peak performance is two times double precision peak performance. — , May 11 '14 at 17:45
The single precision peak performance of Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz is 3.4GHz*2(mul,add)*4(SIMD single precision)*4(physical core)=108.8GFLOPS? — taoyuan, May 12 '14 at 02:48
Intel apparently lists double precision figures for the i7-3770 here: http://download.intel.com/support/processors/corei7/sb/core_i7-3700_d.pdf. Single precision is double this figure (217.6 GFLOPS). — DylRicho, Jan 30 '15 at 17:33
If you count the GPU, the SP number is a lot higher (http://kyokojap.myweb.hinet.net/gpu_gflops/ - look for HD4000). — Jeff Hammond, Sep 20 '15 at 18:11

DylRicho · Answer 1 · 2022-02-08T02:44:02.400

GFLOPS Equation

For a system with one processor (and one socket), here's the equation:

GFLOPS = number of cores × core frequency (GHz) × number of operations per clock cycle

For this equation, use physical cores, not logical cores (threads). The number of operations a processor core can complete per clock cycle varies with the architecture of the processor in question and with whether you're after single- or double-precision figures. I'll explain this a little more below.
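As a rough worked example of the equation above, here is a minimal Python sketch for the i7-3770 from the question. The helper names (field, peak_gflops) are made up for illustration. The core count and nominal clock are read from an excerpt of the /proc/cpuinfo output in the question, and the 16 (single) and 8 (double) FLOPs/cycle values are the Ivy Bridge figures from the list further down. Note that it takes the nominal 3.40 GHz from the model name rather than the "cpu MHz" field, which shows 1600 MHz because the CPU was clocked down when the output was captured; that choice is my assumption about what "core frequency" should mean here.

```python
import re

# Excerpt of the /proc/cpuinfo output from the question.
CPUINFO = """\
model name  : Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz
cpu MHz     : 1600.000
siblings    : 8
cpu cores   : 4
"""

def field(text, key):
    """Return the value of the first 'key : value' line in text (hypothetical helper)."""
    match = re.search(rf"^{re.escape(key)}\s*:\s*(.*)$", text, re.MULTILINE)
    if match is None:
        raise KeyError(key)
    return match.group(1).strip()

def peak_gflops(cores, ghz, flops_per_cycle):
    """GFLOPS = number of cores x core frequency (GHz) x FLOPs per clock cycle."""
    return cores * ghz * flops_per_cycle

# Physical cores ("cpu cores"), not logical processors ("siblings").
cores = int(field(CPUINFO, "cpu cores"))
# Nominal clock taken from the model name string (assumption: use base clock, not turbo).
ghz = float(re.search(r"@\s*([\d.]+)GHz", field(CPUINFO, "model name")).group(1))

sp = peak_gflops(cores, ghz, 16)  # Ivy Bridge, single precision
dp = peak_gflops(cores, ghz, 8)   # Ivy Bridge, double precision
print(f"single: {sp:.1f} GFLOPS, double: {dp:.1f} GFLOPS")
# prints: single: 217.6 GFLOPS, double: 108.8 GFLOPS
```

These are theoretical peaks only; real code will normally achieve less.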

SSE, SSE2 and 3DNow! Instructions (ISEs)

Calculating the FLOPs performance for older processor architectures is a little more involved than the newer chips we're used to. If you don't plan on calculating the FLOPs/cycle of any chip older than a K8 or Core2, then you can gloss over this section. One thing to take away from this, though, is that instruction set extensions like these can affect the number of FLOPs/cycle a chip can run. For example, a Pentium 4 with no instruction set extensions can perform, at best, 1 FLOP/cycle in single precision. With SSE being utilized, however, it can perform 4 FLOPs/cycle in single precision. Additionally, double precision for a Pentium 4 doubles from 1 FLOP/cycle with no extensions, to 2 FLOPs/cycle using SSE2.

If SSE instructions are supported, 4 single-precision FLOPs can be executed with every clock cycle. This applies to both Intel and AMD processors that support SSE instructions.

SSE2 instructions allow for 2 FLOPs with every cycle for double-precision arithmetic. SSE2 does not affect single precision. Again, this applies to both vendors, but be warned: only a limited range of AMD's processors supported SSE2 during the early adoption phase, and that's where the last set of instructions comes in...

3DNow! instructions are only used by AMD parts. In terms of FLOPs/cycle, the functionality is identical to SSE instructions. Therefore, AMD chips that support 3DNow! but lack SSE support can still carry out 4 FLOPs per clock cycle for single precision. 3DNow! does not affect double precision. There are also AMD models that support both 3DNow! and SSE instructions. Why, you ask? The functionality of these instructions goes beyond FLOP improvements, and each offers features that the other doesn't. That is beyond the scope of what you're asking, but I felt it necessary to clarify to avoid confusion.

Both Intel and AMD like to calculate FLOPs/cycle with all instruction set extensions enabled, so I'd advise you to do the same.

With newer architectures, this need not be a concern. All Intel families from the Pentium III support SSE, and from the Pentium 4 support SSE2. All AMD families from the K6-2 support 3DNow!, and from the Athlon XP/MP, Duron and Sempron support SSE. SSE2 support in AMD chips didn't arrive until the Athlon 64 and its siblings, Sempron and Turion 64.
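To show where numbers like "4 single, 2 double" come from, here is a small sketch that counts how many floating-point values fit in one SIMD register. The lanes function is a name I made up. The factor of 2 at the end is an assumption on my part: that the core can issue one vector add and one vector multiply per cycle, which is how I understand the 16/8 figures for AVX-capable chips like the i7-3770 to arise.

```python
def lanes(register_bits, element_bits):
    """Number of floating-point values that fit in one SIMD register (hypothetical helper)."""
    return register_bits // element_bits

SINGLE, DOUBLE = 32, 64  # IEEE 754 element widths in bits

# 128-bit SSE/SSE2 registers: matches the 4 (single) and 2 (double) figures above.
sse = (lanes(128, SINGLE), lanes(128, DOUBLE))
print(sse)  # (4, 2)

# 256-bit AVX registers (the question's cpuinfo flags include "avx").
avx = (lanes(256, SINGLE), lanes(256, DOUBLE))
print(avx)  # (8, 4)

# Assumption: one vector add plus one vector multiply can issue each cycle.
ops_per_cycle = 2
avx_flops = (avx[0] * ops_per_cycle, avx[1] * ops_per_cycle)
print(avx_flops)  # (16, 8)
```

This also shows why single precision comes out at twice the double-precision figure: twice as many 32-bit values fit in the same register.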

FLOPs/Cycle per Architecture

(Note the following list contains architecture names, not processor family names.)

P5 & P6 (no ISEs) + Pentium Pro & Pentium II = 1 (single); 1 (double)
P6 (Pentium III only) = 4 (single); 1 (double)
NetBurst = 4 (single); 2 (double)
Pentium M & Enhanced Pentium M = 4 (single); 2 (double)
Core, Penryn, Nehalem & Westmere = 8 (single); 4 (double)
Sandy Bridge & Ivy Bridge = 16 (single); 8 (double)
Haswell, Broadwell, Skylake (LGA1151 & Mobile), Kaby Lake & Coffee Lake = 32 (single); 16 (double)
Skylake ("Skylake-X" Core i7 & Core i9 [LGA2066]) = 64 (single); 32 (double)
Skylake ("Skylake-SP" Xeon Bronze & Xeon Silver) = 32 (single); 16 (double)
Skylake ("Skylake-SP" Xeon Gold & Xeon Platinum) = 64 (single); 32 (double)
Bonnell, Saltwell, Silvermont & Airmont = 6 (single); 1.5 (double)
MIC ("Knights Corner" Xeon Phi) = 32 (single); 16 (double)
MIC ("Knights Landing" Xeon Phi) = 64 (single); 32 (double)
K5 & K6 = 0.5 (single); 0.5 (double)
K6-2 & K6-III = 4 (single); 0.5 (double)
K7 & K8 = 4 (single); 2 (double)
K10/Stars = 8 (single); 4 (double)
Husky = 8 (single); 4 (double)
[Note] Bulldozer, Piledriver, Steamroller & Excavator = 8 (single); 4 (double)
Zen & Zen+ = 16 (single); 8 (double)
Zen 2 & Zen 3 = 32 (single); 16 (double)
Bobcat = 4 (single); 1.5 (double)
Jaguar, Puma and Puma+ = 8 (single); 3 (double)

Note: Shared FPUs mean there's one FPU for every two cores. Contrary to figures circulating online, AMD states that the Steamroller-based A10-7850K is capable of 856 SP GFLOPs; 737 of those come from the Radeon R7 integrated graphics, leaving 119 for the CPU. Reaching 119 SP GFLOPs requires 8 FLOPs per cycle per core. This should apply to all variants of Bulldozer, as the FPU design has remained identical throughout.
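The arithmetic in the note can be checked by running the GFLOPS equation backwards. This is only a sketch: the 856 and 737 figures come from the note above, while the core count (4) and the 3.7 GHz clock for the A10-7850K are values I am assuming, since the note does not state them.

```python
total_sp_gflops = 856  # AMD's combined CPU + GPU figure, from the note
gpu_sp_gflops = 737    # Radeon R7 integrated graphics share, from the note
cpu_sp_gflops = total_sp_gflops - gpu_sp_gflops

cores = 4       # assumption: A10-7850K core count
base_ghz = 3.7  # assumption: A10-7850K base clock in GHz

# Rearranged: FLOPs/cycle = GFLOPS / (cores x GHz)
flops_per_cycle = cpu_sp_gflops / (cores * base_ghz)
print(cpu_sp_gflops, round(flops_per_cycle, 2))  # 119 8.04
```

Under those assumptions the result rounds to 8 single-precision FLOPs per cycle per core, in line with the list entry.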

That's a great list. Can you also add (for future reference) Knights Corner and Knights Landing? — Zack, Sep 09 '15 at 11:35
@azar Thank you. I've added Skylake and Knights Corner to the list. I'm not sure about Knights Landing or Zen just yet, but I'll add them in the future when I get details. :) — DylRicho, Sep 11 '15 at 11:33
@azar I've added the details for Knights Landing. I'm expecting Zen to at least tie with Haswell in terms of FLOPS per clock cycle. :) — DylRicho, Sep 16 '15 at 21:18
@Jeff Without proof to state otherwise, I believe it to be accurate. The 6700K has a theoretical compute figure of 256 GFLOPS, using all four cores. — DylRicho, Sep 20 '15 at 16:53
@DylRicho start by reading https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kirill_Yukhin_20140711.pdf. Note how "Skylake Xeon" differs from Haswell. — Jeff Hammond, Sep 20 '15 at 17:49
By the way, http://stackoverflow.com/a/15657772 is a closely related answer that covers the per-core vs per-thread situation on Knights Corner. — Jeff Hammond, Sep 20 '15 at 18:01
And you probably know that GEN (GPU) on Intel client parts has quite a few SP flop/s, although it's not accessible by the same programming models. — Jeff Hammond, Sep 20 '15 at 18:03
@Jeff, I know about Skylake Xeon. Consumer grade Skylake chips however don't support AVX-512. — DylRicho, Sep 20 '15 at 19:57
@Jeff Regarding Knights Corner, Intel has stated the maximum theoretical compute power of a 61-core Xeon Phi to be just over 1.3 TFLOPS. That means they've used the per-core calculation, so I did as well. — DylRicho, Sep 20 '15 at 19:59
@DylRicho I was trying to suggest you add this clarification to your answer, since some may come to this page because of a server-oriented query. After all, you provide Xeon Phi data, which is obviously not in the client class. — Jeff Hammond, Sep 20 '15 at 20:00
@Jeff I see, my bad. I could add the Intel HD Graphics figures as well, although this question was specifically mentioning central processors. The only reason Xeon Phi is mentioned is because someone requested it. I had planned on keeping it consumer-oriented, since the original question was relating to the i7-3770. — DylRicho, Sep 20 '15 at 20:26
