For example, a modern i7-8700k can supposedly do ~60 GFLOPS (single-precision, source) while its maximum frequency is 4.7GHz. As far as I am aware, an instruction has to take at least one cycle to complete, so how is this possible?
-
Vectorisation? .. – zerkms Jul 07 '18 at 04:14
-
Besides things like SIMD being able to handle multiple operations at once as long as they are the same (i.e. multiply X registers by Y), you seem to be extremely unaware that CPUs have multiple cores. – TomTom Jul 07 '18 at 04:21
-
https://en.wikipedia.org/wiki/FLOPS#FLOPs_per_cycle_for_various_processors - multiple cores and multiple floating point operations occurring in parallel in each core. – FBergo Jul 07 '18 at 04:28
-
Questions about general computing hardware and software are off-topic for Stack Overflow unless they directly involve tools used primarily for programming. You may be able to get help on [Super User](https://superuser.com/). – tambre Jul 07 '18 at 04:55
-
Even two-decade-old CPUs can dispatch [multiple instructions within each clock](https://electronics.stackexchange.com/q/123760/27052), so as long as the pipeline is filled you always have [more than one operation per cycle](https://stackoverflow.com/q/433105/995714) – phuclv Jul 07 '18 at 06:34
-
cross-site duplicate: [How can a CPU deliver more than one instruction per cycle?](https://electronics.stackexchange.com/q/123760/27052) – phuclv Jul 07 '18 at 06:40
1 Answer
There are multiple factors that are all multiplied together for this large effect:
- SIMD: the Intel 8700k and similar processors support AVX and AVX2, which include many instructions that operate on 256-bit registers holding 8 floats at the same time.
- Multiple cores: the 8700k has 6 cores.
- Fused multiply-add (FMA): introduced alongside AVX2, it performs both a multiplication and an addition in the same instruction, so each FMA counts as 2 floating point operations per element.
- High-throughput execution: the latency (the time an individual instruction takes) is not directly important to how much computation a processor can do in a unit of time. A modern CPU such as the 8700k can start executing two (independent) FMAs in the same cycle (and keep in mind these are still SIMD instructions, so that represents a lot of floating point operations) even though the latency of the operation is actually 4 cycles.
Multiplying all those factors together we get: 8 (floats per register) * 6 (cores) * 2 (operations per FMA) * 2 (FMAs per cycle) * 4.3 (GHz) ≈ 825.6 GFLOPS (matching the stats reported here). This is a theoretical peak, and the calculation certainly does not mean that it can actually be attained. For example, the processor may downclock significantly under such a workload in order to stay within its power budget, which is what Intel has been doing at least since Haswell (though the specifics have changed and it applied to server parts). Also, most real code has significant trouble feeding that many FMAs with data. Large matrix multiplications can get close though: for example, according to these stats the 8700k reached 496.7 GFLOPS in their SGEMM benchmark. That is what you would get at 192 operations per cycle and about 2.6GHz, so either the 8700k's maximum AVX2 turbo speed on 6 cores is around 2.6GHz (although as far as I can find it does not have an AVX offset by default, that is only needed when overclocked), or GEMM is just not that close to hitting peak FLOPS.
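
To make the arithmetic above easy to check, here is a minimal Python sketch. The function name `peak_gflops` and the constant names are made up for illustration; the numbers are simply the ones quoted in this answer, so treat it as a back-of-the-envelope model rather than a measurement.

```python
def peak_gflops(simd_lanes, cores, flops_per_fma, fma_per_cycle, clock_ghz):
    """Theoretical peak in GFLOPS: the factors simply multiply."""
    return simd_lanes * cores * flops_per_fma * fma_per_cycle * clock_ghz


# Figures quoted in the answer for the 8700k (single precision).
SIMD_LANES = 8      # floats per 256-bit AVX register
CORES = 6
FLOPS_PER_FMA = 2   # one multiply plus one add
FMA_PER_CYCLE = 2   # independent FMAs a core can start each cycle
FMA_LATENCY = 4     # cycles until an FMA result is available
CLOCK_GHZ = 4.3
SGEMM_GFLOPS = 496.7  # benchmark result quoted in the answer

peak = peak_gflops(SIMD_LANES, CORES, FLOPS_PER_FMA, FMA_PER_CYCLE, CLOCK_GHZ)

# Floating point operations per clock cycle, summed over all cores.
flops_per_cycle = SIMD_LANES * CORES * FLOPS_PER_FMA * FMA_PER_CYCLE

# Clock speed that would explain the SGEMM result at full per-cycle throughput.
implied_clock_ghz = SGEMM_GFLOPS / flops_per_cycle

# Fraction of the theoretical peak that the SGEMM benchmark reached.
efficiency = SGEMM_GFLOPS / peak

# Independent FMAs that must be in flight per core to keep both FMA units
# busy despite the 4-cycle latency (latency * throughput).
chains_needed = FMA_LATENCY * FMA_PER_CYCLE

print(f"peak: {peak:.1f} GFLOPS")
print(f"operations per cycle: {flops_per_cycle}")
print(f"implied SGEMM clock: {implied_clock_ghz:.2f} GHz")
print(f"SGEMM efficiency: {efficiency:.0%}")
print(f"independent FMAs in flight per core: {chains_needed}")
```

The last value is why latency does not limit throughput as long as the code exposes enough independent work: with a 4-cycle latency and 2 FMAs started per cycle, a core needs 8 independent FMAs in flight to stay fully busy.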

-
besides multiple cores, each core can also run 2 different threads for a little bit more performance – phuclv Jul 07 '18 at 06:28
-
Intel provides a different number. They publish GFLOPS metrics for their processors: https://www.intel.com/content/dam/support/us/en/documents/processors/APP-for-Intel-Core-Processors.pdf -- They mention that the i7-8700K achieves 355.2 GFLOPS. -- However, as you mention, some benchmarks have produced higher results. -- Intel uses the base clock for the i7-8700K, i.e., 3.7GHz -- Therefore, Rpeak = CPU GHz * # cores * vector-ops * special-instr = 3.7 GHz * 6 cores * 8 DP vector ops * 2 FMA3 = 355.2 GFLOPS – Jaime Jun 16 '19 at 14:38