3

I was wondering why a simple loop such as this one can't hit my CPU clock speed (4.2 GHz):

float sum = 0;    
for (int i = 0; i < 1000000; i+=1) {
    sum = sum * 1 + 1;
}

Intuitively I would expect it to finish in under 1 ms (around 0.238 ms), doing 4.2 billion iterations per second. But I get about 3 ms, which is roughly 333 million iterations per second.

I assume the math takes 2 cycles, one for the multiplication and another for the addition. So let's say I'm doing 666 million operations per second... that still seems slow. Then I assumed that the loop comparison takes a cycle and the loop counter increment takes another cycle...

So I created the following code to remove the loop...

void listOfSums() {
    float internalSum = 0;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    // Repeated 100k times
}

To my surprise it's slower: this now takes 10 ms, which is only 100 million iterations per second.

Given that modern CPUs use pipelining, out-of-order execution, branch prediction, and so on, it seems that I'm unable to saturate the 4.2 GHz core by doing just two floating-point operations inside a loop.

Is it safe then to assume that 4.2 GHz is only achievable with SIMD, which fully saturates the CPU core with work, and that a simple loop will get you about 1/6 of the clock rate in floating-point performance? I've tried different processors and 1/6 seems to be in the ballpark (Intel, iPhone, iPad).

What exactly is the bottleneck? The CPU's ability to issue the instructions? And can it only be overcome with SIMD?

Vladislav
  • 1
    The main bottleneck on modern computers is the memory. The 100k lines will fill the instruction cache very fast, and with instruction reuse being low (none), every instruction has to be fetched from memory. That may be why the second example is slower. Interestingly, that's also why unrolling loops can lead to slower execution – Elzaidir Jul 08 '22 at 01:43
  • 1
    What compiler options for what CPU? I assume you didn't use `-ffast-math`, so the compiler couldn't break the FP dependency chain. Re: understanding FP dep chains, [Latency bounds and throughput bounds for processors for operations that must occur in sequence](https://stackoverflow.com/q/63095394) / [Why does this code execute more slowly after strength-reducing multiplications to loop-carried additions?](https://stackoverflow.com/q/72306573) (especially my answer there), and [Unrolling FP loops with multiple accumulators](https://stackoverflow.com/q/45113527) – Peter Cordes Jul 08 '22 at 02:32
  • 1
    If you forgot to enable optimization, that also adds a store/reload to the critical path latency. The `*1` probably optimizes away, leaving one `addss` per 4 (ALU) + 5 (store-forwarding) cycles or so; sounds about right for Skylake with optimization disabled (which is useless for benchmarking). [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) – Peter Cordes Jul 08 '22 at 02:35
  • 1
    See also [How do I achieve the theoretical maximum of 4 FLOPs per cycle?](https://stackoverflow.com/q/8389648) – Peter Cordes Jul 08 '22 at 03:24
  • 2
    Also BTW, none of your assumptions about how long things take on CPUs are correct. Times don't add linearly for independent things that short on a modern *superscalar* out-of-order exec CPU. The whole point is to overlap execution of multiple things. [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) – Peter Cordes Jul 08 '22 at 03:25

1 Answer

6

It is typical that a current processor can issue one or more floating-point additions in each processor cycle and can issue one or more floating-point multiplications in each cycle. It is also typical that a floating-point addition or multiplication will take four cycles to complete. This means, once you have started four floating-point additions—one in cycle n, one in cycle n+1, one in cycle n+2, and one in cycle n+3—the processor may be completing one addition per cycle—one in cycle n+4 (while a new one also starts in cycle n+4), one in n+5, and so on.

However, in order to start a floating-point operation, the inputs to that operation must be ready. Once sum * 1 has been started in cycle n, its result might not be ready until cycle n+4. So the addition of 1 will start in cycle n+4. And that addition will not complete until cycle n+8. And then the multiplication in the next iteration that uses that result cannot start until cycle n+8. Thus, with the nominal loop structure shown, one floating-point addition or multiplication will be completed every four cycles.

If instead you try:

float sum0 = 0;
float sum1 = 0;
float sum2 = 0;
float sum3 = 0;    
for (int i = 0; i < 1000000; i += 1)
{
    sum0 = sum0 * 1 + 1;
    sum1 = sum1 * 1 + 1;
    sum2 = sum2 * 1 + 1;
    sum3 = sum3 * 1 + 1;
}

then you may find four times as many floating-point operations are completed in the same time.

These details vary from processor model to processor model. Some processors might start working on certain instructions before all of their inputs are ready, some might offer early forwarding of results directly to other instruction execution units before the result is delivered to the regular result register, and so on, so the obtained performance is hugely dependent on processor characteristics.

Regarding the listOfSums example, the code grossly exceeds the size of the L1 instruction cache, and therefore the processor must load each instruction from memory before executing it, which greatly reduces performance.

Eric Postpischil
  • Code-fetch can probably keep up with low-throughput code like this that bottlenecks on FP latency, thanks to HW prefetch into L2 and so on. (Single-core fetch bandwidth from DRAM on a typical desktop/laptop is around 8 bytes per cycle with fast DDR4, and this code probably compiles to movss-load/addss-constant/movss store for each iteration, if optimization is disabled. `*=1` may optimize away). I'd expect the problem is the very first time through this code, it has to page-fault in code for new pages! So the timed region includes page faults. – Peter Cordes Jul 08 '22 at 02:49
  • Yup, https://godbolt.org/z/T31hd3r8n - GCC optimizes away the `*1` no-op even at `-O0` – Peter Cordes Jul 08 '22 at 03:02