I was wondering why a simple loop such as this one can't hit my CPU clock speed (4.2 GHz):
float sum = 0;
for (int i = 0; i < 1000000; i += 1) {
    sum = sum * 1 + 1;
}
Intuitively, I would expect to finish this in less than 1 ms (around 0.238 ms), doing 4.2 billion iterations per second. But I get about 3 ms, which is roughly 333 million iterations per second.
I assumed the math takes 2 cycles: one for the multiplication and one for the addition. So let's say I'm doing 666 million operations per second... that still seems slow. Then I assumed the loop comparison takes a cycle and incrementing the loop counter takes another...
So I created the following code to remove the loop...
void listOfSums() {
    float internalSum = 0;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    internalSum = internalSum * 1 + 1;
    // ... repeated 100k times
}
To my surprise, it's slower: this takes about 10 ms, which is only 100 million iterations per second.
Given that modern CPUs use pipelining, out-of-order execution, branch prediction... it seems I'm unable to saturate the 4.2 GHz clock by just doing two floating-point operations inside a loop.
Is it safe to then assume that 4.2 GHz is only achievable with SIMD, to fully saturate the CPU core with work, and that a simple loop will get you about 1/6 of the clock rate in floating-point performance? I've tried different processors and 1/6 seems to be in the ballpark (Intel, iPhone, iPad).
What exactly is the bottleneck? The CPU's ability to parse instructions? And is SIMD the only way around it?