I have a question about the peak FP performance of my Core i7 920. I have an application that does a lot of MAC operations (basically a convolution), and even with multi-threading and SSE instructions I fall short of the CPU's peak FP performance by a factor of ~8x. While trying to find the reason, I ended up with a simplified code snippet that runs on a single thread, does not use SSE instructions, and performs equally badly:
for (i = 0; i < 49335264; i++)
{
    data[i] += other_data[i] * other_data2[i];
}
If I'm correct (the data and other_data arrays are all FP), this piece of code requires:
49335264 * 2 = 98670528 FLOPs
It executes in ~150 ms (I'm very sure this timing is correct, since C timers and the Intel VTune Profiler give me the same result).
This means the performance of this code snippet is:
98670528 / (150 * 10^-3) / 10^9 = 0.66 GFLOPs/sec
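For anyone who wants to reproduce the measurement, here is a minimal sketch of the loop, the timing, and the FLOP-rate arithmetic above. It rests on assumptions: the arrays are taken to be single-precision float (I only said "FP" above), and the names mac, gflops and time_mac are made up for this illustration. It also uses clock(), which measures CPU time and is only a rough stand-in for the timers I actually used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Multiply-accumulate over n elements: one multiply and one add each.
   Assumption: float elements; the original arrays may be double. */
void mac(float *data, const float *a, const float *b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        data[i] += a[i] * b[i];
}

/* Convert an element count and an elapsed time in seconds into GFLOPs/sec,
   counting 2 FLOPs per element as in the estimate above. */
double gflops(size_t n, double seconds)
{
    assert(seconds > 0.0);
    return 2.0 * (double)n / seconds / 1e9;
}

/* Time a single pass of mac() over freshly allocated arrays.
   Returns elapsed CPU seconds, or -1.0 if n is 0 or allocation fails. */
double time_mac(size_t n)
{
    if (n == 0)
        return -1.0;

    float *data = calloc(n, sizeof *data);
    float *a = calloc(n, sizeof *a);
    float *b = calloc(n, sizeof *b);
    double seconds = -1.0;

    if (data && a && b) {
        /* Touch every element first so page faults are not timed. */
        for (size_t i = 0; i < n; i++) {
            data[i] = 0.0f;
            a[i] = 1.0f;
            b[i] = 2.0f;
        }

        clock_t start = clock();
        mac(data, a, b, n);
        clock_t stop = clock();

        /* Read one result so the compiler cannot discard the loop. */
        volatile float sink = data[n - 1];
        (void)sink;

        seconds = (double)(stop - start) / CLOCKS_PER_SEC;
    }

    free(data);
    free(a);
    free(b);
    return seconds;
}
```

With the numbers from this post, gflops(49335264, 0.150) comes out at roughly 0.66. Calling time_mac(49335264) needs about 600 MB of memory for the three float arrays.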
Whereas the peak performance of this CPU should be 2 * 3.2 = 6.4 GFLOPs/sec (2 FP units, 3.2 GHz processor), right?
Is there any explanation for this huge gap? I cannot explain it myself.
Thanks a lot in advance, I could really use your help!