Attempting to 'optimise' C++ code by wrapping it in a hand-rolled timing loop is pretty silly in general, and this is one of those cases :(
Your code LITERALLY boils down to nothing more than this:
#include <chrono>
#include <iostream>
using namespace std;

// aliases implied by the original code
using Clock = chrono::high_resolution_clock;
using TimeStamp = Clock::time_point;

int main()
{
    TimeStamp start = Clock::now();
    TimeStamp end = Clock::now();
    double dt = chrono::duration_cast<chrono::nanoseconds>(end - start).count();
    cout << dt << endl;
    return 0;
}
The compiler isn't stupid: the result of your inner loop is never used, so the whole loop is dead code, and the compiler has removed it entirely.
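If you genuinely want a benchmarking loop like this to survive, you have to convince the optimiser that its results are observable. A minimal sketch of one well-known trick (GCC/Clang-specific inline asm, the same idea as benchmark::DoNotOptimize in Google Benchmark; 'escape' and 'benchmark_add' are my own names, not from the original code):

// An empty asm that claims to read p and clobber memory, so the
// optimiser must assume the stores behind p are used by someone.
static void escape(void* p)
{
    asm volatile("" : : "g"(p) : "memory");
}

void benchmark_add(float* out, const float* in1, const float* in2, int MAX)
{
    for(int i = 0 ; i < MAX ; i++)
    {
        out[i] = in1[i] + in2[i];
    }
    escape(out); // without this, an unused 'out' lets the loop die
}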
Even if the compiler decided to keep your loop around, you're issuing 3 memory operations for each addition. Back-of-the-envelope: each addition moves 12 bytes (two 4-byte loads plus one 4-byte store), and single-channel DDR3-1600 delivers roughly 12.8 GB/s, which works out to around a billion additions per second - far fewer than the CPU can issue. If your RAM is 1600MHz and your CPU is 3200MHz, then your tests are simply proving to you that you are memory bandwidth limited. Profiling loops like this isn't useful; you'll always be better off testing a real-world situation in a profiler.
Anyhow, back to the loop in question. Let's throw the code into Compiler Explorer and play around with some options...
https://godbolt.org/z/5SJQHb
F0: Just a basic, boring C loop.
for(int i = 0 ; i < MAX ; i++)
{
    out[i] = in1[i] + in2[i];
}
The compiler outputs this inner loop:
vmovups ymm0,YMMWORD PTR [rsi+r8*4]
vmovups ymm1,YMMWORD PTR [rsi+r8*4+0x20]
vmovups ymm2,YMMWORD PTR [rsi+r8*4+0x40]
vmovups ymm3,YMMWORD PTR [rsi+r8*4+0x60]
vaddps ymm0,ymm0,YMMWORD PTR [rdx+r8*4]
vaddps ymm1,ymm1,YMMWORD PTR [rdx+r8*4+0x20]
vaddps ymm2,ymm2,YMMWORD PTR [rdx+r8*4+0x40]
vaddps ymm3,ymm3,YMMWORD PTR [rdx+r8*4+0x60]
vmovups YMMWORD PTR [rdi+r8*4],ymm0
vmovups YMMWORD PTR [rdi+r8*4+0x20],ymm1
vmovups YMMWORD PTR [rdi+r8*4+0x40],ymm2
vmovups YMMWORD PTR [rdi+r8*4+0x60],ymm3
Unrolled, processing 32 floats per iteration (four 8-wide AVX2 adds), plus extra code after the main loop to handle the up-to-31 remaining elements when MAX isn't a multiple of 32.
F1: Your SSE 'optimised' loop above. (Obviously this code doesn't handle the last up-to-3 elements when MAX isn't a multiple of 4 - see the sketch after the disassembly.)
for(int i = 0 ; i < MAX ; i+=4)
{
    __m128 a = _mm_load_ps(&in1[i]);
    __m128 b = _mm_load_ps(&in2[i]);
    __m128 result = _mm_add_ps(a,b);
    _mm_store_ps(&out[i],result);
}
This outputs:
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4]
vmovaps XMMWORD PTR [rdi+rcx*4],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x10]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x10]
vmovaps XMMWORD PTR [rdi+rcx*4+0x10],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x20]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x20]
vmovaps XMMWORD PTR [rdi+rcx*4+0x20],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x30]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x30]
vmovaps XMMWORD PTR [rdi+rcx*4+0x30],xmm0
So the compiler has unrolled the loop, but it's stuck with SSE (as requested), so each instruction now processes half as many floats as the AVX2 version: in theory half the performance of the original loop (in practice not quite true, since memory bandwidth will be the limiting factor here).
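As an aside, here's roughly what the missing tail handling for F1 would look like - my own sketch, not from the original post (needs <xmmintrin.h>; remember _mm_load_ps also assumes 16-byte-aligned pointers):

int i = 0;
for( ; i + 4 <= MAX ; i += 4) // vector body: 4 floats per iteration
{
    __m128 a = _mm_load_ps(&in1[i]);
    __m128 b = _mm_load_ps(&in2[i]);
    _mm_store_ps(&out[i], _mm_add_ps(a, b));
}
for( ; i < MAX ; i++) // scalar tail: the up-to-3 leftover elements
{
    out[i] = in1[i] + in2[i];
}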
F2: Your manually unrolled C++ loop (with the indices corrected; it still fails to handle the last up-to-3 elements)
for(int i = 0 ; i < MAX ; i += 4)
{
    out[i + 0] = in1[i + 0] + in2[i + 0];
    out[i + 1] = in1[i + 1] + in2[i + 1];
    out[i + 2] = in1[i + 2] + in2[i + 2];
    out[i + 3] = in1[i + 3] + in2[i + 3];
}
And the output:
vmovss xmm0,DWORD PTR [rsi+rax*4]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4]
vmovss DWORD PTR [rdi+rax*4],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0x4]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0x4]
vmovss DWORD PTR [rdi+rax*4+0x4],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0x8]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0x8]
vmovss DWORD PTR [rdi+rax*4+0x8],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0xc]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0xc]
vmovss DWORD PTR [rdi+rax*4+0xc],xmm0
Well, this has completely failed to vectorise! It's processing one addition at a time. This is usually down to pointer aliasing: the compiler has to assume that out might overlap in1 or in2, in which case each store to out could change a value that a later load in the unrolled body reads, so it can't legally batch the four iterations together.
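To see why the compiler has to be this careful, consider a call like this (an illustrative example of mine, not from the original post):

float buf[5] = { 1, 2, 3, 4, 5 };
// out overlaps the inputs: the store to out[0] writes buf[1],
// which the unrolled body then re-reads as in1[1] one line later.
func(buf + 1, buf, buf, 4);

Loading all four lanes up front would compute a different answer for that call, so without more information the compiler plays it safe and stays scalar. The fix is to promise it that no such overlap exists, so I'll change the function prototype from this: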
void func(float* out, const float* in1, const float* in2, int MAX);
to this: (F4)
void func(
    float* __restrict out,
    const float* __restrict in1,
    const float* __restrict in2,
    int MAX);
and now the compiler will output something that is vectorised:
vmovups xmm0,XMMWORD PTR [rsi+rcx*4]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4]
vmovups xmm1,XMMWORD PTR [rsi+rcx*4+0x10]
vaddps xmm1,xmm1,XMMWORD PTR [rdx+rcx*4+0x10]
vmovups XMMWORD PTR [rdi+rcx*4],xmm0
vmovups xmm0,XMMWORD PTR [rsi+rcx*4+0x20]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x20]
vmovups XMMWORD PTR [rdi+rcx*4+0x10],xmm1
vmovups xmm1,XMMWORD PTR [rsi+rcx*4+0x30]
vaddps xmm1,xmm1,XMMWORD PTR [rdx+rcx*4+0x30]
vmovups XMMWORD PTR [rdi+rcx*4+0x20],xmm0
vmovups XMMWORD PTR [rdi+rcx*4+0x30],xmm1
HOWEVER, this code is still half the performance of the first version: the manual 4-wide unroll has steered the compiler into 4-wide SSE, whereas the plain loop in F0 got 8-wide AVX2....
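For completeness, here's the version I'd actually write: the plain F0 loop plus the __restrict promise, leaving the vectorisation strategy to the compiler. A sketch, not gospel - __restrict is a compiler extension (supported by MSVC/GCC/Clang), and you'd compile with something like -O2 -mavx2 to get the ymm code shown at the top:

void func(
    float* __restrict out,
    const float* __restrict in1,
    const float* __restrict in2,
    int MAX)
{
    // No intrinsics, no manual unrolling: the auto-vectoriser produces
    // an unrolled AVX2 loop like the one at the top, plus the tail
    // code for the leftover elements, for free.
    for(int i = 0 ; i < MAX ; i++)
    {
        out[i] = in1[i] + in2[i];
    }
}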