Where does the x86-64's SSE instructions (vector instructions) outperform the normal instructions. Because what I'm seeing is that the frequent loads and stores that are required for executing SSE instructions is nullifying any gain we have due to vector calculation. So could someone give me an example SSE code where it performs better than the normal code.
Its maybe because I am passing each parameter separately, like this...
__m128i a = _mm_set_epi32(pa[0], pa[1], pa[2], pa[3]);
__m128i b = _mm_set_epi32(pb[0], pb[1], pb[2], pb[3]);
__m128i res = _mm_add_epi32(a, b);
for( i = 0; i < 4; i++ )
po[i] = res.m128i_i32[i];
Isn't there a way I can pass all the 4 integers at one go, I mean pass the whole 128 bytes of pa
at one go? And assign res.m128i_i32
to po
at one go?