I vectorized the following loop, which crops up in an application I am developing:
void vecScl(Node** A, Node* B, long val){
    // dot and CONST are computed elsewhere in the application
    int fact = round( dot / CONST );
    for(long i = 0; i < SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];
}
And this is the SSE code:
void vecSclSSE(Node** A, Node* B, long val){
    int fact = round( dot / CONST );
    // broadcast fact once, outside the loop
    const __m128i vecQi = _mm_set1_epi32(fact);
    __m128i vecPi, vecCi, vecQCi, vecResi;
    const long sseBound = SIZE - (SIZE % 4);
    long i;
    for(i = 0; i < sseBound; i += 4){
        vecPi   = _mm_loadu_si128((__m128i *)&((*A)->vector[i]));
        vecCi   = _mm_loadu_si128((__m128i *)&(B->vector[i]));
        vecQCi  = _mm_mullo_epi32(vecQi, vecCi);   // SSE4.1
        vecResi = _mm_sub_epi32(vecPi, vecQCi);
        _mm_storeu_si128((__m128i *)&((*A)->vector[i]), vecResi);
    }
    // Compute remaining positions if SIZE % 4 != 0
    for(; i < SIZE; i++)
        (*A)->vector[i] -= fact * B->vector[i];
}
While this works in terms of correctness, the performance is exactly the same with and without SSE. I am compiling the code with:
g++ *.cpp *.h -msse4.1 -march=corei7-avx -mtune=corei7-avx -mno-avx -mno-aes -Warray-bounds -O2
Is this because I am not allocating aligned memory (and using the aligned SSE load/store intrinsics accordingly)? The code base is complicated to change, so I have been avoiding that for now.
BTW, in terms of further improvements, and considering that I am bound to the Sandy Bridge architecture, what is the best that I can do?
EDIT: The compiler is not vectorizing the code on its own. First, I changed the data type of the vectors to shorts, which didn't change performance. Then I compiled with -fno-tree-vectorize and the performance was the same.
Thanks a lot