I tried to port some code from the FANN library (a neural network library written in C) to SSE2, but the SSE2 version performs worse than the plain code. With my SSE2 implementation one run takes 5.50 min; without it, the same run takes 5.20 min.
How can SSE2 be slower than the normal run? Could it be because of the _mm_set_ps? I use the Apple LLVM Compiler (Xcode 4) to compile the code (all SSE extension flags are on, optimization level is -Os).
Code without SSE2:
neuron_sum +=
fann_mult(weights[i], neurons[i].value) +
fann_mult(weights[i + 1], neurons[i + 1].value) +
fann_mult(weights[i + 2], neurons[i + 2].value) +
fann_mult(weights[i + 3], neurons[i + 3].value);
SSE2 code:
__m128 a_line=_mm_loadu_ps(&weights[i]);
__m128 b_line=_mm_set_ps(neurons[i+3].value,neurons[i+2].value,neurons[i+1].value,neurons[i].value);
__m128 c_line=_mm_mul_ps(a_line, b_line);
neuron_sum+=c_line[0]+c_line[1]+c_line[2]+c_line[3];
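For comparison, here is a variant I could try: keep the partial sums in a vector accumulator and do the horizontal add only once after the loop, instead of extracting all four lanes on every iteration. This is only a minimal sketch under assumptions, not FANN's actual code: `struct neuron_sketch` and `weighted_sum_sse` are hypothetical names, and I assume `fann_mult` is a plain float multiply, as it appears to be in the floating-point build. The `_mm_set_ps` gather is still there, because the `value` fields are not contiguous in memory.

```c
#include <assert.h>
#include <stddef.h>
#include <emmintrin.h> /* SSE2 header; also provides the SSE float intrinsics */

/* Hypothetical stand-in for the FANN neuron struct (assumption):
 * only the layout idea matters, i.e. `value` is one field among others. */
struct neuron_sketch {
    float sum;
    float value;
};

/* Computes sum(weights[i] * neurons[i].value) for i in [0, n). */
static float weighted_sum_sse(const float *weights,
                              const struct neuron_sketch *neurons,
                              size_t n)
{
    assert(weights != NULL && neurons != NULL);

    __m128 acc = _mm_setzero_ps(); /* four running partial sums */
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m128 w = _mm_loadu_ps(&weights[i]);
        /* The values are strided in memory, so they are gathered lane by lane. */
        __m128 v = _mm_set_ps(neurons[i + 3].value, neurons[i + 2].value,
                              neurons[i + 1].value, neurons[i].value);
        acc = _mm_add_ps(acc, _mm_mul_ps(w, v));
    }

    /* Single horizontal sum after the loop. */
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];

    /* Scalar tail for the remaining 0 to 3 elements. */
    for (; i < n; ++i)
        total += weights[i] * neurons[i].value;

    return total;
}
```

Note that the summation order differs from the scalar loop, so the result may differ in the last bits for general inputs. Whether this is actually faster would have to be measured; the per-element gather may still dominate.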