I'm trying my hand at optimising some code and have run into an issue while vectorising it.
I essentially have a nested loop like this:
for(int i = 0; i<N; i++)
{
    for(int j = 0; j<N; j++)
    {
        //Bunch of calculations
        //array[i] += (x*y);
    }
}
In the process of vectorising the inner loop, x and y both become vectors. So I have x_vector holding four values in one register, and y_vector holding four values in another.
In order to add these to array[i], I need to perform the calculation of x_vector*y_vector, sum the four results to a single variable and then add it to array[i]. So something like this:
__m128 x_vector ....
__m128 y_vector ....
__m128 xy_vector = _mm_mul_ps(x_vector, y_vector);
//now the xy_vector has all 4 multiplication results, need to sum them to a single variable
float result = _mm_someInstruction_ps(xy_vector);
array[i] += result;
Is there an instruction listed in the Intel Intrinsics Guide that does this? I looked into _mm_add_ps, but that returns a vector. Is there an add instruction that sums the contents of a single register and returns the result as a scalar?
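To make the question concrete, here is a minimal sketch of the reduction described above, assuming plain SSE (nothing newer) and unaligned input pointers. The helper names hsum_ps and dot4 are hypothetical and only for illustration. As far as I can tell there is no single SSE1 intrinsic that returns the sum directly, so this sketch builds it from a shuffle, a move and two adds:

```c
#include <xmmintrin.h>  /* SSE: __m128, _mm_mul_ps, _mm_add_ps, _mm_shuffle_ps, ... */

/* Hypothetical helper: sum the four floats held in v and return a scalar. */
static inline float hsum_ps(__m128 v)
{
    /* shuf = [v1, v0, v3, v2]: swap neighbours within each pair */
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1));
    /* sums = [v0+v1, v0+v1, v2+v3, v2+v3] */
    __m128 sums = _mm_add_ps(v, shuf);
    /* bring the upper pair sum (v2+v3) down into the low lane */
    shuf = _mm_movehl_ps(shuf, sums);
    /* low lane = (v0+v1) + (v2+v3) */
    sums = _mm_add_ss(sums, shuf);
    /* extract the low lane as a scalar float */
    return _mm_cvtss_f32(sums);
}

/* Hypothetical helper: multiply four x values by four y values and sum them. */
static inline float dot4(const float *x, const float *y)
{
    __m128 x_vector = _mm_loadu_ps(x);   /* unaligned loads, assumed for safety */
    __m128 y_vector = _mm_loadu_ps(y);
    __m128 xy_vector = _mm_mul_ps(x_vector, y_vector);
    return hsum_ps(xy_vector);
}
```

With this, the line in the inner loop would become array[i] += hsum_ps(xy_vector). It is probably cheaper to keep a running __m128 accumulator across the j loop with _mm_add_ps and call the horizontal sum only once per i. On newer hardware, _mm_hadd_ps (SSE3) and _mm_dp_ps (SSE4.1) also exist and may be alternatives.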