I suggest using bitwise and with a mask. Positive and negative values have the same representation, only the most significant bit differs, it is 0 for positive values and 1 for negative values, see double precision number format. You can use one of these:
inline __m128 abs_ps(__m128 x) {
static const __m128 sign_mask = _mm_set1_ps(-0.f); // -0.f = 1 << 31
return _mm_andnot_ps(sign_mask, x);
}
inline __m128d abs_pd(__m128d x) {
static const __m128d sign_mask = _mm_set1_pd(-0.); // -0. = 1 << 63
return _mm_andnot_pd(sign_mask, x); // !sign_mask & x
}
Also, it might be a good idea to unroll the loop to break the loop-carried dependency chain. Since this is a sum of nonnegative values, the order of summation is not important:
double norm(const double* sima, const double* simb) {
__m128d* sima_pd = (__m128d*) sima;
__m128d* simb_pd = (__m128d*) simb;
__m128d sum1 = _mm_setzero_pd();
__m128d sum2 = _mm_setzero_pd();
for(int k = 0; k < 3072/2; k+=2) {
sum1 += abs_pd(_mm_sub_pd(sima_pd[k], simb_pd[k]));
sum2 += abs_pd(_mm_sub_pd(sima_pd[k+1], simb_pd[k+1]));
}
__m128d sum = _mm_add_pd(sum1, sum2);
__m128d hsum = _mm_hadd_pd(sum, sum);
return *(double*)&hsum;
}
By unrolling and breaking the dependency (sum1 and sum2 are now independent), you let the processor execute the additions our of order. Since the instruction is pipelined on a modern CPU, the CPU can start working on a new addition before the previous one is finished. Also, bitwise operations are executed on a separate execution unit, the CPU can actually perform it in the same cycle as addition/subtraction. I suggest Agner Fog's optimization manuals.
Finally, I don't recommend using openMP. The loop is too small and the overhead of distribution the job among multiple threads might be bigger than any potential benefit.