Here is an alternative OpenMP implementation that is simpler (and a bit faster on many-core platforms):
double dot_prod_parallel(double* v1, double* v2, int dim)
{
TimeMeasureHelper helper;
double sum = 0.;
#pragma omp parallel for reduction(+:sum)
for (int i = 0; i < dim; ++i)
sum += v1[i] * v2[i];
return sum;
}
GCC ad ICC are able to vectorize this loop in -O3
. Clang 13.0 fail to do this, even with -ffast-math
and even with explicit OpenMP SIMD instructions as well as a with loop tiling. This appears to be a bug of the Clang's optimizer related to OpenMP... Note that you can use -mavx
to use the AVX instruction set which can be up to twice as fast as SSE (default). It is available on almost all recent x86-64 PC processors.