I am trying to find the most efficient way to multiply two 2D arrays (single precision) in C, and started with the naive approach of implementing it directly from the arithmetic rules:
for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}
It worked, but was probably not the fastest routine on earth. Switching to pointer arithmetic and unrolling the loop improved the speed. However, when I applied SIMD, the speed dropped again.
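The unrolled pointer version is not shown here. A minimal sketch of what such a routine might look like, assuming four independent accumulators and a scalar tail loop (the name dot_unrolled and the unroll factor of 4 are illustrative, not the code that was actually measured):

```c
#include <stddef.h>

/* Hypothetical unrolled dot product; name and unroll factor are illustrative. */
float dot_unrolled(const float *pt_a, const float *pt_b, size_t n)
{
    /* four independent accumulators, so consecutive adds do not wait on each other */
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    const float *end4 = pt_a + (n & ~(size_t)3);  /* end of the part divisible by 4 */
    const float *end  = pt_a + n;

    while (pt_a < end4) {
        s0 += pt_a[0] * pt_b[0];
        s1 += pt_a[1] * pt_b[1];
        s2 += pt_a[2] * pt_b[2];
        s3 += pt_a[3] * pt_b[3];
        pt_a += 4;
        pt_b += 4;
    }

    /* remaining 0 to 3 elements */
    while (pt_a < end) {
        s0 += *pt_a++ * *pt_b++;
    }

    return (s0 + s1) + (s2 + s3);
}
```

Note that splitting the sum across several accumulators changes the order of the floating-point additions, so the result can differ slightly from the naive loop.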
To be more precise: compiled with Intel oneAPI using -O3 on an Intel Core i5-4690 at 3.5 GHz, I see the following results:
- Naive implementation: Approx. 800 MFlop/s
- Pointer arithmetic with loop unrolling: Up to 5 GFlop/s
- Applying SIMD: 3.5 to 5 GFlop/s
The speed of course varied with the size of the vectors and between test runs, so the figures above are only indicative. Still, they raise the question of why the SIMD routine does not give a significant boost:
float hsum_float_avx(float *pt_a, float *pt_b) {
    __m256 AVX2_vect1, AVX2_vect2, res_mult, hsum;
    float sumAVX;

    // load unaligned memory into two vectors
    AVX2_vect1 = _mm256_loadu_ps(pt_a);
    AVX2_vect2 = _mm256_loadu_ps(pt_b);

    // multiply the two vectors
    res_mult = _mm256_mul_ps(AVX2_vect1, AVX2_vect2);

    // calculate horizontal sum of resulting vector
    hsum = _mm256_hadd_ps(res_mult, res_mult);
    hsum = _mm256_add_ps(hsum, _mm256_permute2f128_ps(hsum, hsum, 0x1));

    // store result
    _mm_store_ss(&sumAVX, _mm_hadd_ps(_mm256_castps256_ps128(hsum), _mm256_castps256_ps128(hsum)));
    return sumAVX;
}
There must be something wrong, but I cannot find it, so any hint would be highly appreciated.
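One variant that might be worth benchmarking against the routine above, assuming the routine is currently called once per block of eight floats: keep the partial sums in a vector register across the whole loop and perform the horizontal reduction only once at the end. This is a minimal sketch, not a measured result. The name dot_avx_accumulate, the length parameter n, and the scalar tail loop are illustrative, and the target attribute is a GCC/Clang extension that is only there so the function builds without -mavx:

```c
#include <immintrin.h>
#include <stddef.h>

/* Hypothetical variant: vertical accumulation, one horizontal sum at the end.
   The target attribute (GCC/Clang) can be dropped when compiling with -mavx. */
__attribute__((target("avx")))
float dot_avx_accumulate(const float *pt_a, const float *pt_b, size_t n)
{
    __m256 acc = _mm256_setzero_ps();   // eight running partial sums
    size_t i = 0;

    // process full blocks of eight floats
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(pt_a + i);
        __m256 vb = _mm256_loadu_ps(pt_b + i);
        acc = _mm256_add_ps(acc, _mm256_mul_ps(va, vb));
    }

    // single horizontal reduction: 8 -> 4 -> 2 -> 1
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);

    // scalar tail for the remaining 0 to 7 elements
    for (; i < n; i++) {
        sum += pt_a[i] * pt_b[i];
    }
    return sum;
}
```

The design choice here is that the loop body contains only loads, one multiply, and one add, while the shuffle-heavy horizontal sum runs once per call instead of once per eight elements.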