
I am trying to find the most efficient way to multiply two 2dim-array (single precision) in C, and I started with the naive idea of implementing it by following the arithmetic rules:

for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}

It worked, but was probably not the fastest routine on earth. Switching to pointer arithmetic and doing some loop unrolling improved the speed. However, when applying SIMD the speed dropped again.
To be more precise: compiled with Intel oneAPI at -O3 on an Intel Core i5-4690 (3.5 GHz), I see the following results:

  • Naive implementation: approx. 800 MFlop/s
  • Pointer arithmetic with loop unrolling: up to 5 GFlop/s
  • Applying SIMD: 3.5–5 GFlop/s

The speed of course varied with the size of the vectors and between test runs, so the figures above are only indicative, but they still raise the question of why the SIMD routine does not give a significant push:

float hsum_float_avx(float *pt_a, float *pt_b) {
    __m256 AVX2_vect1, AVX2_vect2, res_mult, hsum;
    float sumAVX;

    // load unaligned memory into two vectors
    AVX2_vect1 = _mm256_loadu_ps(pt_a);
    AVX2_vect2 = _mm256_loadu_ps(pt_b);

    // multiply the two vectors
    res_mult = _mm256_mul_ps(AVX2_vect1, AVX2_vect2);

    // calculate horizontal sum of resulting vector
    hsum = _mm256_hadd_ps(res_mult, res_mult);
    hsum = _mm256_add_ps(hsum, _mm256_permute2f128_ps(hsum, hsum, 0x1));

    // store result
    _mm_store_ss(&sumAVX, _mm_hadd_ps(_mm256_castps256_ps128(hsum), _mm256_castps256_ps128(hsum)));

    return sumAVX;
}

There must be something wrong, but I cannot find it, so any hint would be highly appreciated.

MarioH
  • The horizontal add instructions are not known for being particularly fast. – Shawn May 11 '21 at 18:22
  • And as such, it's often helpful to accumulate vector-wise, i.e. something like `sum = _mm256_add_ps(sum, res_mult);` and only `hadd` once when the loop is all finished (see the sketch after these comments). – Nate Eldredge May 11 '21 at 18:29
  • These days I'd probably stick with the original loop using an OpenMP SIMD pragma to explicitly request that the compiler vectorize it (if it isn't already doing so when you check the assembly; might require tweaking optimization options). – Shawn May 11 '21 at 18:34
  • Your text says "two 2dim-array" but your code looks like it is computing one dot product of two one-dimensional vectors. You should clarify the question. If you are multiplying two two-dimensional arrays, cache issues are a huge factor. – Eric Postpischil May 11 '21 at 18:42
  • You are right, maybe the explanation was misleading: in fact I am multiplying two 2dim-arrays. I thought about the cache issue as well, but the maximum size of the array is in the range of 10,000, meaning that I have 2 x 80 kB, where even the L1 cache should not be the issue. – MarioH May 11 '21 at 19:59
  • I deliberately did not want to use OpenMP SIMD, as my intention was to understand more about the principles of SSE/AVX. I will give it a try and post the results. – MarioH May 11 '21 at 20:02
  • Related: [Dot Product of Vectors with SIMD](//stackoverflow.com/q/47405717) shows how compilers vectorize and unroll (if you let them pretend FP math is associative via `-ffast-math`). Do the SIMD horizontal sum *once* at the end, not inside the inner loop! But if you actually have matrices, so you're doing a row * column dot product for each element of the output, you may not want to do them all separately even if you have one transposed so rows in one and columns in the other are contiguous in memory. See also [What Every Programmer Should Know About Memory?](//stackoverflow.com/q/8126311) – Peter Cordes May 11 '21 at 20:05
  • If you want to understand more, see [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) re: unrolling a dot product with multiple accumulators to hide FP latency. – Peter Cordes May 11 '21 at 20:05
  • A simple integer SIMD example is [Integer dot product using SSE/AVX?](https://stackoverflow.com/q/23186348) - notice no horizontal / shuffle stuff in the loop. And finally found a Q&A with the same performance bug as yours, and a simple answer: [How can i optimize my AVX implementation of dot product?](https://stackoverflow.com/q/30561779) shows moving the hsum out of the loop. – Peter Cordes May 11 '21 at 20:11
  • And [AVX2: Computing dot product of 512 float arrays](https://stackoverflow.com/q/59494745) puts it all together with intrinsics. – Peter Cordes May 11 '21 at 20:17
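
Putting these comments together, a minimal sketch of the suggested structure (vector accumulators inside the loop, a single horizontal sum after the loop) might look roughly like the following. The function name, the two-accumulator unroll, and the assumption that n is a multiple of 16 are illustrative choices, not taken from the question:

#include <immintrin.h>
#include <stddef.h>

float dot_product_avx(const float *a, const float *b, size_t n) {
    // Two independent accumulators help hide the latency of the FP adds.
    __m256 acc0 = _mm256_setzero_ps();
    __m256 acc1 = _mm256_setzero_ps();

    // Assumes n is a multiple of 16; a real version needs a tail loop.
    for (size_t i = 0; i < n; i += 16) {
        acc0 = _mm256_add_ps(acc0, _mm256_mul_ps(_mm256_loadu_ps(a + i),
                                                 _mm256_loadu_ps(b + i)));
        acc1 = _mm256_add_ps(acc1, _mm256_mul_ps(_mm256_loadu_ps(a + i + 8),
                                                 _mm256_loadu_ps(b + i + 8)));
    }

    // The horizontal sum happens exactly once, outside the loop.
    __m256 acc = _mm256_add_ps(acc0, acc1);
    __m128 lo  = _mm256_castps256_ps128(acc);
    __m128 hi  = _mm256_extractf128_ps(acc, 1);
    __m128 s   = _mm_add_ps(lo, hi);                // four partial sums
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));         // two partial sums
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));  // final scalar sum
    return _mm_cvtss_f32(s);
}

On FMA-capable hardware such as the Haswell CPU mentioned above, the multiply and add could also be fused with _mm256_fmadd_ps, and more accumulators may help further; see the linked answers for details.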

1 Answer


If your compiler supports OpenMP 4.0 or later, I'd use that to request that the compiler vectorize the original loop (which it might already be doing at a high enough optimization level, but OpenMP lets you give hints about things like alignment to improve the results). Compared with AVX intrinsics, this has the advantage that it works on other architectures like ARM, or with other x86 SIMD instruction sets (assuming you tell the compiler to target them), with a simple recompilation instead of a rewrite of your code:

float sum = 0.0f;
#pragma omp simd reduction(+:sum) 
for (i = 0; i < n; i++) {
    sum += a[i] * b[i];
}
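Note that the pragma normally only takes effect if OpenMP (or at least its SIMD subset) is enabled at compile time; for GCC and Clang that is `-fopenmp` or `-fopenmp-simd`, and the Intel compilers have equivalent switches. The exact flag name depends on the compiler, so check its documentation.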
Shawn
  • Indeed, and you can look at the compiler-generated asm to see how it vectorized. (Hopefully to at least 4 vector accumulators, preferably 8. [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527)) – Peter Cordes May 11 '21 at 18:58
  • Thanks a lot to all: I will try the omp simd pragma as well as any other hint. Honestly speaking I have (successfully) avoided Assembler now for more than 40 years and I am not sure whether this old horse will learn new tricks ... :-) But I am afraid that I will not further advance with this topic unless digging deeper into it. – MarioH May 11 '21 at 20:11
  • NOTE: I just compiled with the omp simd pragma but unfortunately there is no major difference. – MarioH May 11 '21 at 20:22
  • @MarioH Did you remember to compile with OpenMP support enabled? And did you check to see if your compiler was already vectorizing the loop? – Shawn May 11 '21 at 20:23
  • Both gcc and clang do a rather [poor job](https://gcc.godbolt.org/z/oGdzarWxf) at generating an efficient reduction step at the end of the loop. Also, gcc uses a single accumulator vector, which prevents leveraging ILP. The bottom line is, if you want your code fast, you should write it yourself. – Andrey Semashev May 12 '21 at 00:20