4

I am attempting to use SIMD instructions to speed up a dot product calculation in my C code. However, the run times of my functions are approximately equal. It'd be great if someone could explain why and how to speed up the calculation.

Specifically, I'm attempting to calculate the dot product of two arrays with about 10,000 elements in them. My regular C function is as follows:

 float my_dotProd( float const * const x, float const * const y, size_t const N ){
   // N is the number of elements in the arrays
   size_t i;
   float out=0;

   for( i=0; i < N; ++i ){
     out += x[i] * y[i];
   }

   return out;
 }

My function where I use AVX SIMD commands is as follows:

 void my_malloc( size_t nBytes, void ** ptrPtr ){
   int boundary = 32;
   posix_memalign( ptrPtr, boundary, nBytes );
 }

 float cimpl_sum_m128( __m128 x ){
   float out;
   __m128 sum = x;
   sum = _mm_hadd_ps( sum, sum );
   sum = _mm_hadd_ps( sum, sum );
   out = _mm_cvtss_f32( sum );
   return out;
 }

 float my_sum_m256( __m256 x ){
   float out1, out2;
   __m128 hi = _mm256_extractf128_ps(x, 1);
   __m128 lo = _mm256_extractf128_ps(x, 0);
   out1 = cimpl_sum_m128( hi );
   out2 = cimpl_sum_m128( lo );
   return out1 + out2;
 }

 float my_dotProd( float const * const x, float const * const y, size_t const N ){
   // N is the number of elements in the arrays
   size_t i=0;
   float out=0;
   float *tmp;

   __m256 summed, *l, *r;

   if( N > 7 ){
     my_malloc( sizeof(float) * 8, (void**) &tmp );
     summed = _mm256_set1_ps(0.0f);
     l = (__m256*) x;
     r = (__m256*) y;

     for( i=0; i < N-7; i+=8, ++l, ++r ){
       summed = _mm256_add_ps( summed, _mm256_mul_ps( *l, *r ) );
     }
     _mm256_store_ps( tmp, summed );

     out += my_sum_m256( summed );
     free( tmp );
   }

   for( ; i < N; ++i ){
     out += x[i] * y[i];
   }

   return out;
 }

My test routine is:

 int test_dotProd(){
   float *x, *y;
   size_t i, N;
   float answer, result;
   float err;

   N = 100000;  // Fails

   my_malloc( sizeof(float) * N, (void**) &x );
   my_malloc( sizeof(float) * N, (void**) &y );

   answer = 0;
   for( i=0; i<N; ++i ){
     x[i]=i; y[i]=i;
     answer += (float)i * (float)i;
   }

   result = my_dotProd( x, y, N );

   err = fabs( result - answer ) / answer;

   free( x );
   free( y );
   return err < 5e-7;
 }

And I'm using clock to measure the runtime like this:

 timeStart = clock();
 testStatus = test_dotProd();
 timeTaken = (int)( clock() - timeStart );
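
For reference, timing just the my_dotProd call itself (rather than the whole test routine, which also allocates and fills the arrays) would look something like the sketch below; the time_dotProd wrapper and the repetition count are illustrative and not part of my actual code:

 #include <stddef.h>
 #include <time.h>

 float my_dotProd( float const * const x, float const * const y, size_t const N );

 // Sketch only: time the dot-product call with a monotonic clock and
 // average over several repetitions.
 double time_dotProd( float const *x, float const *y, size_t N, int reps ){
   struct timespec t0, t1;
   volatile float sink = 0;   // keep the calls from being optimized away
   int r;

   clock_gettime( CLOCK_MONOTONIC, &t0 );
   for( r=0; r < reps; ++r ){
     sink += my_dotProd( x, y, N );
   }
   clock_gettime( CLOCK_MONOTONIC, &t1 );

   (void) sink;
   return ( (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)*1e-9 ) / reps;
 }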

I realize that the my_sum_m256 operation could be made more efficient (a cheaper horizontal sum is sketched below), but I think that should have only a small effect on the runtime. I would have guessed that the SIMD code would be about eight times as fast. Any thoughts?
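
For reference, a cheaper horizontal sum would look roughly like the following sketch (the hsum256_ps name is just illustrative); it replaces the two hadd instructions, which each decode to several uops, with plain shuffles and adds:

 #include <immintrin.h>

 // Sketch of a cheaper horizontal sum of a __m256.
 float hsum256_ps( __m256 v ){
   __m128 lo   = _mm256_castps256_ps128( v );    // low 128 bits (no instruction needed)
   __m128 hi   = _mm256_extractf128_ps( v, 1 );  // high 128 bits
   __m128 sum  = _mm_add_ps( lo, hi );           // 4 partial sums
   __m128 shuf = _mm_movehdup_ps( sum );         // duplicate odd elements
   __m128 sums = _mm_add_ps( sum, shuf );        // [0+1, ., 2+3, .]
   shuf = _mm_movehl_ps( shuf, sums );           // move the 2+3 sum to element 0
   sums = _mm_add_ss( sums, shuf );              // (0+1) + (2+3)
   return _mm_cvtss_f32( sums );
 }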

Thank you all for your help :)

Paul R
user24205
  • How much are the times? And there's no point in using `malloc` for tmp. That will be a lot slower than just using stack. – Sami Kuhmonen Nov 21 '17 at 05:25
  • Did you check whether your compiler autovectorized the scalar code? – EOF Nov 21 '17 at 07:14
  • Related: [How can i optimize my AVX implementation of dot product?](https://stackoverflow.com/q/30561779) shows how to move the hsum out of the loop. [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](https://stackoverflow.com/q/45113527) shows how to further optimize with multiple accumulators. – Peter Cordes May 11 '21 at 20:15
  • [AVX2: Computing dot product of 512 float arrays](https://stackoverflow.com/q/59494745) has a good manually-vectorized loop using intrinsics. – Peter Cordes May 11 '21 at 20:16

2 Answers

14

First of all: you shouldn't assume that you can optimize better than the compiler can.

Yes, you are now using AVX instructions in your "optimized" code. But you also wrote code that the compiler now has trouble unrolling, on top of the plain vectorization.

For comparison, let's have a look at what a compiler actually makes of your "slow" C implementation, just the hot loop, without the epilogue.

ICC, compiled with -O3 -march=skylake -ffast-math:

..B1.13:
    vmovups   ymm2, YMMWORD PTR [rsi+rdi*4]
    vmovups   ymm3, YMMWORD PTR [32+rsi+rdi*4]
    vfmadd231ps ymm1, ymm2, YMMWORD PTR [r8+rdi*4]
    vfmadd231ps ymm0, ymm3, YMMWORD PTR [32+r8+rdi*4]
    add       rdi, 16
    cmp       rdi, rax
    jb        ..B1.13

Clang, with the same parameters, is even more pessimistic and unrolls this to the following:

.LBB0_4:
    vmovups ymm4, ymmword ptr [rsi + 4*rcx]
    vmovups ymm5, ymmword ptr [rsi + 4*rcx + 32]
    vmovups ymm6, ymmword ptr [rsi + 4*rcx + 64]
    vmovups ymm7, ymmword ptr [rsi + 4*rcx + 96]
    vfmadd132ps     ymm4, ymm0, ymmword ptr [rdi + 4*rcx]
    vfmadd132ps     ymm5, ymm1, ymmword ptr [rdi + 4*rcx + 32]
    vfmadd132ps     ymm6, ymm2, ymmword ptr [rdi + 4*rcx + 64]
    vfmadd132ps     ymm7, ymm3, ymmword ptr [rdi + 4*rcx + 96]
    vmovups ymm0, ymmword ptr [rsi + 4*rcx + 128]
    vmovups ymm1, ymmword ptr [rsi + 4*rcx + 160]
    vmovups ymm2, ymmword ptr [rsi + 4*rcx + 192]
    vmovups ymm3, ymmword ptr [rsi + 4*rcx + 224]
    vfmadd132ps     ymm0, ymm4, ymmword ptr [rdi + 4*rcx + 128]
    vfmadd132ps     ymm1, ymm5, ymmword ptr [rdi + 4*rcx + 160]
    vfmadd132ps     ymm2, ymm6, ymmword ptr [rdi + 4*rcx + 192]
    vfmadd132ps     ymm3, ymm7, ymmword ptr [rdi + 4*rcx + 224]
    add     rcx, 64
    add     rax, 2
    jne     .LBB0_4

Surprise: both compilers were already able to use AVX instructions, no intrinsic hacking required.

But what's more interesting is that both compilers decided that one accumulation register wasn't enough to saturate the AVX pipeline, and instead used 2 and 4 accumulation registers respectively. Having more operations in flight helps mask the latency of the FMA, up to the point where the actual memory throughput limit is reached.
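
Written out by hand with intrinsics, that trick looks roughly like the sketch below (the dotProd_4acc name, the FMA requirement, and the assumption that N is a multiple of 32 are simplifications of this sketch, not something the compilers need):

 #include <immintrin.h>
 #include <stddef.h>

 // Sketch: four independent accumulators, so consecutive FMAs do not have to
 // wait on each other's results.  Assumes FMA support and N a multiple of 32.
 float dotProd_4acc( float const *x, float const *y, size_t N ){
   __m256 acc0 = _mm256_setzero_ps(), acc1 = _mm256_setzero_ps();
   __m256 acc2 = _mm256_setzero_ps(), acc3 = _mm256_setzero_ps();
   __m256 acc;
   __m128 lo;
   size_t i;

   for( i=0; i < N; i+=32 ){
     acc0 = _mm256_fmadd_ps( _mm256_loadu_ps( x+i    ), _mm256_loadu_ps( y+i    ), acc0 );
     acc1 = _mm256_fmadd_ps( _mm256_loadu_ps( x+i+8  ), _mm256_loadu_ps( y+i+8  ), acc1 );
     acc2 = _mm256_fmadd_ps( _mm256_loadu_ps( x+i+16 ), _mm256_loadu_ps( y+i+16 ), acc2 );
     acc3 = _mm256_fmadd_ps( _mm256_loadu_ps( x+i+24 ), _mm256_loadu_ps( y+i+24 ), acc3 );
   }

   // Combine the four accumulators once, after the loop, then reduce to a scalar.
   acc = _mm256_add_ps( _mm256_add_ps( acc0, acc1 ), _mm256_add_ps( acc2, acc3 ) );
   lo  = _mm_add_ps( _mm256_castps256_ps128( acc ), _mm256_extractf128_ps( acc, 1 ) );
   lo  = _mm_add_ps( lo, _mm_movehl_ps( lo, lo ) );
   lo  = _mm_add_ss( lo, _mm_movehdup_ps( lo ) );
   return _mm_cvtss_f32( lo );
 }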

Just don't forget the -ffast-math compiler option; without it, it wouldn't be legal to pull the final accumulation out of the vectorized loop, since that reassociates the floating-point additions.
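
If you would rather not enable -ffast-math for the whole file, explicitly marking the reduction is a common alternative; a minimal sketch, assuming GCC or Clang with -fopenmp-simd:

 #include <stddef.h>

 // Sketch: the OpenMP pragma explicitly permits reordering the floating-point
 // reduction, so the loop can be vectorized without -ffast-math.
 float dotProd_omp( float const *x, float const *y, size_t N ){
   float out = 0.0f;
   size_t i;

   #pragma omp simd reduction(+:out)
   for( i=0; i < N; ++i ){
     out += x[i] * y[i];
   }
   return out;
 }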


GCC, also with the same options, actually does "only" as well as your "optimized" solution:

.L7:
    add     r8, 1
    vmovaps ymm3, YMMWORD PTR [r9+rax]
    vfmadd231ps     ymm1, ymm3, YMMWORD PTR [rcx+rax]
    add     rax, 32
    cmp     r8, r10
    jb      .L7

However, GCC was still a little bit smarter in also adding a prologue to that loop, so it could use vmovaps (aligned memory access) instead of vmovups (unaligned memory access) for the first load.
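
Since the arrays in the question are already 32-byte aligned via posix_memalign, you could also just tell the compiler so and skip the peeling entirely; a sketch using the GCC/Clang builtin __builtin_assume_aligned (not standard C, and the dotProd_aligned name is just illustrative):

 #include <stddef.h>

 // Sketch: promise the compiler that both pointers are 32-byte aligned, so it
 // can emit aligned loads without an alignment prologue (GCC/Clang builtin).
 float dotProd_aligned( float const *x, float const *y, size_t N ){
   float const *xa = (float const *) __builtin_assume_aligned( x, 32 );
   float const *ya = (float const *) __builtin_assume_aligned( y, 32 );
   float out = 0.0f;
   size_t i;

   for( i=0; i < N; ++i ){
     out += xa[i] * ya[i];
   }
   return out;
 }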


For completeness, with pure AVX (-O3 -march=ivybridge -ffast-math):

ICC:

..B1.12:
    vmovups   xmm2, XMMWORD PTR [r8+rdi*4]
    vmovups   xmm5, XMMWORD PTR [32+r8+rdi*4]
    vinsertf128 ymm3, ymm2, XMMWORD PTR [16+r8+rdi*4], 1
    vinsertf128 ymm6, ymm5, XMMWORD PTR [48+r8+rdi*4], 1
    vmulps    ymm4, ymm3, YMMWORD PTR [rsi+rdi*4]
    vmulps    ymm7, ymm6, YMMWORD PTR [32+rsi+rdi*4]
    vaddps    ymm1, ymm1, ymm4
    vaddps    ymm0, ymm0, ymm7
    add       rdi, 16
    cmp       rdi, rax
    jb        ..B1.12

Clang:

.LBB0_5:
    vmovups xmm4, xmmword ptr [rdi + 4*rcx]
    vmovups xmm5, xmmword ptr [rdi + 4*rcx + 32]
    vmovups xmm6, xmmword ptr [rdi + 4*rcx + 64]
    vmovups xmm7, xmmword ptr [rdi + 4*rcx + 96]
    vinsertf128     ymm4, ymm4, xmmword ptr [rdi + 4*rcx + 16], 1
    vinsertf128     ymm5, ymm5, xmmword ptr [rdi + 4*rcx + 48], 1
    vinsertf128     ymm6, ymm6, xmmword ptr [rdi + 4*rcx + 80], 1
    vinsertf128     ymm7, ymm7, xmmword ptr [rdi + 4*rcx + 112], 1
    vmovups xmm8, xmmword ptr [rsi + 4*rcx]
    vmovups xmm9, xmmword ptr [rsi + 4*rcx + 32]
    vmovups xmm10, xmmword ptr [rsi + 4*rcx + 64]
    vmovups xmm11, xmmword ptr [rsi + 4*rcx + 96]
    vinsertf128     ymm8, ymm8, xmmword ptr [rsi + 4*rcx + 16], 1
    vmulps  ymm4, ymm8, ymm4
    vaddps  ymm0, ymm4, ymm0
    vinsertf128     ymm4, ymm9, xmmword ptr [rsi + 4*rcx + 48], 1
    vmulps  ymm4, ymm4, ymm5
    vaddps  ymm1, ymm4, ymm1
    vinsertf128     ymm4, ymm10, xmmword ptr [rsi + 4*rcx + 80], 1
    vmulps  ymm4, ymm4, ymm6
    vaddps  ymm2, ymm4, ymm2
    vinsertf128     ymm4, ymm11, xmmword ptr [rsi + 4*rcx + 112], 1
    vmulps  ymm4, ymm4, ymm7
    vaddps  ymm3, ymm4, ymm3
    add     rcx, 32
    cmp     rax, rcx
    jne     .LBB0_5

GCC:

.L5:
    vmovups xmm3, XMMWORD PTR [rdi+rax]
    vinsertf128     ymm1, ymm3, XMMWORD PTR [rdi+16+rax], 0x1
    vmovups xmm4, XMMWORD PTR [rsi+rax]
    vinsertf128     ymm2, ymm4, XMMWORD PTR [rsi+16+rax], 0x1
    add     rax, 32
    vmulps  ymm1, ymm1, ymm2
    vaddps  ymm0, ymm0, ymm1
    cmp     rax, rcx
    jne     .L5

Pretty much the same optimizations apply, just with a few additional operations since FMA is missing and unaligned 256-bit loads are not advisable on Ivy Bridge.

Ext3h
  • Good answer, but one minor nitpick: those arithmetic instructions are FMA3, not AVX. FMA3 is available on CPUs with AVX2 (Haswell onwards), but not on earlier CPUs with just AVX (e.g. Sandy Bridge/Ivy Bridge). OP seems to be only assuming AVX, not AVX2 or FMA3. – Paul R Nov 21 '17 at 09:08
  • @PaulR Right, my fault. Doesn't change much about the choice of compiler optimizations though. Both compilers still vectorize, unroll and interleave. – Ext3h Nov 21 '17 at 09:16
  • Indeed - I just thought it might be worth clarifying, and maybe seeing what you get for purely AVX auto-vectorization, just for curiosity's sake. – Paul R Nov 21 '17 at 09:17
  • Thanks for adding the AVX version - interesting that gcc does a poorer job than either clang or ICC. – Paul R Nov 21 '17 at 09:24
  • @PaulR Clang does by far the best job, IMHO. The Intel compiler emitted a lot of code paths optimized for very specific iteration counts, producing far too much bloat for my taste. And while it did in return use less registers, leaving more headroom for register renaming and hence speculative execution, I do prefer Clangs output as it masks instruction latencies a lot better, especially on non-Intel processors. – Ext3h Nov 21 '17 at 09:37
  • Yes, clang is pretty impressive - I've seen it do some really surprising things with SIMD code (e.g. re-interpreting hand-coded intrinsics to generate better code using different instruction sequences). – Paul R Nov 21 '17 at 09:41
  • @PaulR: AMD Piledriver has FMA3 (and FMA4 like Bulldozer). Bulldozer-family didn't get AVX2 until Excavator. So FMA3 is also available on some CPUs without AVX2, if you include non-Intel. – Peter Cordes Nov 21 '17 at 09:41
  • @PeterCordes: ah yes, good point - I don't normally do any AMD stuff (although did play with Epyc recently), so I tend to be rather Intel-centric in my world view. – Paul R Nov 21 '17 at 09:42
  • @Ext3h: it's not that unaligned 256b load is missing with `-march=ivybridge`, it's that gcc decides not to use it because if the data is *actually* misaligned at runtime (rather than just not known to be always aligned at compile time) it is a perf win to vmovups / vinsertf128. tune=generic and tune=sandybridge/ivybridge (and some AMD tunings) enable `-mavx256-split-unaligned-load` on purpose. [This is probably not ideal for `-mtune=generic` these days](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=80568). – Peter Cordes Nov 21 '17 at 09:45
  • @PaulR: I haven't run anything on an AMD CPU for a long time either, but I try to keep AMD in mind when thinking about what compilers should be doing for `tune=generic`. – Peter Cordes Nov 21 '17 at 09:46
  • @Ext3h: It would be nice if you included a Godbolt link to the code you're showing. I hadn't realized clang and ICC would also split unaligned loads. I guess only with tune=ivybridge, not with tune=generic like gcc does. – Peter Cordes Nov 21 '17 at 09:48
  • *Having more operations in flight helps masking the limited memory throughput.* [No, multiple accumulators hide the latency of `vfmaddps`](https://stackoverflow.com/a/45114487/224132), up to the point where you bottleneck on load throughput (max 2 YMM loads per clock) instead of loop-carried FMA latency (5 cycles on Haswell). *use less registers, leaving more headroom for register renaming and hence speculative execution*: Rewriting the same architectural register still needs a new PRF entry. Leaving some architectural registers untouched has a negligible effect on out-of-order window size. – Peter Cordes Nov 21 '17 at 09:55
  • @PeterCordes Thanks, added the links and replaced the bogus explanation for the additional accumulators. – Ext3h Nov 21 '17 at 10:01
2

Unfortunately, dot product is a memory-bound algorithm: the amount of computation is small compared to the amount of memory traffic it requires. So you can't execute it efficiently even with AVX (or AVX2). I have also implemented this algorithm in a similar way, but I only reached about a 60% performance improvement.
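
A rough back-of-the-envelope illustration (the throughput remarks in the comments assume a Haswell-like core and are only illustrative):

 #include <stdio.h>

 // Sketch: arithmetic intensity of the dot product.  Each 8-wide FMA consumes
 // two 32-byte vector loads, so the loop does very little math per byte moved.
 int main( void ){
   double bytes_per_fma = 2.0 * 32.0;  // two 256-bit input vectors
   double flops_per_fma = 2.0 * 8.0;   // 8 multiplies + 8 additions

   printf( "arithmetic intensity: %.2f flop/byte\n",
           flops_per_fma / bytes_per_fma );  // 0.25 flop/byte

   // At roughly 0.25 flop/byte the loop saturates the load ports (or DRAM,
   // once the arrays no longer fit in cache) long before it saturates the
   // FMA units, so wider vectors alone gain little.
   return 0;
 }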

ErmIg