So I'm trying my hand at optimising some code, and have run into some issues trying to vectorise it.
I essentially have a nested loop like this:

for (int i = 0; i < N; i++)
{
  for (int j = 0; j < N; j++)
  {
    // Bunch of calculations producing x and y
    // array[i] += (x * y);
  }
}

In the process of vectorising the inner loop, x and y both become vectors: x_vector holds four values in a register, and y_vector holds another four.
In order to add these to array[i], I need to multiply x_vector by y_vector, sum the four products into a single variable, and then add that to array[i]. So something like this:

__m128 x_vector ....
__m128 y_vector ....

__m128 xy_vector = _mm_mul_ps(x_vector, y_vector);

// now xy_vector holds all four products; I need to sum them into a single variable

float result = _mm_someInstruction_ps(xy_vector);
array[i] += result;

Is there an instruction in the Intel Intrinsics Guide that does this? I looked into the _mm_add_ps instruction, but it returns a vector. Is there an add instruction which sums the contents of the register and returns that result?
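
For reference, SSE has no single instruction that sums all four lanes of a register to a scalar; the usual pattern (covered in depth in the duplicate linked in the comments below) combines shuffles with adds. A minimal sketch, assuming SSE3 is available:

#include <pmmintrin.h>  // SSE3, for _mm_movehdup_ps

// Horizontal sum of the four floats in v, built from shuffles and adds.
static inline float hsum_ps_sse3(__m128 v)
{
  __m128 shuf = _mm_movehdup_ps(v);        // duplicate odd lanes: [v1, v1, v3, v3]
  __m128 sums = _mm_add_ps(v, shuf);       // [v0+v1, 2*v1, v2+v3, 2*v3]
  shuf        = _mm_movehl_ps(shuf, sums); // move v2+v3 down to the low lane
  sums        = _mm_add_ss(sums, shuf);    // lane 0 = (v0+v1) + (v2+v3)
  return _mm_cvtss_f32(sums);              // extract lane 0 as a scalar
}

With that helper, array[i] += hsum_ps_sse3(xy_vector); would work, but as the comments below explain, it is usually better to restructure the loop so that no horizontal sum is needed at all.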

  • Possible duplicate of [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86) – Peter Cordes Mar 11 '18 at 12:03
  • There are no fast horizontal sums of all the floats in a vector. Vectorize over 4 iterations of the *outer* loop, so you produce `array[i+0..3]` and don't need any horizontal operations, just vertical addition (sketched after these comments). (And vertical multiply to produce the `x*y`, hopefully? **How do `x` and `y` depend on `i` and `j`?**) – Peter Cordes Mar 11 '18 at 12:06
  • The initial calculation for the y_vector is `a[j] - a[i]` – JustACodingGrunt Mar 11 '18 at 12:26
  • BTW, this is similar to a matrix-multiply data access pattern, if x and y are coming from memory. Definitely take a look at cache-blocking optimizations as well as optimizing the ALU side of things with choice of intrinsics: https://stackoverflow.com/questions/1303182/how-does-blas-get-such-extreme-performance / https://stackoverflow.com/questions/35620853/how-to-write-a-matrix-matrix-product-that-can-compete-with-eigen / https://stackoverflow.com/questions/35401524/how-to-optimize-matrix-multiplication-matmul-code-to-run-fast-on-a-single-proc – Peter Cordes Mar 11 '18 at 12:34
  • And [What Every Programmer Should Know About Memory?](https://stackoverflow.com/questions/8126311/what-every-programmer-should-know-about-memory/47714514#47714514) has an appendix with an example of a cache-blocked SSE2 matrix multiply. – Peter Cordes Mar 11 '18 at 12:35
  • @PeterCordes Really helpful replies! However, when it comes to loop blocking, how do I determine the value to "chunk" the loop up into? Of course it has to do with the size of the cache, but at the moment I don't really know how to calculate this without trial and error. – JustACodingGrunt Mar 11 '18 at 12:56
  • I think you should aim to reuse data before eviction from L1D (~32kiB on most CPUs), or from L2 cache (256kiB on most CPUs). With a block size 1/2 or 1/4 of the cache size, that leaves some slack. It's a matter of tuning; use performance counters to measure L1d and L2 miss rates for different block sizes on your hardware (e.g. with Linux `perf`, or `ocperf.py` from https://github.com/andikleen/pmu-tools); see the blocking sketch below. – Peter Cordes Mar 11 '18 at 13:04
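
To make the outer-loop suggestion above concrete, here is a hypothetical sketch. It assumes N is a multiple of 4, that y = a[j] - a[i] as stated in the comments, and (purely for illustration) that x depends only on j and is supplied as an array x[]:

#include <xmmintrin.h>  // SSE

// Accumulate four array[i] values at once: each SIMD lane owns one i,
// so everything is a vertical operation and no horizontal sum is needed.
void accumulate(float *array, const float *a, const float *x, int N)
{
  for (int i = 0; i < N; i += 4)              // 4 outer iterations per pass
  {
    __m128 acc = _mm_loadu_ps(&array[i]);     // running sums for array[i+0..3]
    __m128 ai  = _mm_loadu_ps(&a[i]);         // a[i+0..3], one value per lane
    for (int j = 0; j < N; j++)
    {
      __m128 aj = _mm_set1_ps(a[j]);          // broadcast a[j] to all lanes
      __m128 y  = _mm_sub_ps(aj, ai);         // y = a[j] - a[i] in each lane
      __m128 xv = _mm_set1_ps(x[j]);          // hypothetical source of x
      acc = _mm_add_ps(acc, _mm_mul_ps(xv, y)); // vertical multiply and add
    }
    _mm_storeu_ps(&array[i], acc);            // write back array[i+0..3]
  }
}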
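
And a rough sketch of the loop blocking discussed in the last two comments, where BLOCK is a hypothetical block size to be tuned by measurement as suggested above:

const int BLOCK = 1024; // placeholder block size; tune with perf counters

for (int jj = 0; jj < N; jj += BLOCK)         // walk j in cache-sized chunks
{
  int jend = (jj + BLOCK < N) ? (jj + BLOCK) : N;
  for (int i = 0; i < N; i++)
  {
    for (int j = jj; j < jend; j++)           // a[jj..jend) stays hot in L1d
    {
      // same loop body as before
      // array[i] += x * y;
    }
  }
}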

0 Answers