I have an image-processing algorithm that computes a*b+c*d with AVX. The pseudocode is as follows:

#include <immintrin.h>

float *a = new float[N];
float *b = new float[N];
float *c = new float[N];
float *d = new float[N];

// assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // the accumulator must start at zero
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{

    //---Loading code---
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);

    __m256 abm=_mm256_mul_ps(am, bm);
    __m256 cdm=_mm256_mul_ps(cm, dm);
    __m256 abcdm=_mm256_add_ps(abm, cdm);
    sum=_mm256_add_ps(sum, abcdm);
}

Here a, b, c and d are allocated by me, but in practice they may be passed in by other programs, so 32-byte alignment is not guaranteed. According to what I found online, _mm256_loadu_ps should be used to load the data in that case.
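Whether a given pointer actually sits on a 32-byte boundary can be checked at run time. The following is a minimal sketch; is_aligned is a hypothetical helper name, not part of AVX or of any library mentioned here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: returns true if p is a multiple of `alignment` bytes.
// `alignment` must be a non-zero power of two (32 for 256-bit AVX loads).
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Calling is_aligned(a, 32) on each input before the loop would show whether the buffers returned by new just happen to be aligned in a particular run.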

However, if I replace the "---Loading code---" section above with the aligned variant (changing _mm256_loadu_ps to _mm256_load_ps), as follows:

__m256 am=_mm256_load_ps(a+i);
__m256 bm=_mm256_load_ps(b+i);
__m256 cm=_mm256_load_ps(c+i);
__m256 dm=_mm256_load_ps(d+i);

The algorithm is nearly 20% faster, no error is reported, and the final image-processing results show no problems. I have two questions:

  1. Why is _mm256_load_ps faster, and why does it still produce correct results?
  2. How should I understand Intel's description of _mm256_load_ps: "Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated."?
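For buffers that the program allocates itself, one way to meet the 32-byte requirement quoted above is to request aligned storage explicitly. This is a minimal sketch, assuming a C++17 standard library that provides std::aligned_alloc; alloc_floats_32 is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: allocate room for n floats on a 32-byte boundary.
// The caller releases the memory with std::free().
inline float* alloc_floats_32(std::size_t n) {
    assert(n > 0); // zero-size requests are implementation-defined
    // std::aligned_alloc expects the size to be a multiple of the alignment,
    // so round the byte count up to the next multiple of 32.
    std::size_t bytes = ((n * sizeof(float) + 31) / 32) * 32;
    return static_cast<float*>(std::aligned_alloc(32, bytes));
}
```

This only helps for memory the program owns. Buffers handed in by other programs still need either the unaligned load or a run-time alignment check.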
Peter Cordes
New Coder
  • Can you show how you're compiling the code (what compiler and what compiler flags you're using, etc.), and also how you're timing it? – Paul R Feb 07 '21 at 12:30
  • Compiler flags: -fopenmp -mavx2 -mfma -O3, and I use OpenCV's getTickCount() to time it. – New Coder Feb 07 '21 at 13:30
  • If you're using a decently up-to-date compiler, there's absolutely no point in writing this in machine-specific intrinsics. – EOF Feb 07 '21 at 14:38
  • So you're saying that you need to support the case when `a,b,c,d` are not aligned - then you should do a test where they are actually not aligned, and see what happens. Your current test doesn't accomplish this, because the pointers returned from `new` might happen to be aligned properly. – Nate Eldredge Feb 07 '21 at 16:43
  • When compiling with gcc you should definitely pass `-march=native` (or some reasonable minimal target architecture), instead of `-mavx2 -mfma`. Otherwise, gcc will split unaligned loads into two loads. – chtz Feb 07 '21 at 17:26
  • @EOF You would at least have to allow the compiler to do non-associative math optimizations to get the same result from a trivial loop. – chtz Feb 07 '21 at 17:28
  • Not faulting with `load` either means your data was aligned after all, or else GCC folded the loads into 256-bit memory source operands for ALU instructions (thus not requiring alignment), where it wasn't willing to do that for `loadu` with the default `-mtune=generic` (which favours first-gen Sandybridge and Bulldozer at the expense of current CPUs, see the linked duplicate). – Peter Cordes Feb 08 '21 at 00:33
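Two suggestions from the comments can be combined in a small sketch: a plain scalar loop that the compiler is free to vectorize, and a test that feeds it pointers that are deliberately not 32-byte aligned. The names sum_ab_cd and sum_on_misaligned_inputs are hypothetical, and the function returns a single float rather than the eight partial sums kept in the question's __m256 accumulator.

```cpp
#include <cassert>
#include <cstddef>

// Plain scalar reference: sum over i of a[i]*b[i] + c[i]*d[i].
// Compilers may need relaxed floating-point settings (for example
// -ffast-math) before they vectorize this reduction, because
// reordering float additions can change the result.
inline float sum_ab_cd(const float* a, const float* b,
                       const float* c, const float* d, std::size_t n) {
    assert(n == 0 || (a && b && c && d));
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i] + c[i] * d[i];
    }
    return sum;
}

// Hypothetical test helper: each input starts one float (4 bytes) past a
// 32-byte boundary, so none of the four pointers is 32-byte aligned.
inline float sum_on_misaligned_inputs() {
    constexpr std::size_t n = 16;
    alignas(32) float sa[n + 1];
    alignas(32) float sb[n + 1];
    alignas(32) float sc[n + 1];
    alignas(32) float sd[n + 1];
    for (std::size_t i = 0; i < n + 1; ++i) {
        sa[i] = 1.0f;
        sb[i] = 2.0f;
        sc[i] = 3.0f;
        sd[i] = 4.0f;
    }
    // 16 elements, each contributing 1*2 + 3*4 = 14.
    return sum_ab_cd(sa + 1, sb + 1, sc + 1, sd + 1, n);
}
```

Running the intrinsic version on the same offset pointers would be the test Nate Eldredge describes: if the aligned load really is emitted as a standalone instruction, that is where a fault would show up.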

0 Answers