I have an image-processing algorithm that computes a*b+c*d
with AVX. The pseudocode is as follows:
float *a=new float[N];
float *b=new float[N];
float *c=new float[N];
float *d=new float[N];
//assign values to a, b, c and d
__m256 sum = _mm256_setzero_ps(); // start the accumulator at zero
for (int i = 0; i < N; i += 8) // assume that N is a multiple of 8
{
    //---Loading code---
    __m256 am=_mm256_loadu_ps(a+i);
    __m256 bm=_mm256_loadu_ps(b+i);
    __m256 cm=_mm256_loadu_ps(c+i);
    __m256 dm=_mm256_loadu_ps(d+i);
    __m256 abm=_mm256_mul_ps(am, bm);
    __m256 cdm=_mm256_mul_ps(cm, dm);
    __m256 abcdm=_mm256_add_ps(abm, cdm);
    sum=_mm256_add_ps(sum, abcdm);
}
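For checking results, the loop above is meant to be equivalent to the scalar version below. This is only a sketch: the name sum_lanes_scalar is made up for illustration, and it assumes the element count is a multiple of 8, as the AVX loop does. Each of the 8 output values corresponds to one lane of the __m256 accumulator.

```cpp
#include <array>
#include <cstddef>

// Scalar reference for the AVX loop: lane k accumulates
// a[i+k]*b[i+k] + c[i+k]*d[i+k] over every block of 8 elements.
// Assumes n is a multiple of 8. sum_lanes_scalar is a hypothetical name.
std::array<float, 8> sum_lanes_scalar(const float* a, const float* b,
                                      const float* c, const float* d,
                                      std::size_t n)
{
    std::array<float, 8> sum{}; // value-initialized to zero
    for (std::size_t i = 0; i < n; i += 8)
        for (std::size_t k = 0; k < 8; ++k)
            sum[k] += a[i + k] * b[i + k] + c[i + k] * d[i + k];
    return sum;
}
```

Comparing these 8 values against the accumulator stored with _mm256_storeu_ps should show whether both loading variants really produce the same output. Small floating-point differences are possible if the compiler fuses the multiply and add.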
Here I allocate a, b, c and d myself, but in practice they may be passed in by other programs, so 32-byte alignment is not guaranteed. From what I have read online, _mm256_loadu_ps should be used
to load the data in that case.
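One thing that may be worth checking is whether the four pointers are in fact misaligned at run time. A minimal sketch of such a check follows. The helper name is_aligned32 is made up for illustration, and it assumes that converting a pointer to std::uintptr_t yields the numeric address, which holds on mainstream platforms.

```cpp
#include <cstdint>

// Returns true if p lies on a 32-byte boundary, which is what
// _mm256_load_ps requires. is_aligned32 is a hypothetical name.
inline bool is_aligned32(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 32 == 0;
}
```

Calling is_aligned32(a), is_aligned32(b), and so on before the loop would show whether the buffers just happen to be 32-byte aligned in a given run.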
However, if I switch the "---Loading code---" section above to the variant without the "u" (changing _mm256_loadu_ps to _mm256_load_ps), as follows:
__m256 am=_mm256_load_ps(a+i);
__m256 bm=_mm256_load_ps(b+i);
__m256 cm=_mm256_load_ps(c+i);
__m256 dm=_mm256_load_ps(d+i);
The algorithm is nearly 20% faster, no error is reported, and the final image-processing results look correct. I have two questions:
- Why is the version using _mm256_load_ps faster, and why does it still run without errors and produce correct results?
- How should I understand Intel's description of _mm256_load_ps: "Load 256-bits (composed of 8 packed single-precision (32-bit) floating-point elements) from memory into dst. mem_addr must be aligned on a 32-byte boundary or a general-protection exception may be generated."?
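For the buffers I do control, the 32-byte requirement in that description could be met at allocation time. The sketch below assumes a C++17 compiler (it uses the aligned form of operator new), and the helper names alloc_floats32 and free_floats32 are made up for illustration. It does not help with buffers handed over by other programs.

```cpp
#include <cstddef>
#include <new>

// Allocates n floats on a 32-byte boundary using the C++17 aligned
// operator new[]. alloc_floats32 is a hypothetical helper name.
float* alloc_floats32(std::size_t n)
{
    return static_cast<float*>(
        ::operator new[](n * sizeof(float), std::align_val_t(32)));
}

// Releases memory obtained from alloc_floats32. The same alignment
// value must be passed to the matching aligned operator delete[].
void free_floats32(float* p)
{
    ::operator delete[](p, std::align_val_t(32));
}
```

With float *a = alloc_floats32(N); in place of new float[N], the address a+i is 32-byte aligned for every i that is a multiple of 8, because 8 floats occupy exactly 32 bytes.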