float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);
This is undefined behavior: _mm256_loadu_ps reads 8 floats, but each array holds only 5, so both loads read beyond the end of the array.
You can clear all the elements in _a and _b with _mm256_setzero_ps():
__m256 _a = _mm256_setzero_ps();
__m256 _b = _mm256_setzero_ps();
Loading 5 elements into the __m256 register is a little trickier. If possible, declare the arrays with 8 elements. When the initializer list supplies fewer values than the array has elements, the remaining elements are initialized to 0.0f:
float a[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[8] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
If you can't declare the array with 8 elements, then I would probably try something like this with GCC and Clang:
__m256 _a = _mm256_setzero_ps(), _b = _mm256_setzero_ps();
memcpy(&_a, a, 5*sizeof(float));
memcpy(&_b, b, 5*sizeof(float));
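To check what ends up in each lane, the memcpy approach can be wrapped in a small helper that stores the vector back to a plain array. This is only a sketch: load_partial_zeroed is a hypothetical name, and the target attribute assumes GCC or Clang (it lets the function use AVX even when the file is not built with -mavx).

```cpp
#include <immintrin.h>
#include <cstddef>
#include <cstring>

// Hypothetical helper, not part of any library.
// Assumption: GCC or Clang. The target attribute enables AVX for this
// function only; drop it if the whole file is compiled with -mavx.
__attribute__((target("avx")))
void load_partial_zeroed(const float* src, std::size_t n, float out[8])
{
    // Start from all zeros so the lanes that are not copied stay 0.0f.
    __m256 v = _mm256_setzero_ps();

    // Copy only the n floats that really exist (caller must pass n <= 8).
    std::memcpy(&v, src, n * sizeof(float));

    // Store all 8 lanes so they can be inspected as ordinary floats.
    _mm256_storeu_ps(out, v);
}
```

Called with the 5-element array a from above, out should hold a[0] through a[4] followed by three zeros.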
You can also copy to an intermediate array and allow the compiler to optimize:
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
float t[8] = {0.0f};  // 8 floats, all zero; only the first 5 are overwritten below
memcpy(t, a, 5*sizeof(float));
__m256 _a = _mm256_loadu_ps(t);
memcpy(t, b, 5*sizeof(float));
__m256 _b = _mm256_loadu_ps(t);
(Editor's note: this will likely compile to about the same asm as memcpy into the __m256 object. With current compilers, it will actually copy to the stack and result in a store-forwarding stall when reloaded.)
A final possibility is loading one full __m128, setting the one element in a second __m128, and then combining the two __m128 into a __m256. I don't have a lot of experience with it, but this may do what you want. I did not test it:
float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
// _mm256_set_m128 takes (hi, lo): the high half comes first
__m256 _a = _mm256_set_m128 (_mm_load_ps1(a+4), _mm_loadu_ps(a+0));
__m256 _b = _mm256_set_m128 (_mm_load_ps1(b+4), _mm_loadu_ps(b+0));
The _mm_load_ps1 will broadcast the single element (a[4] or b[4]) into all four elements of the upper half. The last three elements will not be 0, but they won't be random garbage either. When you carry out your calculation, you treat them as "don't cares".
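A minimal sketch of the resulting lane layout, assuming GCC or Clang (for the target attribute) and a compiler recent enough to provide _mm256_set_m128. The helper name load5_broadcast_tail is made up for illustration. Note that _mm256_set_m128 takes the high half as its first argument and the low half as its second.

```cpp
#include <immintrin.h>

// Hypothetical helper, not part of any library.
// Assumption: GCC or Clang; drop the attribute if you build with -mavx.
__attribute__((target("avx")))
void load5_broadcast_tail(const float* src, float out[8])
{
    __m128 lo = _mm_loadu_ps(src);        // src[0..3]
    __m128 hi = _mm_load_ps1(src + 4);    // src[4] copied into all four lanes

    // Argument order is (hi, lo), so lo ends up in lanes 0..3.
    __m256 v = _mm256_set_m128(hi, lo);

    // Store all 8 lanes so the layout can be inspected.
    _mm256_storeu_ps(out, v);
}
```

For a = {1, 2, 3, 4, 5} the stored lanes are a[0] through a[4] followed by three more copies of a[4]. Those last three lanes are the "don't cares": they are safe to compute with, but they must be ignored in the result.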
If you truly need the last three elements to be 0.0f, then this should do, though I believe it will cost you two extra instructions compared with _mm_load_ps1 alone:
// x set to {5.0f, 0.0f, 0.0f, 0.0f}
__m128 x = _mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0);
The full statement for a would look like:
// _mm256_set_m128 takes (hi, lo): the high half comes first
__m256 _a = _mm256_set_m128 (_mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0),
                             _mm_loadu_ps(a+0));
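As an alternative that is not part of the statement above: _mm_load_ss loads a single float into the lowest element and zeroes the other three, so it may produce the same {a[4], 0, 0, 0} upper half without the insert. The sketch below assumes GCC or Clang for the target attribute; load5_zero_tail is a hypothetical name.

```cpp
#include <immintrin.h>

// Hypothetical helper, not part of any library.
// Assumption: GCC or Clang; drop the attribute if you build with -mavx.
__attribute__((target("avx")))
void load5_zero_tail(const float* src, float out[8])
{
    __m128 lo = _mm_loadu_ps(src);       // src[0..3]
    __m128 hi = _mm_load_ss(src + 4);    // {src[4], 0.0f, 0.0f, 0.0f}

    // Argument order is (hi, lo), so lo ends up in lanes 0..3.
    _mm256_storeu_ps(out, _mm256_set_m128(hi, lo));
}
```

Whether this is actually faster than the _mm_insert_ps version depends on the compiler and the surrounding code, so it is worth checking the generated asm.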
And before you exit your routine that processes the __m256 datatypes, you may need to call _mm256_zeroupper. See questions like "Using AVX CPU instructions: Poor performance without “/arch:AVX”" and "Using xmm parameter in AVX intrinsics".
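Here is a rough sketch of where such a call could go, using the memcpy approach from earlier for the partial loads. The routine add5 is hypothetical, and the target attribute assumes GCC or Clang.

```cpp
#include <immintrin.h>
#include <cstring>

// Hypothetical routine: adds two 5-element float arrays with one AVX add.
// Assumption: GCC or Clang; drop the attribute if you build with -mavx.
__attribute__((target("avx")))
void add5(const float* a, const float* b, float* result)
{
    __m256 va = _mm256_setzero_ps(), vb = _mm256_setzero_ps();

    // Partial loads: copy only the 5 floats that exist.
    std::memcpy(&va, a, 5 * sizeof(float));
    std::memcpy(&vb, b, 5 * sizeof(float));

    __m256 sum = _mm256_add_ps(va, vb);

    // Partial store: write back only the 5 valid lanes.
    std::memcpy(result, &sum, 5 * sizeof(float));

    // Clear the upper halves of the YMM registers before returning to
    // code that may use legacy SSE instructions.
    _mm256_zeroupper();
}
```

Compilers often emit vzeroupper on their own when AVX code generation is enabled, so the explicit call may turn out to be redundant. Inspect the generated asm if it matters.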
Regardless of what you decide, you should benchmark the performance of your application to see which is best for your program.
Also see the Intel Intrinsics Guide.