I am trying to use AVX2 intrinsic functions with C++, working with floats (__m256), so 8 floats fit in a register. But what happens if I have fewer than 8 floats, say 5? In that case, the remaining 3 elements contain garbage values.

float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};

__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);

__m256 _c = _mm256_div_ps(_a, _b);

for(int i=0; i<8; ++i)
    cout << _c[i] << endl;   // note: indexing a __m256 like this is a GCC/Clang extension

The result that I get is in the screenshot below:

[Screenshot: Result]

Is there any way I can set the last 3 numbers in the result to 0? I don't want to run a loop, since that would defeat the purpose of using AVX. Also, the number of floats (5 in this case) is variable.

I am new to AVX and would really like some help.

In the context of the larger problem, I read the arrays from a data stream, so I don't know their size beforehand and can't pad them with 0 at the end without running a loop.

  • float a[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f, 0.0f, 0.0f, 0.0f}; You might have to initialise the last 3 members of b to 1.0f – James Nov 17 '19 at 16:18
  • Don't provide a smaller array to load than the register size. – Shawn Nov 17 '19 at 16:19
  • Hello James, thanks a lot for the quick response. In the larger context of the problem, the arrays are loaded from a stream and I don't know the size beforehand. So appending 0 at the end would require a loop over (8 - arraysize) elements, which is something I want to avoid. Is there any other solution? – nRoy Nov 17 '19 at 16:24
  • If your stream is non-stop, then you can wait until you have 8 floats to fill your register. That is, process your stream in blocks of 8. – Javier Silva Ortíz Nov 17 '19 at 16:33
  • Nope, can't wait till there are 8 – nRoy Nov 17 '19 at 16:37
  • You can use padding (small amount of valid data inside a longer buffer), but you can't safely load from a 5-element array. It might end right before an unmapped page. What do you *really* want to do with the result? Your current code does loop over it. (Well technically you can use a masked load, `vmaskmovps`, but then you need to convert an integer to a mask; a sketch of this follows these comments. related: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](//stackoverflow.com/q/34306933)) – Peter Cordes Nov 17 '19 at 21:09
  • Using AVX this way is nonsensical. Using SSE would be nonsensical already for 5 elems at a time, but switching to AVX will give you a warmup which is about 50,000 times as long as it takes with a simple C loop. – Damon Nov 17 '19 at 22:07
  • @Damon I would have expected that AVX just ran a loop in microcode... Obviously I don't know what I'm talking about... I would be interested in seeing an answer that says this which explains "warmup" – Jerry Jeremiah Nov 18 '19 at 00:57
  • @JerryJeremiah: If the current turbo is above the AVX ceiling, running 256-bit vector instructions will limit their throughput to 1 per 4 clocks or something, until after 50k cycles the CPU will downclock and let them run at full throughput. Or something like that. https://www.agner.org/optimize/blog/read.php?i=415 reports seeing it on SKL. My explanation of the mechanism is a guess based on understanding gained from SKX AVX512 ([SIMD instructions lowering CPU frequency](//stackoverflow.com/q/56852812)) soft transition. It might be nothing to do with powering down upper 128. – Peter Cordes Nov 18 '19 at 01:46
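
(Editor's note: a minimal sketch of the masked-load idea mentioned in the comments above, for a count n known only at run time. The load_partial helper name and the compare-against-an-index-vector trick for building the mask are illustrative assumptions, not something from the original post; _mm256_cmpgt_epi32 requires AVX2.)

#include <immintrin.h>

// Hypothetical helper: load the first n floats (0 <= n <= 8) from p,
// leaving the masked-out lanes as 0.0f. Masked-out elements are not
// read from memory, so nothing past p[n-1] can fault.
static __m256 load_partial(const float* p, int n)
{
    const __m256i idx  = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    const __m256i mask = _mm256_cmpgt_epi32(_mm256_set1_epi32(n), idx); // lane i enabled if i < n
    return _mm256_maskload_ps(p, mask);
}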

1 Answer

float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};

__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);

This is undefined behavior: _mm256_loadu_ps reads 8 floats, so you are reading past the end of the 5-element arrays.

You can clear all the elements in _a and _b with _mm256_setzero_ps():

__m256 _a = _mm256_setzero_ps();
__m256 _b = _mm256_setzero_ps();

Loading only 5 elements into the __m256 register is a little trickier. If possible, declare the arrays with 8 elements; C++ value-initializes the unlisted elements to 0.0f.

float a[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[8] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
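
(Editor's note: for the division in the question specifically, padding b with 1.0f rather than 0.0f, as James suggested in the comments, keeps the upper lanes of the result at 0.0f instead of the NaN that 0.0f / 0.0f would produce. A minimal sketch:)

float a[8] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};                    // a[5..7] value-initialized to 0.0f
float b[8] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f, 1.0f, 1.0f, 1.0f};  // pad the divisor with 1.0f

__m256 _a = _mm256_loadu_ps(a);
__m256 _b = _mm256_loadu_ps(b);
__m256 _c = _mm256_div_ps(_a, _b);   // lanes 5..7 compute 0.0f / 1.0f == 0.0f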

If you can't declare the arrays with 8 elements, then I would probably try something like this with GCC and Clang:

__m256 _a = _mm256_setzero_ps(), _b = _mm256_setzero_ps();
memcpy(&_a, a, 5*sizeof(float));
memcpy(&_b, b, 5*sizeof(float));
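
(Editor's note: since the element count is only known at run time in the asker's case, the same zero-then-memcpy idea works with a variable count; a minimal sketch, where load_first_n is a hypothetical helper name:)

#include <immintrin.h>
#include <cstring>

// Hypothetical helper: zero the vector, then copy in the n valid floats (n <= 8);
// the remaining lanes stay 0.0f.
static __m256 load_first_n(const float* src, std::size_t n)
{
    __m256 v = _mm256_setzero_ps();
    std::memcpy(&v, src, n * sizeof(float));
    return v;
}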

You can also copy to an intermediate array and allow the compiler to optimize:

float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};
float t[8] = {0.0f};

memcpy(t, a, 5*sizeof(float));
__m256 _a = _mm256_loadu_ps(t);
memcpy(t, b, 5*sizeof(float));
__m256 _b = _mm256_loadu_ps(t);

(Editor's note: this will likely compile to about the same asm as memcpy into the __m256 object. With current compilers, it will actually copy to the stack and result in a store-forwarding stall when reloaded.)


A final possibility is loading one full __m128 from the first four elements, putting the fifth element into a second __m128, and then combining the two __m128 halves into a __m256. I don't have a lot of experience with this, but it may do what you want. I did not test it:

float a[5] = {1.0f, 2.0f, 3.0f, 4.0f, 5.0f};
float b[5] = {2.0f, 3.0f, 4.0f, 5.0f, 6.0f};

__m256 _a = _mm256_set_m128 (_mm_load_ps1(a+4), _mm_loadu_ps(a+0));   // _mm256_set_m128(hi, lo)
__m256 _b = _mm256_set_m128 (_mm_load_ps1(b+4), _mm_loadu_ps(b+0));

_mm_load_ps1 broadcasts the element it loads (a[4] or b[4]) into all four lanes of that __m128, which become the upper four lanes of the __m256. Those lanes will not be 0, but they won't be random garbage either. When you carry out your calculation, you treat them as "don't cares".

If you truly need the last three elements of the loaded vector to be 0.0f, then the following should do it, though I believe it will cost you two extra instructions as opposed to _mm_load_ps1. (Keep in mind that if both a and b are padded with zeros, the division in those lanes yields 0.0f / 0.0f = NaN, as noted in the comments.)

// x set to {5.0f, 0.0f, 0.0f, 0.0f}
__m128 x = _mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0);

The full statement for a would look like:

__m256 _a = _mm256_set_m128 (_mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0),
    _mm_loadu_ps(a+0));

And before you exit your routine that processes the __m256 datatypes, you may need to call _mm256_zeroupper. See questions like Using AVX CPU instructions: Poor performance without “/arch:AVX” and Using xmm parameter in AVX intrinsics.

Regardless of what you decide, you should benchmark the performance of your application to see which is best for your program.

Also see the Intel Intrinsics Guide.

  • If you have a compile-time-constant length, it's going to be much better to use `_mm256_maskload_ps` than to get the compiler to make some nasty asm that creates a store-forwarding stall. That instruction isn't wonderful but it's not terrible either. Also, your first version with a memcpy directly into `__m256 _a` leaves the high elements uninitialized so it doesn't answer the question. You could also `vblendps` or `vandps` to clear high bits after a load, if for some other reason you know you can't actually fault. – Peter Cordes Nov 18 '19 at 04:47
  • Isn't `_mm_insert_ps(_mm_setzero_ps(), _mm_load_ps1(a+4), 0)` just `_mm_load_ss(a+4)`? At least clang seems to generate the same code. – Marc Glisse Nov 18 '19 at 07:35 (a sketch using this follows below)
  • Yes, you are right Marc. I was looking for that intrinsic but did not find it on my pass through the Intel Intrinsic Guide. – jww Nov 18 '19 at 09:00
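
(Editor's note: putting Marc Glisse's suggestion into a complete load, a minimal, untested sketch of a zero-padded 5-element load might look like this; load5 is a hypothetical helper name:)

#include <immintrin.h>

// _mm_load_ss loads one float into lane 0 and zeroes lanes 1..3, so the upper
// half of the __m256 becomes {p[4], 0, 0, 0} and the result is
// {p[0], p[1], p[2], p[3], p[4], 0.0f, 0.0f, 0.0f}.
static __m256 load5(const float* p)
{
    return _mm256_set_m128(_mm_load_ss(p + 4), _mm_loadu_ps(p));
}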