
According to the documentation, gcc supports the AVX-512 instruction set from version 4.9 onwards, but I have gcc 4.8. I currently have code like this for summing up a block of memory (it's guaranteed to be less than 256 bytes, so there are no overflow worries):

__m128i sum = _mm_setzero_si128();
sum = _mm_add_epi16(sum, _mm_cvtepu8_epi16(*(__m128i *) &mem));

Now, looking through the documentation, if we have, say, four bytes left over, I could use:

sum = _mm_add_epi16(sum,
                    _mm_mask_cvtepu8_epi16(_mm_set1_epi16(0),
                                           (__mmask8) _mm_set_epi16(0,0,0,0,1,1,1,1),
                                           *(__m128i *) &mem));

(Note: the type of __mmask8 doesn't seem to be documented anywhere I can find, so the cast above is a guess...)

However, _mm_mask_cvtepu8_epi16 is an AVX-512 intrinsic, so is there a way to duplicate its effect without AVX-512? I tried:

_mm_mullo_epi16(_mm_set_epi16(0,0,0,0,1,1,1,1),
                _mm_cvtepu8_epi16(*(__m128i *) &mem));

However, there was a cache stall, so a plain for (int i = 0; i < remaining_bytes; i++) sum += mem[i]; loop gave better performance.

Ken Y-N
  • It's not clear to me what you are after. Do you have AVX512 hardware but not a compiler that supports it? Note that the 128b (e.g. `_mm_mask_cvtepu8_epi16`) and 256b mask operations require `AVX-512VL`, which KNL does not have. Do you want an SSE-only solution? – Z boson Mar 06 '17 at 08:22
  • What's wrong with `for (int i = 0; i < remaining_bytes; i++) sum += mem[i]`? – Z boson Mar 06 '17 at 08:35
  • @Zboson, yes, I cannot easily upgrade the compiler at this point. There's nothing particularly wrong with the simple `for` loop, but I just wondered if there was a better way, as the actual loop body is a bit more complicated than just summation - I've got a circle and am calculating the relative weights between the (left and right) and (top and bottom) halves. – Ken Y-N Mar 06 '17 at 09:11
  • What AVX512 hardware do you have? What exactly is your hardware? – Z boson Mar 06 '17 at 09:15
  • @Zboson I have a Core i7-4810MQ; that would appear not to support AVX512... Therefore I would like to find a relatively efficient way of simulating a masked load. – Ken Y-N Mar 06 '17 at 09:22
  • @KenY-N: your CPU is a Haswell, which has AVX2, so you can use AVX2 masked loads, e.g. [_mm_maskload_epi32](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2&text=_mm_maskload_epi32&expand=3274,3274), if that helps? – Paul R Mar 06 '17 at 09:26
  • You're summing unsigned bytes, so you should use `_mm_sad_epu8(_mm_load_si128(mem), _mm_setzero_si128())` instead of `pmovzx` + `paddw`. – Peter Cordes Sep 10 '17 at 19:34
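
(For reference, a minimal compilable sketch of that psadbw suggestion; the function name and the assumption that a full 16 bytes are readable from mem are illustrative, not from the comment:)

#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Sum 16 unsigned bytes: psadbw against zero produces the sum of each
   8-byte half in a 64-bit lane; then add the two lanes together. */
static inline uint32_t sum16_u8(const uint8_t *mem)
{
    __m128i v    = _mm_loadu_si128((const __m128i *) mem);
    __m128i sums = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t) (_mm_cvtsi128_si32(sums) +
                       _mm_cvtsi128_si32(_mm_srli_si128(sums, 8)));
}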

1 Answer


As I happened to stumble across this question, and it still hasn't gotten an answer, in case this is still a problem...

For your example problem, you're on the right track.

  • Multiplication is a relatively slow operation, so you should avoid _mm_mullo_epi16. Bitwise AND is much faster, so use _mm_and_si128 instead, e.g. _mm_and_si128(_mm_cvtepu8_epi16(*(__m128i *) &mem), _mm_set_epi32(0, 0, -1, -1)), which keeps only the four low 16-bit elements.
  • I'm not sure what you mean by a cache stall, but if memory access is a bottleneck and the compiler won't put the constant above into a register, you could zero the upper half with a shift pair instead, _mm_srli_si128(_mm_slli_si128(vector, 8), 8), which doesn't need any additional registers/memory loads. The two shifts may still be slower than an AND.
  • If it's always the low 8 bytes you want to keep, you can use _mm_move_epi64, which zeroes the upper half in a single instruction.
  • None of this solves the case where the remainder isn't a fixed number of elements (e.g. you have n%16 bytes for some arbitrary n). Note that AVX-512 doesn't really solve it either. If you need to deal with this case, you could keep a table of masks and AND with the appropriate one, e.g. _mm_and_si128(vector, masks[n & 0xf]) (see the sketch after this list).
  • (_mm_mask_cvtepu8_epi16 only cares about the low half of the source vector, so your example is somewhat confusing - that is, you don't need to mask anything there, because the later elements are completely ignored anyway)
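
A minimal sketch of that mask-table idea (the array name, the sliding-window layout and the helper function are my own illustration, assuming a full 16-byte read from mem cannot fault, as in the question's setup):

#include <stddef.h>
#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

/* Sliding window of bytes: reading 16 bytes at offset (16 - k) yields
   a mask whose low k bytes are 0xFF and the rest zero, for k = 0..16. */
static const uint8_t mask_table[32] = {
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,0xFF,
    0,   0,   0,   0,   0,   0,   0,   0,
    0,   0,   0,   0,   0,   0,   0,   0
};

/* Load 16 bytes from mem and zero everything past the first k bytes. */
static inline __m128i load_low_k_bytes(const uint8_t *mem, size_t k)
{
    __m128i mask = _mm_loadu_si128((const __m128i *) (mask_table + 16 - k));
    return _mm_and_si128(_mm_loadu_si128((const __m128i *) mem), mask);
}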

On a more general level, mask operations are really just an embedded _mm_blend_epi16 (or equivalent). For zeroing idioms, they can easily be emulated with _mm_and_si128 / _mm_andnot_si128, as shown above.
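
For illustration, that AND/ANDNOT emulation of a blend as a compilable sketch (the function name is my own):

#include <emmintrin.h>  /* SSE2 */

/* Bitwise blend: takes bits of a where mask is set and bits of b where
   it is clear - the AND/ANDNOT equivalent of a masked operation. */
static inline __m128i blend128(__m128i mask, __m128i a, __m128i b)
{
    return _mm_or_si128(_mm_and_si128(mask, a),
                        _mm_andnot_si128(mask, b));
}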

zinga
  • AVX512 does kind of solve it. You can do `__mmask32 mask = (uint32_t)-1UL >> ((32-bytes_left)&31)`, which will compile to an integer shift and a `kmov`. Of course you can generate a mask in AVX2 as well, using various techniques. See https://stackoverflow.com/questions/34306933/vectorizing-with-unaligned-buffers-using-vmaskmovps-generating-a-mask-from-a-m – Peter Cordes Sep 10 '17 at 12:49
  • Good point, hadn't thought about that. Thanks for the correction! – zinga Sep 10 '17 at 22:13