How can I write the equivalent of the following loop with AVX2 intrinsics? Assume that result_in_float is of type __m256 and that result is of type short int* or short int[8].

for(i = 0; i < 8; i++)
    result[i] = (short int)result_in_float[i];

I know that floats can be converted to 32-bit integers using the __m256i _mm256_cvtps_epi32(__m256 m1) intrinsic, but I have no idea how to convert these 32-bit integers further to 16-bit integers. I also want to store the resulting 16-bit integers to memory, and to do all of this with vector instructions.

Searching around the internet, I found an intrinsic named _mm256_mask_storeu_epi16, but I'm not sure whether it would do the trick, as I couldn't find an example of its usage.

pythonic

1 Answer

_mm256_cvtps_epi32 is a good first step. The conversion to a packed vector of shorts is a bit annoying, since it requires a cross-slice (cross-lane) shuffle, so it's good that it's not in a dependency chain here.

Since the values can be assumed to be in the right range (as per the comment), we can use _mm256_packs_epi32 instead of _mm256_shuffle_epi8 to do the conversion. Either way it's a 1-cycle instruction on port 5, but _mm256_packs_epi32 avoids having to get a shuffle mask from somewhere.
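For comparison, here is a rough sketch of what the _mm256_shuffle_epi8 route might look like, mainly to show the shuffle mask that _mm256_packs_epi32 lets you skip. The function name, the pointer-based signature, and the GCC/Clang target attribute are assumptions made for illustration; they are not part of the original answer.

```c
#include <immintrin.h>

/* Hypothetical helper: converts 8 floats to 8 shorts via a byte shuffle.
   The target attribute is an assumption for GCC/Clang so that this builds
   without -mavx2; drop it if you already compile with AVX2 enabled. */
#if defined(__GNUC__)
__attribute__((target("avx2")))
#endif
void floats8_to_shorts_shuffle(const float *src, short *dst)
{
    /* Within each 128-bit lane, keep the low two bytes of every 32-bit
       element and zero the upper half of the lane (an index of -1 has its
       high bit set, which makes the shuffle write a zero byte). */
    const __m256i mask = _mm256_setr_epi8(
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1,
        0, 1, 4, 5, 8, 9, 12, 13, -1, -1, -1, -1, -1, -1, -1, -1);

    __m256i tmp = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    tmp = _mm256_shuffle_epi8(tmp, mask);
    /* Bring the two useful 64-bit halves together in the low 128 bits. */
    tmp = _mm256_permute4x64_epi64(tmp, 0xD8);
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(tmp));
}
```

One behavioural difference to keep in mind: the shuffle simply drops the upper 16 bits, so out-of-range values wrap around, whereas _mm256_packs_epi32 saturates them to the range of a signed 16-bit integer.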

So, to put it together (not tested):

__m256i tmp = _mm256_cvtps_epi32(result_in_float);
tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256());
tmp = _mm256_permute4x64_epi64(tmp, 0xD8);
__m128i res = _mm256_castsi256_si128(tmp);
// then store res, e.g. _mm_storeu_si128((__m128i *)result, res),
// or _mm_store_si128 if result is 16-byte aligned

The last step (the cast) is free; it only changes the type.
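As a self-contained sketch, the steps above can be wrapped in a function that also performs the load and the store. The function name, the pointer-based signature, the unaligned load/store, and the GCC/Clang target attribute are assumptions added for illustration.

```c
#include <immintrin.h>

/* Hypothetical helper: converts 8 floats at src to 8 shorts at dst.
   The target attribute is an assumption for GCC/Clang so that this builds
   without -mavx2; drop it if you already compile with AVX2 enabled. */
#if defined(__GNUC__)
__attribute__((target("avx2")))
#endif
void floats8_to_shorts(const float *src, short *dst)
{
    /* float -> int32, rounded according to the current rounding mode. */
    __m256i tmp = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    /* int32 -> int16 with signed saturation; works per 128-bit lane, so
       the 64-bit pieces come out as [a0..a3, 0, a4..a7, 0]. */
    tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256());
    /* 0xD8 swaps the two middle 64-bit pieces: [a0..a3, a4..a7, 0, 0]. */
    tmp = _mm256_permute4x64_epi64(tmp, 0xD8);
    /* Unaligned store of the low 128 bits (8 shorts). */
    _mm_storeu_si128((__m128i *)dst, _mm256_castsi256_si128(tmp));
}
```

Note that _mm256_cvtps_epi32 rounds according to the current rounding mode (round-to-nearest by default), while the C cast in the question truncates toward zero. If exact agreement with the scalar loop matters for non-integer inputs, _mm256_cvttps_epi32 is the truncating counterpart.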

If you had two vectors of floats to convert, you could reuse most of the instructions, e.g. (not tested either):

__m256i tmp1 = _mm256_cvtps_epi32(result_in_float1);
__m256i tmp2 = _mm256_cvtps_epi32(result_in_float2);
tmp1 = _mm256_packs_epi32(tmp1, tmp2);
tmp1 = _mm256_permute4x64_epi64(tmp1, 0xD8);
// then store tmp1, e.g. _mm256_storeu_si256((__m256i *)result, tmp1),
// or _mm256_store_si256 if result is 32-byte aligned
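A possible self-contained version of the two-vector case, converting 16 floats at once, is sketched below. The function name, the signature, the unaligned accesses, and the GCC/Clang target attribute are illustrative assumptions.

```c
#include <immintrin.h>

/* Hypothetical helper: converts 16 floats at src to 16 shorts at dst.
   The target attribute is an assumption for GCC/Clang so that this builds
   without -mavx2; drop it if you already compile with AVX2 enabled. */
#if defined(__GNUC__)
__attribute__((target("avx2")))
#endif
void floats16_to_shorts(const float *src, short *dst)
{
    __m256i a = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    __m256i b = _mm256_cvtps_epi32(_mm256_loadu_ps(src + 8));
    /* The pack works per 128-bit lane, so the 64-bit pieces come out as
       [a0..a3, b0..b3, a4..a7, b4..b7]. */
    __m256i packed = _mm256_packs_epi32(a, b);
    /* 0xD8 swaps the two middle pieces: [a0..a3, a4..a7, b0..b3, b4..b7]. */
    packed = _mm256_permute4x64_epi64(packed, 0xD8);
    _mm256_storeu_si256((__m256i *)dst, packed);
}
```

Here the second pack operand carries real data instead of zeros, so the same pack and permute yield a full 256-bit result.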
harold
  • You, sir, are a genius :)! I tested your code and it worked! One correction though: instead of __mm256i or __mm128i, it should be __m256i and __m128i. The exact code I used is the following. __m256i tmp = _mm256_cvtps_epi32(result_in_float); tmp = _mm256_packs_epi32(tmp, _mm256_setzero_si256()); tmp = _mm256_permute4x64_epi64(tmp, 0xD8); – pythonic Dec 19 '16 at 19:18
  • Right, single `m` there, I'll change it – harold Dec 19 '16 at 19:21
  • @pythonic and harold: For a single vector, you don't need a zeroed temporary (and only need AVX1): `_mm256_cvtps_epi32`, then `_mm256_extractf128_si256` and a cast as inputs to 128-bit `_mm_packs_epi32`. (I wasn't sure that 256b [VCVTPS2DQ ymm](http://www.felixcloutier.com/x86/CVTPS2DQ.html) was in AVX1, but it is.) – Peter Cordes Dec 19 '16 at 20:47
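
The AVX1-only variant described in the last comment might be sketched as follows. The function name, the signature, and the GCC/Clang target attribute are assumptions made for illustration.

```c
#include <immintrin.h>

/* Hypothetical helper: converts 8 floats to 8 shorts using only AVX1.
   The target attribute is an assumption for GCC/Clang so that this builds
   without -mavx; drop it if you already compile with AVX enabled. */
#if defined(__GNUC__)
__attribute__((target("avx")))
#endif
void floats8_to_shorts_avx1(const float *src, short *dst)
{
    __m256i i32 = _mm256_cvtps_epi32(_mm256_loadu_ps(src));
    /* Low half: a free cast. High half: one 128-bit extract. */
    __m128i lo = _mm256_castsi256_si128(i32);
    __m128i hi = _mm256_extractf128_si256(i32, 1);
    /* The 128-bit pack puts lo's four values first, then hi's, so no
       permute and no zeroed temporary are needed. */
    _mm_storeu_si128((__m128i *)dst, _mm_packs_epi32(lo, hi));
}
```

Because the pack is done at 128-bit width, the two halves land in order directly, which is why neither the lane-fixing permute nor the zeroed second operand is required.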