I am working with the Xeon Phi Knights Landing (KNL). I need to do a gather operation from an array of doubles, where the list of indices comes from an array of chars. The gather intrinsics are either `_mm512_i32gather_pd` or `_mm512_i64gather_pd`. As I understand it, I either need to convert eight chars to eight 32-bit integers, or eight chars to eight 64-bit integers. I have gone with the first choice, for `_mm512_i32gather_pd`.

I have created two functions, `get_index` and `get_index2`, to convert eight chars to a `__m256i`. The assembly for `get_index` is simpler than for `get_index2` (see https://godbolt.org/z/lhg9fX). However, in my code `get_index2` is significantly faster. Why is this? I am using ICC 18. Is there perhaps a better solution than either of these two functions?

#include <x86intrin.h>
#include <inttypes.h>

__m256i get_index(char *index) {
  // Load eight chars as one 64-bit integer.
  int64_t x = *(int64_t *)&index[0];
  // Shuffle control: within each 128-bit lane, move byte i into the low
  // byte of dword i and zero the other three bytes (0x80 writes zero).
  const __m256i t3 = _mm256_setr_epi8(
    0,0x80,0x80,0x80,
    1,0x80,0x80,0x80,
    2,0x80,0x80,0x80,
    3,0x80,0x80,0x80,
    4,0x80,0x80,0x80,
    5,0x80,0x80,0x80,
    6,0x80,0x80,0x80,
    7,0x80,0x80,0x80);

  // Broadcast the eight chars to both 128-bit lanes, then shuffle so the
  // low lane holds chars 0-3 and the high lane holds chars 4-7 as dwords.
  __m256i t2 = _mm256_set1_epi64x(x);
  __m256i t4 = _mm256_shuffle_epi8(t2, t3);
  return t4;
}

__m256i get_index2(char *index) {
  // Same shuffle control as in get_index.
  const __m256i t3 = _mm256_setr_epi8(
    0,0x80,0x80,0x80,
    1,0x80,0x80,0x80,
    2,0x80,0x80,0x80,
    3,0x80,0x80,0x80,
    4,0x80,0x80,0x80,
    5,0x80,0x80,0x80,
    6,0x80,0x80,0x80,
    7,0x80,0x80,0x80);
  // Load eight chars with movq, duplicate the low 128 bits into both
  // lanes with vinserti128, then shuffle as above.
  __m128i t1 = _mm_loadl_epi64((__m128i*)index);
  __m256i t2 = _mm256_inserti128_si256(_mm256_castsi128_si256(t1), t1, 1);
  __m256i t4 = _mm256_shuffle_epi8(t2, t3);
  return t4;
}
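
For context, here is roughly how the index vector is consumed; the `src` array and the `gather8` wrapper below are stand-ins for my actual code:

__m512d gather8(const double *src, char *index) {
  __m256i idx = get_index(index);           // eight 32-bit indices
  return _mm512_i32gather_pd(idx, src, 8);  // scale 8 = sizeof(double)
}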

– Z boson
  • KNL has *very* slow 256-bit `vpshufb ymm` (12 uops, 23c latency, 12c throughput), and 128-bit XMM is slow, too. (MMX is fast :P). See Agner Fog's tables. Why can't you use [`vpmovzxbd` or `vpmovzxbq`](http://felixcloutier.com/x86/PMOVZX.html) like a normal person? `__m512i _mm512_cvtepu8_epi32(__m128i a)` or `_mm256_cvtepu8_epi32`. Those are all single-uop with 2c throughput. – Peter Cordes Nov 24 '18 at 18:28
  • That doesn't explain your results, though. What loop did these functions inline into? Are you sure they didn't optimize differently somehow given different surrounding code? Otherwise IDK why a load + insert would be faster than a qword broadcast-load. Maybe some kind of front-end effect? Again we'd need to see the whole loop to guess about the front-end. – Peter Cordes Nov 24 '18 at 18:34
  • @PeterCordes, thank you for pointing out `_mm256_cvtepu8_epi32`; that's exactly what I want, though in my code the result is no faster than `get_index2`. Maybe ICC converts `get_index2` to `vpmovzxbd` in my code anyway. I did not think of this because I'm a bit rusty with vectorization. But now I get about a 4x improvement with manual vectorization compared to ICC auto-vectorization (with `#pragma ivdep`). I'm vectorizing [stencil code](https://en.wikipedia.org/wiki/Stencil_code). – Z boson Nov 26 '18 at 12:10
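
A minimal sketch of the `vpmovzxbd` approach suggested in the comments above (the name `get_index3` is not from the original post):

__m256i get_index3(char *index) {
  // movq: load eight chars into the low 64 bits of an XMM register;
  // vpmovzxbd then zero-extends each byte to a 32-bit integer.
  __m128i t1 = _mm_loadl_epi64((__m128i*)index);
  return _mm256_cvtepu8_epi32(t1);
}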

0 Answers