
I'm playing around with the new AVX512 instruction sets, and I'm trying to understand how they work and how one can use them.

What I'm trying to do is interleave specific data, selected by a mask. My little benchmark loads x*32 bytes of aligned data from memory into two vector registers and compresses them using a dynamic mask (fig. 1). The resulting vector registers are then scattered to memory, so that the two vector registers end up interleaved (fig. 2).

Figure 1: Compressing the two data vector registers using the same dynamically created mask.

Figure 2: Scatter store to interleave the compressed data.
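
For reference, vindex holds uint32_t offsets and is used with a scale of 1, so each element of the first register lands two uint32_t slots apart and the second scatter (based at result + 1) fills the gaps. Here is a sketch of how such an index vector could be built; the concrete byte offsets are my reading of fig. 2, not taken verbatim from my code:

/* Hypothetical index vector for the scatter in fig. 2: byte offsets
   0x00, 0x08, 0x10, ... (with scale 1), i.e. every second uint32_t slot.
   _mm512_set_epi32 takes its arguments from element 15 down to element 0. */
__m512i vindex = _mm512_set_epi32( 0x78, 0x70, 0x68, 0x60,
                                   0x58, 0x50, 0x48, 0x40,
                                   0x38, 0x30, 0x28, 0x20,
                                   0x18, 0x10, 0x08, 0x00 );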

My code looks like the following:

#include <immintrin.h>
#include <cstdint>

void zipThem( uint32_t const * const data, __mmask16 const maskCompress, __m512i const vindex, uint32_t * const result ) {
   /* Initialize a vector register containing zeroes to get the store mask */
   __m512i zeroVec     = _mm512_setzero_epi32();
   /* Load data */
   __m512i dataVec_1   = _mm512_load_epi32( data );
   __m512i dataVec_2   = _mm512_load_epi32( data + 16 );
   /* Compress the data */
   __m512i compVec_1   = _mm512_maskz_compress_epi32( maskCompress, dataVec_1 );
   __m512i compVec_2   = _mm512_maskz_compress_epi32( maskCompress, dataVec_2 );

   /* Get the store mask by comparing the compressed register with the zero register (4 means !=, i.e. _MM_CMPINT_NE) */
   __mmask16 maskStore = _mm512_cmp_epi32_mask( zeroVec, compVec_1, 4 );

   /* Interleave the selected data */
   _mm512_mask_i32scatter_epi32(
      result,
      maskStore,
      vindex,
      compVec_1,
      1
   );
   _mm512_mask_i32scatter_epi32(
      result + 1,
      maskStore,
      vindex,
      compVec_2,
      1
   );
}
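
For context, a minimal driver for the benchmark described below could look like the following. The function name, ELEM_COUNT handling and the per-call output stride are my assumptions (each call writes into a fixed 32-element output window, cf. fig. 2); the real benchmark repeats this for different maskCompress values:

#include <cstddef>

/* Hypothetical benchmark driver: 32 input elements per call, and the
   scatter targets a fixed window of 32 uint32_t in the output. */
void runBenchmark( uint32_t const * const data, uint32_t * const result,
                   std::size_t const elemCount, __mmask16 const maskCompress,
                   __m512i const vindex ) {
   for( std::size_t i = 0; i < elemCount; i += 32 ) {
      zipThem( data + i, maskCompress, vindex, result + i );
   }
}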

I compiled everything with

-O3 -march=knl -lmemkind -mavx512f -mavx512pf
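
For reference, the full invocation might have looked like this (compiler and file names are placeholders):

g++ -O3 -march=knl -mavx512f -mavx512pf benchmark.cpp -lmemkind -o benchmark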

I call the method for 100'000'000 elements. To actually get an overview of the behaviour of the scatter store, I repeated this measurement with different values for maskCompress. I expected some kind of dependence between the execution time and the number of set bits within maskCompress, but I observed that the tests needed roughly the same time. Here is the result of the performance test:

Figure 3: Results of the measurements. The x-axis represents the number of written elements, depending on maskCompress. The y-axis shows the performance.

As one can see, the performance increases when more data is actually written to memory.

I did a little research and came across this: Instruction latency of AVX512. Following the given link, the latencies of the instructions used are constant. But to be honest, I am a little confused by this behaviour.

Following the answers from Christoph and Peter, I changed my approach a little. Since I have no idea how to use unpackhi / unpacklo to interleave sparse vector registers, I just combined the AVX512 compress intrinsic with a shuffle (vpermi2d):

int zip_store_vpermit_cnt(
  uint32_t const * const data, 
  int const compressMask, 
  uint32_t * const result,
  std::ofstream & log   
  ) {
  __m512i data1 = _mm512_undefined_epi32();
  __m512i data2 = _mm512_undefined_epi32();
  __m512i comp_vec1 = _mm512_undefined_epi32();
  __m512i comp_vec2 = _mm512_undefined_epi32();
  __mmask16 comp_mask = compressMask;
  __mmask16 shuffle_mask;
  uint32_t store_mask = 0;
  __m512i zero = _mm512_setzero_epi32(); /* reference value for the != compare below */
  __m512i shuffle_idx_lo = _mm512_set_epi32(
    23, 7, 22, 6, 
    21, 5, 20, 4,
    19, 3, 18, 2,
    17, 1, 16, 0 );
  __m512i shuffle_idx_hi = _mm512_set_epi32(
    31, 15, 30, 14,
    29, 13, 28, 12,
    27, 11, 26, 10,
    25, 9, 24, 8 );
  std::size_t pos = 0;
  int pcount = 0;
  int fullVec = 0;
  for( std::size_t i = 0; i < ELEM_COUNT; i += 32 ) {
    /* Load the current 2x16 elements and left-pack each half using the compress mask */
    data1 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i]) ) );
    data2 = _mm512_maskz_compress_epi32( comp_mask, _mm512_load_epi32( &(data[i+16]) ) );
    shuffle_mask = _mm512_cmp_epi32_mask( zero, data2, 4 ); /* currently unused; store_mask below is derived from the popcount instead */
    /* Interleaving the two vector register, depending on the compressMask */
    pcount = 2 * ( __builtin_popcount( comp_mask ) );
    store_mask = (uint32_t)( ( (uint64_t)1 << pcount ) - 1 ); /* low pcount bits set; avoids the float detour of std::pow */
    fullVec = pcount / 17;                                    /* 1 iff more than one full vector (pcount > 16) */
    comp_vec1 = _mm512_permutex2var_epi32( data1, shuffle_idx_lo, data2 );
    comp_vec2 = _mm512_permutex2var_epi32( data1, shuffle_idx_hi, data2 );
    _mm512_mask_storeu_epi32( &(result[pos]), (__mmask16)store_mask, comp_vec1 );
    pos += (fullVec) * 16 + ( ( 1 - ( fullVec ) ) * pcount ); // same as pos += ( pcount >= 16 ) ? 16 : pcount;
    _mm512_mask_storeu_epi32( &(result[pos]), (__mmask16)(store_mask >> 16), comp_vec2 );
    pos += ( fullVec ) * ( pcount - 16 );                     // same as pos += ( pcount >= 16 ) ? pcount - 16 : 0;

    // a plain _mm512_store_epi32 segfaults here, because result[pos] is no longer 64-byte aligned :(

  }
  return pos;
}

That way the sparse data within the two vector registers can be interleaved. Unfortunately, I have to calculate the mask for the store manually, which seems to be quite expensive. One could use a LUT to avoid the calculation, but I don't think that is the way it should be done.
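
For what it's worth, the store masks can be derived branch-free without a LUT. A minimal sketch, assuming BMI2 is available for _bzhi_u32 (the helper name and signature are mine, not part of the code above):

/* Derive both 16-bit store masks from the compress mask without a LUT.
   Needs <immintrin.h> and a BMI2-capable target. */
static inline void storeMasks( __mmask16 comp_mask,
                               __mmask16 * store_lo, __mmask16 * store_hi ) {
   unsigned pcount = 2u * (unsigned)__builtin_popcount( (unsigned)comp_mask );
   uint32_t mask32 = _bzhi_u32( 0xFFFFFFFFu, pcount ); /* low pcount bits set; BZHI also handles pcount == 32 */
   *store_lo = (__mmask16)( mask32 );       /* first 16 output lanes  */
   *store_hi = (__mmask16)( mask32 >> 16 ); /* second 16 output lanes */
}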

Figure 4: Results of the performance test of 4 different kinds of store.

I know that this is not the usual way, but I have 3 questions related to this topic and I am hopeful that someone can help me out.

  1. Why does a masked store with only one set bit need the same time as a masked store where all bits are set?

  2. Does anyone have experience with, or good documentation on, the behaviour of the AVX512 scatter store?

  3. Is there an easier or more performant way to interleave two vector registers?

Thanks for your help!

Sincerely

  • I have related problems in my current code and I will try to write an answer if I find the time (and Peter Cordes isn't faster than me). Short notes: scatter uses 2 microops for every store (and Peter and I were pretty sure that they can't be eliminated even if the bit is not set). Interleaving can be done with two `vpermi2d`. [My repo](https://github.com/akamiru/bce-next/blob/master/encode_main.cpp#L171) has some code to pack and interleave. I found it faster to compress the interleave mask and then use `vpermi2d`, because compress is 2 microops where `vpermi2d` is 1, and I can reuse the mask. – Christoph Diegelmann Sep 05 '17 at 11:14
  • This seems promising. But my problem is that I do not know how many elements I want to interleave and store. If I use a shuffle, I have to determine whether the data is within the boundaries of a 512-bit vector register. That is why I used the compression. Maybe AVX512PF speeds it up a little bit? I will test it. – Hymir Sep 05 '17 at 14:20
  • The trick is to simply write out "trash" and then overwrite it with the extra interleaved values. This requires a single additional 32 bytes after the end to be able to write safely, but that usually shouldn't be a problem. You could also interleave first, double the mask bits with pdep and then compress-store them. I think I will find time to add an answer this weekend ;) – Christoph Diegelmann Sep 05 '17 at 20:07
  • This only works for contiguous data within the vector registers, doesn't it? But if one wants to sparsely interleave, everything becomes a bit more complicated. I will post a solution if I find one. – Hymir Sep 05 '17 at 20:27
  • @Christoph: The OP is on KNL, not SKX, so scatter is only 4 uops. But yeah, it's expected that it's not a lot faster with fewer elements, especially if you aren't touching fewer cache lines. As you say, don't use scatter for interleaving vectors, use ALU shuffles! AVX512 has fantastic 2-input lane-crossing shuffles like `vpermt2d` / `vpermi2d`. – Peter Cordes Sep 07 '17 at 01:33
  • @Hymir: I only skimmed your code / problem so maybe this doesn't apply, but try to use `popcnt` on the compress mask or something to find out how many valid elements you have. Increment your pointer by that much. So you always do full-width stores, but they overlap with the previous one by a variable amount that depends on the packing. (Like you do for left-packing / filtering an array into another array: https://stackoverflow.com/questions/36932240/avx2-what-is-the-most-efficient-way-to-pack-left-based-on-a-mask) – Peter Cordes Sep 07 '17 at 01:35
  • @PeterCordes: I did the trick with the popcnt anyway, because I have no other way to determine how many values are written, regardless of whether I use a scatter store or some kind of contiguous store. – Hymir Sep 07 '17 at 16:20
  • Can you tell me more about vindex in your pictures? Does it actually do anything? Are those byte offsets? If so, are you aware that indices 8 and 10 would mean that it's possible to partially overwrite the previously written value? If they are uint32_t offsets, doing something other than a scatter might still be worth it, but that is only possible if vindex is a constant. – Christoph Diegelmann Sep 12 '17 at 14:18
  • @Christoph: vindex is a constant with uint32_t offsets indeed (as shown in fig. 2). The values are hex values; maybe that leads to some confusion. And yes, it does exactly what is shown in figure 2: the two vector registers are interleaved. This works because I "compress" the two vectors first, using the compressMask. This way I can be sure that all values I want to store lie contiguously in the two vector registers. Why should 8 and 10 be overwritten? – Hymir Sep 12 '17 at 15:08
  • I thought it might be decimal byte offsets, which would result in 10 overwriting 2 bytes of 8. So basically you're trying to do left-packing by mask with 7 elements untouched between each value. I guess it won't get faster than doing scatters then. – Christoph Diegelmann Sep 13 '17 at 14:41
  • If you are just interleaving data from two vectors, then it is faster to use permutation instructions and save the interleaved result as full-size vectors. Scatter instructions are slower than permute instructions, but they are useful for writing sparse data structures. – A Fog Oct 06 '20 at 11:45
