
In the following code, I can use AVX2 to count the number of 1 bits in each position separately, 16 bits at a time, but there are 4 missing instructions on the lines labelled loadLow16. I need an instruction that loads a 16-bit value and broadcasts it into every 16-bit lane of the AVX2 register (16 copies). Is there an instruction that does this, or is there a better way to do it?

void countHistBits4(const uint64_t p[], uint32_t n, uint32_t hist[64]) {
    uint16_t masks[16] = {1, 1<<1, 1<<2, 1<<3, 1<<4, 1<<5, 1<<6, 1<<7,
                          1<<8, 1<<9, 1<<10, 1<<11, 1<<12, 1<<13, 1<<14, 1<<16};
    __m256i mask = _mm256_loadu_si256((const __m256i*)masks);
    __m256i count1 = _mm256_setzero_si256();
    __m256i count2 = _mm256_setzero_si256();
    __m256i count3 = _mm256_setzero_si256();
    __m256i count4 = _mm256_setzero_si256();
    for (uint32_t i = 0; i < n; i++) {
      __m256i v1 = loadLow16(p[i] & 0xFFFF);
      __m256i v2 = loadLow16((p[i] >> 16) & 0xFFFF);
      __m256i v3 = loadLow16((p[i] >> 32) & 0xFFFF);
      __m256i v4 = loadLow16((p[i] >> 48) & 0xFFFF);
      v1 = _mm256_and_si256(v1, mask);
      count1 = _mm256_adds_epi16(count1, v1);
      v2 = _mm256_and_si256(v2, mask);
      count2 = _mm256_adds_epi16(count2, v2);
      v3 = _mm256_and_si256(v3, mask);
      count3 = _mm256_adds_epi16(count3, v3);
      v4 = _mm256_and_si256(v4, mask);
      count4 = _mm256_adds_epi16(count4, v4);
    }
}
Dov
  • _mm256_set1_epi16 – jbapple Jun 19 '21 at 14:13
  • `vpbroadcastw` works with a memory or xmm source. https://www.felixcloutier.com/x86/vpbroadcastb:vpbroadcastw:vpbroadcastd:vpbroadcastq – Peter Cordes Jun 19 '21 at 14:33
  • Also, see https://github.com/mklarqvist/positional-popcount for efficient SIMD positional-popcount, especially over arrays with multiple vectors of data, for chunk sizes including 16-bit. It's possible to do somewhat better than this by maintaining higher information density (not masking down to only 0 / 1 at the bottom of a 16-bit element), meaning you get more work out of each element. – Peter Cordes Jun 19 '21 at 14:34
  • Also, you'd need a variable-count shift before the `adds` for this to really work; for example the `1<<15` mask (which you typoed as `1<<16`, which overflows a uint16_t) will produce a `0` or `1<<15`. Adding two of those will signed-saturate to `0x8000` (INT16_MIN), i.e. the masked original value, so you can't usefully add at all for that top element without a shift! – Peter Cordes Jun 19 '21 at 14:42
  • Possible duplicates for the positional-popcount part: [AVX2 column population count algorithm over each bit-column separately](https://stackoverflow.com/q/58486138) / [Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2](https://stackoverflow.com/q/55081525). But I haven't looked to see whether techniques that work for 64 separate counts are appropriate / optimal for 16 separate counts; certainly you could just add 4 counts down to 1 to get 16-bit positional from 64-bit. But it's probably best to look at that github link. – Peter Cordes Jun 19 '21 at 14:49

1 Answer


For your overall positional-popcount problem, see https://github.com/mklarqvist/positional-popcount for heavily optimized implementations, which are also correct, unlike this code, which you obviously haven't had time to debug yet since you were missing a building block. Accumulating multiple x & (1<<15) results into an int16_t element is going to saturate right away, so you'd need something extra, perhaps a variable-count shift, or a compare like (x & mask) == mask. Or, probably better, a total redesign. Related SO Q&As (also linked in the comments above): [AVX2 column population count algorithm over each bit-column separately](https://stackoverflow.com/q/58486138) and [Count each bit-position separately over many 64-bit bitmasks, with AVX but not AVX2](https://stackoverflow.com/q/55081525).
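
As a rough sketch of that compare idea (not the optimized approach from the positional-popcount repo), and reusing the question's v1 / mask / count1 names with the 1<<15 mask element fixed, each lane can be turned into an exact 0-or-1 increment like this:

      // Sketch: _mm256_cmpeq_epi16 yields 0 or 0xFFFF (-1) per 16-bit lane,
      // so subtracting the compare result adds 0 or 1 to each counter lane.
      __m256i bits1    = _mm256_and_si256(v1, mask);
      __m256i matched1 = _mm256_cmpeq_epi16(bits1, mask);
      count1 = _mm256_sub_epi16(count1, matched1);   // count1 += (bit set ? 1 : 0)

Each 16-bit counter lane then grows by at most 1 per input word, so it only wraps after 65535 iterations; an outer loop would still need to drain the counts into hist[] before that.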


The title question: broadcast a uint16_t

The instruction is vpbroadcastw. It works with a memory or xmm source. On Intel CPUs, it decodes to a load and a shuffle (port 5) uop, unlike 32, 64, or 128-bit broadcasts which are handled purely in the load port.

The intrinsics for it are:

  • __m256i _mm256_set1_epi16( int16_t ) - if you only have a scalar.
  • __m256i _mm256_broadcastw_epi16 (__m128i a) - to broadcast the bottom element of a vector.
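
With the first of those, the question's missing loadLow16 building block could be filled in with something like this sketch (loadLow16 is the question's placeholder name, not a real intrinsic):

#include <immintrin.h>
#include <stdint.h>

// Broadcast one 16-bit chunk into all 16 lanes of a __m256i.
// With a scalar argument this typically compiles to vmovd + vpbroadcastw,
// or to a single vpbroadcastw load when the compiler can keep the source in memory.
static inline __m256i loadLow16(uint16_t chunk) {
    return _mm256_set1_epi16((int16_t)chunk);
}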

To avoid violating the strict-aliasing rule in C, you're correct that accessing uint64_t p[] elements and masking them is a safe approach, while pointing a uint16_t * at them wouldn't be (if you dereference it normally; unfortunately there's no 16-bit load intrinsic that hides the dereference inside an aliasing-safe intrinsic, so you'd have to memcpy into a uint16_t tmp var or something).
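
The memcpy version could look roughly like this (a sketch; the helper name load_u16 is made up here, and compilers optimize the 2-byte memcpy down to a plain word load):

#include <string.h>
#include <stdint.h>

// Aliasing-safe 16-bit load from an arbitrary byte offset.
static inline uint16_t load_u16(const void *src) {
    uint16_t tmp;
    memcpy(&tmp, src, sizeof tmp);
    return tmp;
}

      // usage inside the loop, e.g. the second 16-bit chunk of p[i]:
      __m256i v2 = _mm256_set1_epi16((int16_t)load_u16((const char*)&p[i] + 2));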

Modern GCC is smart enough to compile __m256i v4 = _mm256_set1_epi16((p[i] >> 48) & 0xFFFF); into vpbroadcastw ymm0, WORD PTR [rdi+6+rdx*8], not doing anything stupid like an actual 64-bit scalar shift and then vmovd + xmm-source broadcast. (even with only -Og https://godbolt.org/z/W6o5hKTbz)

But that's when only using one of the counts, with the others optimized away. (I just used a volatile __m256i sink to assign things to as a way to stop the optimizer removing the loop entirely.)

https://godbolt.org/z/fzs9PEbMq shows that with heavier optimization, using both count2 and count4 gets GCC to do a scalar load of the uint64_t and break it up with two separate scalar shifts, before vmovd xmm0, edx / ... / vmovd xmm0, eax. So that's quite bad. :/

     // compiles to a vpbroadcastw load with an offset
     // but violates strict aliasing
      __m256i v2 = _mm256_set1_epi16( *(1 + (uint16_t*)&p[i])  );

To make that safe, you could use memcpy into a temporary, or GNU C __attribute__((may_alias)). (The same attribute is used in the definition of __m256i itself).

typedef uint16_t aliasing_u16 __attribute__((aligned(1), may_alias));

      __m256i v1 = _mm256_set1_epi16(*(0 + (aliasing_u16*)&p[i]));
      __m256i v2 = _mm256_set1_epi16(*(1 + (aliasing_u16*)&p[i]));
      __m256i v3 = _mm256_set1_epi16(*(2 + (aliasing_u16*)&p[i]));
      __m256i v4 = _mm256_set1_epi16(*(3 + (aliasing_u16*)&p[i]));

This compiles with 4x vpbroadcastw loads (https://godbolt.org/z/6v9esqK9P). (Instructions using those loads elided.)

        vpbroadcastw    ymm1, WORD PTR [rdi]
...
        add     rdi, 8
        vpbroadcastw    ymm1, WORD PTR [rdi-6]
...
        vpbroadcastw    ymm1, WORD PTR [rdi-4]
...
        vpbroadcastw    ymm1, WORD PTR [rdi-2]
...

This is probably better, to avoid bottlenecking on port 5 on Intel CPUs: both vmovd xmm, eax and vpbroadcastw ymm, xmm are a single uop that can only run on port 5 on Skylake-family CPUs (https://agner.org/optimize/, https://uops.info/).

vpbroadcastw with a memory source still needs a shuffle uop (p5), but getting the data from elsewhere into the SIMD domain uses a load port instead of another port 5 uop. And it can micro-fuse the load into a single front-end uop.

Peter Cordes