I'm trying to figure out how to use masked loads and stores for the last few elements to be processed. My use case is converting a packed 10-bit data stream to 16-bit, which means loading 5 bytes for every 4 shorts stored. This results in load and store masks of different widths and types.
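
To make the bit layout concrete, here is a scalar version of one 5-byte-to-4-short group. It assumes the 10-bit fields are packed LSB-first; that bit order is only an assumption for illustration, not necessarily what the real stream uses:

    /*
     * Scalar reference for one group (sketch, assumes LSB-first packing).
     */
    static void convert_group_scalar(uint16_t out[4], const uint8_t in[5])
    {
        uint64_t bits = 0;
        for (int i = 0; i < 5; ++i)
            bits |= (uint64_t)in[i] << (8 * i);               // gather the 40 input bits
        for (int i = 0; i < 4; ++i)
            out[i] = (uint16_t)((bits >> (10 * i)) & 0x3FF);  // extract each 10-bit field
    }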

The main loop itself is not a problem. But at the end I'm left with up to 19 bytes of input / 15 shorts of output, which I thought I could process in up to two loop iterations using 128-bit vectors. Here is the outline of the code.

#include <immintrin.h>

#include <stddef.h>
#include <stdint.h>

void convert(uint16_t* out, ptrdiff_t n, const uint8_t* in)
{
    uint16_t* const out_end = out + n;
    for(uint16_t* out32_end = out + (n & -32); out < out32_end; in += 40, out += 32) {
        /*
         * insert main loop here using ZMM vectors
         */
    }
    if(out_end - out >= 16) {
        /*
         * insert half-sized iteration here using YMM vectors
         */
        in += 20;
        out += 16;
    }
    // up to 19 bytes of input remaining, up to 15 shorts of output
    const unsigned out_remain = out_end - out;
    const unsigned in_remain = (out_remain * 10 + 7) / 8; // ceil(out_remain * 10 / 8) bytes of input
    unsigned in_mask = (1 << in_remain) - 1;
    unsigned out_mask = (1 << out_remain) - 1;
    while(out_mask) {
        __mmask16 load_mask = _cvtu32_mask16(in_mask);
        __m128i packed = _mm_maskz_loadu_epi8(load_mask, in);
        /* insert computation here. No masks required */
        __mmask8 store_mask = _cvtu32_mask8(out_mask);
        _mm_mask_storeu_epi16(out, store_mask, packed);
        in += 10;
        out += 8;
        in_mask >>= 10;
        out_mask >>= 8;
    }
}

(Compile with -O3 -mavx2 -mavx512f -mavx512bw -mavx512vl -mavx512dq)
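
For context, the elided computation could look roughly like the sketch below, assuming LSB-first packing of the 10-bit fields (again an assumption for illustration; the helper name unpack10 is mine): shuffle the two source bytes covering each 10-bit field into one 16-bit lane, shift the field down with a per-lane variable shift, and mask to 10 bits.

    /*
     * Hypothetical sketch of the per-vector computation (assumes LSB-first
     * packing). Uses _mm_srlv_epi16, which needs AVX512BW+VL - both covered
     * by the flags above.
     */
    static __m128i unpack10(__m128i packed)
    {
        // byte pairs covering fields 0..7: (0,1)(1,2)(2,3)(3,4)(5,6)(6,7)(7,8)(8,9)
        const __m128i pair_shuffle = _mm_setr_epi8(
            0, 1, 1, 2, 2, 3, 3, 4, 5, 6, 6, 7, 7, 8, 8, 9);
        // bit offset of each field within its byte pair
        const __m128i shifts = _mm_setr_epi16(0, 2, 4, 6, 0, 2, 4, 6);
        __m128i pairs   = _mm_shuffle_epi8(packed, pair_shuffle);
        __m128i shifted = _mm_srlv_epi16(pairs, shifts);
        return _mm_and_si128(shifted, _mm_set1_epi16(0x3FF));
    }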

My idea was to create a bit mask from the number of remaining elements (since I know it fits comfortably in an integer / mask register), then shift bits out of the mask as the elements are processed.

I have two issues with this approach:

  1. I'm resetting the masks from GP registers on each iteration instead of using the kshift family of instructions
  2. _cvtu32_mask8 (kmovb) is the only instruction in this code that requires AVX512DQ. Limiting the set of suitable hardware platforms just for that seems weird (a possible workaround is sketched after this list)
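
For the second point, one workaround I can think of is to drop the intrinsic and let the compiler pick the mask move. This is only a sketch and relies on __mmask8 being a plain integer typedef on current GCC/Clang, which may not be guaranteed by the intrinsics API:

    /*
     * Sketch: avoid _cvtu32_mask8 (kmovb, AVX512DQ) with a plain cast; the
     * compiler can then move the mask with kmovw or kmovd, which only need
     * AVX512F / AVX512BW.
     */
    __mmask8 store_mask = (__mmask8)out_mask;
    _mm_mask_storeu_epi16(out, store_mask, packed);

Whether that kind of implicit conversion is actually blessed by the API is basically the same question as the cast below.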

What I'm wondering about:

Can I cast __mmask32 to __mmask16 and __mmask8?

If I can, I could set each mask once from a GP register, then shift it in its own mask register, like this:

    __mmask32 load_mask = _cvtu32_mask32(in_mask);
    __mmask32 store_mask = _cvtu32_mask32(out_mask);
    while(out < out_end) {
        __m128i packed = _mm_maskz_loadu_epi8((__mmask16) load_mask, in);
        /* insert computation here. No masks required */
        _mm_mask_storeu_epi16(out, (__mmask8) store_mask, packed);
        load_mask = _kshiftri_mask32(load_mask, 10);
        store_mask = _kshiftri_mask32(store_mask, 8);
        in += 10;
        out += 8;
    }

GCC seems to be fine with this pattern, but Clang and MSVC generate worse code, moving the mask in and out of GP registers for no apparent reason.

Homer512
  • In practice on real implementations (at least GCC and clang), `__mmask32` is just a typedef for `uint32_t`, and so on. So you can freely convert integers to masks, and the compiler will have to use `kmov` instructions. (Sometimes when you wish the compiler would use a `kshift`, it uses kmov to/from a GP register, so it's nice that intrinsics exist for mask operations to hint the compiler.) I'm not sure if this implicit conversion is officially part of the intrinsics API, or if there's some cast wrapper one could be using. – Peter Cordes Jul 15 '22 at 18:45
  • I'd recommend compiling with `-march=skylake-avx512` instead of manually enabling some AVX-512 extensions. You generally want to enable an appropriate `-mtune` option as well (which -march implies but `-mavx512f` doesn't), and enable other useful stuff like popcnt and BMI2 that all CPUs with AVX-512 also have. – Peter Cordes Jul 15 '22 at 18:48
  • Other than Xeon Phi (where you'd very likely want to compile binaries specifically for it, because different tuning choices are appropriate like -mno-vzeroupper), all AVX512 CPUs have AVX512DQ, and it's not very plausible that some future CPU will come along with AVX512VL but not AVX512DQ; those are pretty simple early extensions. Unlike newer fancier things like VNNI or 4NNIW (KNM) and maybe VPOPCNT, or maybe AVX-512VBMI (byte shuffles which a CPU might want to omit). https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512 Basically, I don't expect future CPUs to be less than skylake-avx512 – Peter Cordes Jul 15 '22 at 18:52
  • @PeterCordes You're right, the cast is unnecessary. The intrinsics still seem to help GCC. It's the only compiler that will reuse the mask when compiled that way. Sadly, Clang and MSVC create horrible code with those. Therefore not using mask intrinsics seems to be the only decent cross-platform strategy – Homer512 Jul 15 '22 at 20:07
  • Re: the actual problem you're trying to solve: often you can do a final store that may partially overlap previous stores, if the destination doesn't overlap the source or it's idempotent or you loaded early. And if you're not summing or other reduction that would be thrown off by redoing a calculation for the same element twice. – Peter Cordes Jul 15 '22 at 20:10
  • Related: [Missing AVX-512 intrinsics for masks?](https://stackoverflow.com/q/45167997) - there didn't used to be intrinsics for all mask operations, but it's still just up to the compiler to decide whether to kmov to GPR and back or not. With `kadd` and so on having 4-cycle latency, it's tempting for compilers to avoid, perhaps? (Usually throughput is more important, though.) – Peter Cordes Jul 18 '22 at 20:51
  • @PeterCordes Oh, you're right. For some reason I thought the latency was 1 or 2. And I turned it into a loop-carried dep chain. Woops. – Homer512 Jul 18 '22 at 20:56
  • A very short chain is fine (3 shifts); that's what out-of-order exec is for. Making the branching fully predictable by not doing an early-out is worth considering, just let zero-masked stores happen. (Unless that tends to go into an unmapped or not-dirty page and make fault suppression slow. Or touch cache lines that didn't need it.) – Peter Cordes Jul 18 '22 at 20:57
  • @PeterCordes True, but it explains why compilers were so hesitant to use them. – Homer512 Jul 18 '22 at 21:12

0 Answers