
I have a loop that's adding int16s from two arrays together via _mm_add_epi16(). There's a small array and a large array, and the results get written back to the large array. The intrinsic may get fewer than 8 int16s (128 bits) from the small array if it has reached its end - how do I store the results of _mm_add_epi16() back into standard memory (int16_t*) when I don't want all of its 128 bits? Padding the array to a power-of-two size is not an option. Example:

int16_t* smallArray;
int16_t* largeArray;
__m128i inSmallArray = _mm_load_si128((__m128i*)smallArray);
__m128i* pInLargeArray = (__m128i*)largeArray;
__m128i inLargeArray = _mm_load_si128(pInLargeArray);
inLargeArray = _mm_add_epi16(inLargeArray, inSmallArray);
_mm_store_si128(pInLargeArray, inLargeArray);

My guess is that I need to substitute _mm_store_si128() with a "masked" store somehow.

Peter Cordes
GlassBeaver
  • You can address its elements directly. That's because `__m128i` is a union. – ALX23z Aug 30 '20 at 15:23
  • Is the width a compile-time constant on any code paths, or easy to branch on? There are instructions like `movq` and `movd` that can store 8 or 4 bytes. Or if the length of smallArray is at least 16, you can do an unaligned final vector if you arrange your loop to leave the result in a variable to be stored next iteration (or when leaving the loop, after loading data for potentially-overlapping unaligned final vector). – Peter Cordes Aug 30 '20 at 15:52
  • @PeterCordes the width is determined at runtime and I would need 16-bit granularity. I might be able to guarantee that smallArray be at least 16 bytes - can you elaborate on what unaligned final vector means? thank you! – GlassBeaver Aug 30 '20 at 17:02
  • 2
    I mean like [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/a/34322447) – Peter Cordes Aug 30 '20 at 17:12
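
For reference, a minimal sketch of the "unaligned final vector" idea from the comments above: compute the last full vector's sum, but load the inputs of a final, possibly overlapping, unaligned vector before storing it, so the overlapped elements are simply recomputed from still-unmodified data and rewritten with the same values. Function and variable names below are illustrative, and the small array is assumed to hold at least 8 int16_t elements.

#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2

void add_small_to_large(int16_t* largeArray, const int16_t* smallArray, std::size_t smallCount)
{
    // Requires smallCount >= 8 so the final vector never starts before the buffers.
    std::size_t i = 0;
    for (; i + 16 <= smallCount; i += 8) // every full vector except the last one
    {
        __m128i a = _mm_loadu_si128((const __m128i*)(smallArray + i));
        __m128i b = _mm_loadu_si128((const __m128i*)(largeArray + i));
        _mm_storeu_si128((__m128i*)(largeArray + i), _mm_add_epi16(a, b));
    }

    // Last full vector: compute its sum but delay the store.
    __m128i a = _mm_loadu_si128((const __m128i*)(smallArray + i));
    __m128i b = _mm_loadu_si128((const __m128i*)(largeArray + i));
    __m128i r = _mm_add_epi16(a, b);

    // Load the overlapping final vector's inputs while largeArray is still unmodified there.
    std::size_t last = smallCount - 8;
    __m128i a2 = _mm_loadu_si128((const __m128i*)(smallArray + last));
    __m128i b2 = _mm_loadu_si128((const __m128i*)(largeArray + last));

    _mm_storeu_si128((__m128i*)(largeArray + i), r);
    _mm_storeu_si128((__m128i*)(largeArray + last), _mm_add_epi16(a2, b2));
}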

1 Answer

There is a _mm_maskmoveu_si128 intrinsic, which translates to maskmovdqu (in SSE2) or vmaskmovdqu (in AVX).

#include <immintrin.h> // x86 intrinsics (used by this and the following snippets)
#include <cassert>
#include <cstdint>
#include <cstring>

// Store masks. The highest bit in each byte indicates the byte to store.
alignas(16) const unsigned char masks[16][16] =
{
    { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00 }
};

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);
    _mm_maskmoveu_si128(mm, reinterpret_cast< const __m128i& >(masks[n]), static_cast< char* >(storage));
}

The problem with this code is that the maskmovdqu (and, presumably, vmaskmovdqu) instructions have an associated hint for non-temporal access to the target memory, which makes the store expensive and also requires a fence afterwards.
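
If this variant is used for the tail of a loop, the final partial store would then be followed by a store fence, roughly like this (tail_bytes and i are hypothetical loop variables):

// Partial store of the last tail_bytes (< 16) result bytes; the sfence makes
// the weakly-ordered NT store visible before any later stores.
store_n(_mm_add_epi16(inLargeArray, inSmallArray), tail_bytes, largeArray + i);
_mm_sfence();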

AVX adds new instructions vmaskmovps/vmaskmovpd (and AVX2 also adds vpmaskmovd/vpmaskmovq), which work similarly to vmaskmovdqu but do not have the non-temporal hint and only operate at 32- and 64-bit granularity. Note that the integer _mm_maskstore_epi32 used below requires AVX2.

// Store masks. The highest bit in each 32-bit element indicates the element to store.
alignas(16) const unsigned char masks[4][16] =
{
    { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0x00, 0x00, 0x00, 0x00 }
};

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 4u);
    _mm_maskstore_epi32(static_cast< int* >(storage), reinterpret_cast< const __m128i& >(masks[n]), mm);
}
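
With AVX1 only, the same 32-bit-granularity masked store is available through the float-domain _mm_maskstore_ps; a sketch reusing the masks table above, with the data bit-cast to the float domain (store_n_avx1 being an illustrative name):

void store_n_avx1(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 4u);
    // vmaskmovps: same masking behaviour at 32-bit granularity, no NT hint.
    _mm_maskstore_ps(static_cast< float* >(storage),
                     reinterpret_cast< const __m128i& >(masks[n]),
                     _mm_castsi128_ps(mm));
}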

AVX-512 adds masked stores, and you could use vmovdqu8/vmovdqu16 with an appropriate mask to store 8 or 16-bit elements.

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);
    _mm_mask_storeu_epi8(storage, static_cast< __mmask16 >((1u << n) - 1u), mm);
}

Note that the above requires AVX-512BW and VL extensions.
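
Since the question needs 16-bit granularity, a companion taking an element count instead of a byte count (store_n_epi16 being an illustrative name; same AVX-512BW+VL requirement) could look like this:

void store_n_epi16(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 8u); // n = number of int16_t elements to keep
    _mm_mask_storeu_epi16(storage, static_cast< __mmask8 >((1u << n) - 1u), mm);
}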

If you require 8 or 16-bit granularity and don't have AVX-512 then you're better off with a function that manually stores the vector register piece by piece.

void store_n(__m128i mm, unsigned int n, void* storage)
{
    assert(n < 16u);

    unsigned char* p = static_cast< unsigned char* >(storage);
    if (n >= 8u)
    {
        _mm_storel_epi64(reinterpret_cast< __m128i* >(p), mm);
        mm = _mm_unpackhi_epi64(mm, mm); // move high 8 bytes to the low 8 bytes
        n -= 8u;
        p += 8;
    }

    if (n >= 4u)
    {
        std::uint32_t data = _mm_cvtsi128_si32(mm);
        std::memcpy(p, &data, sizeof(data)); // typically generates movd
        mm = _mm_srli_si128(mm, 4);
        n -= 4u;
        p += 4;
    }

    if (n >= 2u)
    {
        std::uint16_t data = _mm_extract_epi16(mm, 0); // or _mm_cvtsi128_si32
        std::memcpy(p, &data, sizeof(data));
        mm = _mm_srli_si128(mm, 2);
        n -= 2u;
        p += 2;
    }

    if (n > 0u)
    {
        std::uint32_t data = _mm_cvtsi128_si32(mm);
        *p = static_cast< std::uint8_t >(data);
    }
}
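
Tying this back to the question, a tail-handling loop built on the last store_n could look roughly like this (a sketch; add_arrays and smallCount are illustrative names, and the partial load of the small array goes through a zero-padded temporary since there is no masked load here):

// Full vectors are processed normally; the leftover 1..7 elements are added
// via a zero-padded temporary and written back with store_n (byte count).
void add_arrays(int16_t* largeArray, const int16_t* smallArray, std::size_t smallCount)
{
    std::size_t i = 0;
    for (; i + 8 <= smallCount; i += 8)
    {
        __m128i a = _mm_loadu_si128(reinterpret_cast< const __m128i* >(smallArray + i));
        __m128i b = _mm_loadu_si128(reinterpret_cast< const __m128i* >(largeArray + i));
        _mm_storeu_si128(reinterpret_cast< __m128i* >(largeArray + i), _mm_add_epi16(a, b));
    }

    std::size_t rem = smallCount - i; // 0..7 leftover int16_t elements
    if (rem > 0)
    {
        alignas(16) int16_t tmp[8] = {};
        std::memcpy(tmp, smallArray + i, rem * sizeof(int16_t)); // partial load without reading past the end of smallArray
        __m128i a = _mm_load_si128(reinterpret_cast< const __m128i* >(tmp));
        // Assumes at least 8 int16_t are readable from largeArray at this offset.
        __m128i b = _mm_loadu_si128(reinterpret_cast< const __m128i* >(largeArray + i));
        store_n(_mm_add_epi16(a, b), static_cast< unsigned int >(rem * sizeof(int16_t)), largeArray + i);
    }
}
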
Andrey Semashev
  • I think `vmaskmovdqu` has the same NT semantics; I think only AVX `vmaskmovps` and related instructions with 4 or 8 byte granularity do plain masked stores (until AVX512BW `vmaskmovdqu8`) – Peter Cordes Aug 30 '20 at 15:55
  • @PeterCordes Intel SDM only documents a non-temporal hint for `maskmovdqu`, not for `vmaskmovdqu`. Do you have a reference where it also applies to `vmaskmovdqu`? – Andrey Semashev Aug 30 '20 at 16:02
  • 1
    @PeterCordes I found the description of `maskmovdqu`/`vmaskmovdqu` in AMD APM, and the description there does suggest that both instructions have a non-temporal hint. I've updated my answer. Thanks. PS: I could not find the description of `vmaskmovdqu8`, so I cannot say anything about it. – Andrey Semashev Aug 30 '20 at 16:30
  • That was a brain fart, AVX512 has masking for normal stores, which is why all vector `mov` instructions have an element-size as part of the mnemonic. I meant `vmovdqu8 [rdi]{k1}, xmm0` https://www.felixcloutier.com/x86/movdqu:vmovdqu8:vmovdqu16:vmovdqu32:vmovdqu64 – Peter Cordes Aug 30 '20 at 16:32
  • Re: Intel's docs: It's ambiguous whether the VEX encoding has an NT hint or not in the [Intel docs](https://www.felixcloutier.com/x86/maskmovdqu), but I think their phrasing is compatible with reality if we take "MASKMOVDQU" to mean either encoding of the same instruction. I think it's normal for Intel docs to just talk about the base name of the instruction when both VEX and legacy-SSE versions work the same. It would be nice if the doc was more explicit, though. – Peter Cordes Aug 30 '20 at 16:36
  • @AndreySemashev thank you! I should have mentioned it but I only have access to up to AVX1 and need 16 bit granularity - I guess I'll have to use the long version of store_n() with all those ifs – GlassBeaver Aug 30 '20 at 16:58