
My processor is Intel 9700K.

I have either an __m128i or an __m256i containing chars, shorts or ints. I need to write a store function that ignores a given number of elements at the beginning, at the end, or at both ends.

For ints and wider I use _mm_maskstore_epi32, and though I would love to improve on its performance, it's not too bad.

However, for smaller types I originally went with _mm_maskmoveu_si128 and it is extremely slow - replacing it for short with the first code I tried (_mm_maskstore_epi32 + storing 1 short in scalar with a branch) resulted in a 10x performance improvement.
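
Roughly, that replacement looked like this (a minimal sketch, not the exact code I benchmarked - the helper name and the way the mask is built are just for illustration; assumes AVX2):

#include <immintrin.h>

// Store the first n shorts of reg to dst, leaving the rest of the destination
// untouched. Assumes 0 <= n <= 8 and AVX2 for _mm_maskstore_epi32.
void store_first_n_shorts(short* dst, __m128i reg, int n) {
    // Mask covering the first n/2 complete pairs of shorts (i.e. full ints).
    const __m128i idx  = _mm_set_epi32(3, 2, 1, 0);
    const __m128i lim  = _mm_set1_epi32(n / 2);
    const __m128i mask = _mm_cmpgt_epi32(lim, idx);  // store where idx < n/2
    _mm_maskstore_epi32(reinterpret_cast<int*>(dst), mask, reg);

    if (n & 1) {  // one leftover short: store it in scalar, with a branch
        alignas(16) short tmp[8];
        _mm_store_si128(reinterpret_cast<__m128i*>(tmp), reg);
        dst[n - 1] = tmp[n - 1];
    }
}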

So, my question is:

  1. I doubt that I'm the first person who needs this - is there maybe a good known way to do it?
  2. _mm_maskstore_epi32 accepts int*. Is it a requirement that this int* is aligned to 4 bytes? Maybe it's a requirement that it's aligned to 16 bytes (32 for the 256-bit register)? The internet is not terribly clear on that.

I mostly care about 256 bit registers over 128 bit ones.

UPD: I'm only using the masks on the boundaries of my array. The thing is - this was completely dominating my performance even on 1kb arrays (walking through 1kb of data and computing the values was less important than how I handle stores on the sides). I tried an even simpler alternative - just calling memcpy for the not-ignored elements - and it's faster than my clever mask_store hacks (probably because I don't need to prepare a mask for mask_store). I probably need something like a specialised memcpy for less than 32 bytes of data.

Denis Yaroshevskiy
  • Can you overwrite the memory with its preexisting values (i.e., load -> blend -> store)? Do you know at compile-time how many elements you need to store? And do you care about throughput, latency, ...? – chtz Jun 04 '20 at 00:57
  • There isn't good hardware support for masking narrow elements until AVX512BW (Skylake Xeon), with native masking for every instruction including `vmovdqu8`. Until then, you could maybe check the mask for having pairs of `short` elements the same so `epi32` will work, otherwise I think you have to loop over the vector and do narrow scalar stores. Or what chtz said: vector blend with the old contents of memory. That's probably going to be better than checking something about the mask bits. – Peter Cordes Jun 04 '20 at 01:00
  • @PeterCordes, @chtz - Yeah, I'm only using masks on the boundaries of my array already. The thing is that even on 1k of data I'm completely dominated by the mask stores on the sides. I tried basic `memcpy` - it performs better than my clever hacks. There are probably better hacks though. – Denis Yaroshevskiy Jun 04 '20 at 01:15
  • If you have to load or store any bytes in a 32-byte region inside one cache line, it's most efficient to just load or store them all with a SIMD vector store. The hardware is that wide (in Haswell and later); doing a masked store means preserving some of what was there before. If you don't need to do that, then don't! – Peter Cordes Jun 04 '20 at 01:18
  • @PeterCordes I do. I'm writing a generic algorithm that needs to work on an arbitrary array. I don't know what's to the left or to the right of my array. – Denis Yaroshevskiy Jun 04 '20 at 01:21
  • Oh, so you're wanting this for the end of a small array copy, small enough you want to avoid the overhead of a call to `memcpy`? Not for masking arbitrary elements in the middle? Usually the best strategy is to do a vector load that ends at the end of the source array, and store it into the corresponding spot in the destination. It's fine that it might overlap the last full vector store; the store buffer / L1d cache can absorb that no problem. CPUs with AVX also have efficient unaligned loads/stores. – Peter Cordes Jun 04 '20 at 01:25
  • Related: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933). If your copies are actually 1kiB, seriously just call `memcpy`, at least if you're on a system like GNU/Linux where memcpy uses AVX on systems that support it. glibc `memcpy` is very well optimized for large copies, including handling the start/end of the copy. And yes, `_mm_maskmoveu_si128` has an NT hint (evicts from cache) so you definitely don't want it. – Peter Cordes Jun 04 '20 at 05:22
  • @PeterCordes not memcpy, I'm writing an `inclusive_scan` and I need to handle all possible array sizes, specifically it's possible to get an array with a size less than my vector size. I can blend with a previous array that I stored, but this means more special cases and more code bloat. – Denis Yaroshevskiy Jun 04 '20 at 08:41
  • `memcpy` already has that bloat to make small copies fast, as well as large copies, if branch prediction predicts correctly. I'm still not clear whether your real problem can simply call `memcpy`, or if you need to avoid it for some correctness reason. – Peter Cordes Jun 04 '20 at 08:44
  • @PeterCordes - memcpy for char/short is the best solution I have so far. It's slower than `maskstore` for ints and that is still slower than I'd like it to be. I think I can do better. – Denis Yaroshevskiy Jun 04 '20 at 08:50
  • @PeterCordes - do you know if `_mm_maskstore_epi32` requires a 4 byte alignment? – Denis Yaroshevskiy Jun 05 '20 at 08:25
  • @DenisYaroshevskiy: It doesn't require alignment. SIMD instructions either require full alignment or none, not to an element size. The "exceptions" section on https://www.felixcloutier.com/x86/vmaskmov doesn't mention any alignment-related exceptions. It mentions something about behaviour with the AC flag set, but you can assume that's not the case. Otherwise plain scalar misaligned accesses would fault, so AC-enabled is unusable for normal compiler-generated code. – Peter Cordes Jun 05 '20 at 08:38
  • @PeterCordes - unfortunately, didn't help. I posted all of the numbers below if you are interested. – Denis Yaroshevskiy Jun 06 '20 at 17:19
  • @PeterCordes measured different approaches here: https://stackoverflow.com/a/62492369/5021064 if you are interested – Denis Yaroshevskiy Jun 20 '20 at 23:45

3 Answers


Unfortunately, I didn't quite get as fast as I wanted to be - so I will leave the question open in case someone knows a better answer.

Where the problem originated

I was looking into how to implement inclusive scan in-place on top of AVX2 SIMD extensions. My solution is entirely based on @Zboson's answer.

  [a      b           c               d        ]
+ [0      a           b               c        ]
= [a   (a + b)     (b + c)         (c + d)     ]
+ [0      0           a            (a + b)     ]
= [a   (a + b)   (a + b + c)   (a + b + c + d) ]
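
For a single 128-bit register of ints, that shift-and-add pattern is roughly the following (a minimal sketch of the idea from @Zboson's answer, not my actual eve-based code):

#include <immintrin.h>

// In-register inclusive scan of 4 ints: [a, b, c, d] -> [a, a+b, a+b+c, a+b+c+d].
__m128i inclusive_scan_epi32(__m128i x) {
    x = _mm_add_epi32(x, _mm_slli_si128(x, 4));  // shift by one element (fills with 0), add
    x = _mm_add_epi32(x, _mm_slli_si128(x, 8));  // shift by two elements, add
    return x;
}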

Every one-range algorithm that I implemented before worked well with the following iteration pattern (pseudocode):

auto aligned_f = previous_aligned_address(f);
auto aligned_l = previous_aligned_address(l);
ignore_first_n ignore_first{f - aligned_f};

if (aligned_f != aligned_l) {
   step(aligned_f, ignore_first);  // Do a simd step, ignoring everything 
                                   // between aligned_f and f.
   aligned_f += register_width;
   ignore_first = ignore_first_n{0};

   // Big unrolled loop.
   main_loop(aligned_f, aligned_l);

   if (aligned_f == aligned_l) return;
}

ignore_last_n ignore_last {aligned_l + register_width - l};
ignore_first_last ignore = combine(ignore_first, ignore_last);

// Do a simd step, ignoring everything between aligned_l and l.
// + handle the case when register is bigger than the array size.
step(aligned_l, ignore);

(If you do not know why it's OK to do this - see).

As both @PeterCordes and @PaulR mentioned, if you change the iteration pattern - mix in some of the other values and do a plain unaligned store - then you can do at most one true masked store, only when the register does not fit completely. This is probably what I'll have to do.

However, that generates more assembly and I was not sure if I implemented store(address, register, ignore) in the most efficient way possible - hence my question.

UPDATE: I did try this. Even without mixing anything in, you can just first load 2 overlapping registers and then store them back. It made things slightly worse. This does not seem to be a good idea, at least for inclusive scan.

Measurements

I defined "fast enough" as "beat the scalar version on 40 bytes of data" - 40 chars, 20 shorts and 10 integers. You might notice that 40 bytes is bigger than the register size - so I would have to add an even smaller measurement for a more complicated iteration pattern.

I show the measurements for 2 cases: <256, 1> - use 256-bit registers, no unrolling; <256, 2> - unroll the main loop twice.

NOTE: In benchmarks I account for possible code alignment issues by aligning benchmarking code in 64 different ways and picking minimum value.

_mm_maskmoveu_si128

Originally I went with _mm256_maskstore for sizeof(T) >= 4 and 2 _mm_maskmoveu_si128 for the rest.

_mm_maskmoveu_si128 benchmarks

This, as you can see, performed extremely poorly - for char we lose to the scalar code by about 10 times, by about 20 times for short and 2 times for int.

Use memcpy for char and short

I tried a few different things: use _mm256_maskstore for short, memcpy for int, write my own inline memcpy for this case. The best I got was: memcpy for char and short, and maskstore for int.

memcpy/maskstore benchmark

It's a win for char (a couple of nanoseconds difference between no unrolling and unrolling twice), about a 30% loss for short and a 50% loss for int.

So, at the very least with my implementation of store(ptr, reg, ignore) I need to do a different iteration pattern if I don't want to peel loops.

Listing for store(addr, reg, ignore)

NOTE: I removed wrappers and adapters, might have added a few bugs.

// Only showing one ignore_broadcast, they are very similar and
// are actually generated with templates
template <register_256 Register, std::same_as<int> T>
inline __m256i ignore_broadcast(ignore_first_n ignore) {
     __m256i idxs = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
     __m256i n_broadcasted = _mm256_set1_epi32(ignore.n - 1);
     return _mm256_cmpgt_epi32(idxs, n_broadcasted);
}

template <typename Register, typename T, typename Ignore>
void store(Register reg, T* ptr, Ignore ignore) {
    if constexpr (sizeof(T) >= 4) {
        const auto mask = ignore_broadcast<Register, T>(ignore);
        _store::maskstore(ptr, mask, reg);
        return;
    }

    std::size_t start = 0, n = sizeof(reg) / sizeof(T);
    if constexpr (std::is_same_v<Ignore, ignore_first_n>) {
        start += ignore.n;
        n -= ignore.n;
    } else if constexpr (std::is_same_v<Ignore, ignore_last_n>) {
        n -= ignore.n;
    } else {
        static_assert(std::is_same_v<Ignore, ignore_first_last>);
        start += ignore.first_n;
        n -= ignore.first_n + ignore.last_n;
    }

    // This requires spilling the register to the stack.
    std::memcpy(ptr + start, reinterpret_cast<T*>(&reg) + start, n * sizeof(T));
}
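
For reference, the ignore_last_n counterpart would be along these lines (my reconstruction here, since the listing only shows the ignore_first_n one; it uses the same helpers as above):

template <register_256 Register, std::same_as<int> T>
inline __m256i ignore_broadcast(ignore_last_n ignore) {
     __m256i idxs   = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
     __m256i keep_n = _mm256_set1_epi32(8 - ignore.n);
     return _mm256_cmpgt_epi32(keep_n, idxs);  // keep idx < 8 - n
}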

What does memcpy do

This is the memcpy that gets called.

It implements copy for under 32 bytes in the following way:

    #if VEC_SIZE > 16
        /* From 16 to 31.  No branch when size == 16.  */
    L(between_16_31):
        vmovdqu        (%rsi), %xmm0
        vmovdqu        -16(%rsi,%rdx), %xmm1
        vmovdqu        %xmm0, (%rdi)
        vmovdqu        %xmm1, -16(%rdi,%rdx)
        ret
    #endif
    L(between_8_15):
        /* From 8 to 15.  No branch when size == 8.  */
        movq        -8(%rsi,%rdx), %rcx
        movq        (%rsi), %rsi
        movq        %rcx, -8(%rdi,%rdx)
        movq        %rsi, (%rdi)
        ret
    L(between_4_7):
        /* From 4 to 7.  No branch when size == 4.  */
        movl        -4(%rsi,%rdx), %ecx
        movl        (%rsi), %esi
        movl        %ecx, -4(%rdi,%rdx)
        movl        %esi, (%rdi)
        ret
    L(between_2_3):
        /* From 2 to 3.  No branch when size == 2.  */
        movzwl        -2(%rsi,%rdx), %ecx
        movzwl        (%rsi), %esi
        movw        %cx, -2(%rdi,%rdx)
        movw        %si, (%rdi)
        ret

So basically: take the biggest register that fits and do two overlapping stores. I tried to do that inline - calling memcpy was faster - though maybe I didn't do it right.
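
For the 16 to 31 byte case, the inline equivalent of that trick looks roughly like this (a sketch, assuming 16 <= n <= 31; it mirrors glibc's between_16_31 branch above rather than being my actual attempt):

#include <immintrin.h>
#include <cstddef>

// Copy n bytes (16 <= n <= 31) with two possibly overlapping 16-byte moves.
void copy_16_to_31(void* dst, const void* src, std::size_t n) {
    const char* s = static_cast<const char*>(src);
    char*       d = static_cast<char*>(dst);
    __m128i head = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s));
    __m128i tail = _mm_loadu_si128(reinterpret_cast<const __m128i*>(s + n - 16));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(d), head);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(d + n - 16), tail);
}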

Assembly and code

Reading my code might be a bit tricky, especially because I'm relying on the eve library, which is not yet open-source.

So I compiled and published a couple of assembly listings:

  • Complete assembly for int, no unrolling
  • Complete assembly for short, no unrolling

My code can be found here

PS: Measuring big size

If you are interested, on a big enough array this type of vectorisation is a good win - on 10'000 bytes, for example.

measuring big size

About 5 times for chars, 3 times for shorts and 2 times for ints.

PS: On unrolling

I didn't come up with any clever unrolling. Very basic unrolling twice gives about a 10% win for 10'000 bytes of short. Unrolling more didn't help. The reason why the win is this small, I suspect, is that the algorithm is quite complicated.

unrolling measurements

Denis Yaroshevskiy

Didn't have a place to add this but it's related.

This question expanded for me into a more general question:
"How to modify array in-place if its size does not divide by the size of SIMD register".

Similar to what @PaulR said, I looked at a few approaches:

  1. scalar clean-up.
  2. use store(ignore) (somehow mask before the first byte and after the last byte)
  3. if the size of array allows for it, overlap the first/last stores with the adjacent ones.
  4. use unaligned loads/stores all the way and do a masked store as the last step.

NOTE: please take the results with a grain of salt, benchmarking is tricky and I might be wrong.

Code alignment

Short version: where your code is placed in the binary has a major effect on performance.
Longer version: easy perf blog, llvm conference talk

Benchmarks

I take an array of a given size in bytes, and apply the algorithm to it.
I test all of the code alignments from 0 to 64 by including a no-op slide of that size before my benchmark.
(no-op slide is not executed in measurement).

benchmarking code

Environment

  • processor: intel 9700K
  • compiler: clang-11, built from trunk
  • os: fresh ubuntu

store(ignore_first/ignore_last) implementations

Details in: previous answer. I use maskstore for int and memcpy for char and short.

Algorithms/Code

I mostly focus here on doubling every element (x = x + x).
I refer to this algorithm as transform.
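
In scalar terms, all of the vectorised transform variants below compute just this (a reference only; the function name is mine):

#include <algorithm>

// Scalar reference for "transform": double every element of [f, l) in place.
void transform_double(short* f, short* l) {
    std::transform(f, l, f, [](short x) { return static_cast<short>(x + x); });
}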

NOTE: my code is probably tricky to read, so I provide assembly for everything. Here it is if you want it. It relies on the not-yet-open-source library eve.

I have 4 versions:

  • auto-vectorised std::transform - it relies on loop peeling for boundaries and uses unaligned loads/stores. disassemble for ints godbolt std::transform
  • transform<256, 4> - version with aligned reads/writes; the first and last stores have to deal with being partially out of bounds by using store(ignore). I unroll 4 times, the compiler unrolls more on top. 256 means 256-bit registers. disassemble for ints
  • transform_overlap_stores<256, 4> - if there is more than 1 register's worth of array, loads two overlapping registers, transforms both and then stores them, to deal with the boundaries. This way there is no need to reload and blend. disassemble for ints
  • transform_unaligned<256, 4> - uses unaligned loads/stores. The last store is done with ignore. disassemble for ints

For baseline I also use:

  • reduce<256, 4> - adds up all the numbers. Again, I only unroll 4 times but the compiler unrolls more. disassemble for ints
  • inclusive_scan_inplace<256, 1>, inclusive_scan_inplace<256, 2> - implementations of inclusive scan - see the previous answer again. Unrolling twice is better for shorts; no unrolling is better for chars and ints. Uses store(ignore) for the first and last registers and aligned reads. disassemble for ints.

Given sufficient amount of data

As one might expect, given a noticeable amount of data and correctly aligned code, the strategy you choose for the sides is not important. The biggest size I measure is 10'000 bytes, and all of the transform algorithms finish in about 65ns.

all 4 transforms, 10K

The bit that I find interesting is that in a good scenario I don't see any penalty whatsoever for using unaligned loads/stores (which is what both std::transform and my transform_unaligned use).

It's also valuable to look at the code alignment impact here:

code alignment impact, 10k

I usually suspect branches in such code alignment swings, but transform_unaligned is not more branchy than transform. So maybe unaligned reads are sensitive?

Conclusion: assuming that you can control the alignment of your code, the strategy for handling boundaries matters only at small array sizes.

Stores are what's expensive

Let's compare 3 algorithms on 40 bytes' worth of shorts: reduce, transform, inclusive_scan. reduce does many more additions and also a bunch of swaps, compared to transform, getting semi-close to inclusive_scan.

reduce/transform/inclusive_scan

We can see, though, that the computation for reduce is much less important than the stores are for transform. We can also say that the shifts and computations for inclusive_scan account for slightly more than 20% of its time (transform does all of the same things except for a much simpler computation).

I tried to profile to get more information but I'm not good enough at that.

Comparing different strategies for 40 bytes of data

What I would like is to beat loop peeling (there are non-performance reasons why it's annoying). Obviously, if I go small enough (like to 1 or 2 elements), that's not going to work. I arbitrarily decided that if I beat loop peeling on 40 bytes, it's a success.

Two ignore vs peeling

The default approach of doing store(ignore) beats loop peeling for chars and shorts, but loses by about 25% for ints.

two ignore vs peeling, 40 bytes

Two ignore vs Unaligned and one ignore

Using unaligned loads/stores to get down to one ignore does not seem to be beneficial - the difference is within 0.2 nanoseconds, which I believe to be noise.

aligned vs unaligned, 40 bytes

Overlapping vs Two ignore

Overlapping stores are a win for chars and shorts, since store(ignore) uses memcpy for those. However, it does not solve my problem for int.

overlapping vs two ignore

UPD: I previously had a comparison here of inclusive scan with two ignores vs overlapping stores, but I found a mistake in it.

Given the increased complexity, I don't think I'll use this.

Two ignore vs peeling, inclusive scan

For completeness, reposting updated results for inclusive_scan - loop peeling does look very attractive. It sort of makes sense, since there is very little computational gain on 40 bytes. (40 bytes means two registers, so 64 bytes, but 24 of those are wasted).

two ignore vs peeling, inclusive scan

Conclusion: if you care about small sizes, loop peeling is valuable when modifying an array in place. Trying to overlap a store does not seem to be an interesting optimisation.

P.S. Loop peeling when just reading data.

std::reduce will be auto-vectorised, and it will peel the loop. My reduce won't; instead it replaces the elements loaded from outside the array with zeroes. That's a good strategy for 40 bytes of data.

reduce vs peeling

I have also seen similar results for find. Sure, 40 bytes is an arbitrary "small size", and if you go smaller you can probably get to where it's beneficial, but this is the boundary I cut off at.
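
The zero-fill load for the first register looks roughly like this (a sketch; the mask construction mirrors ignore_broadcast from my other answer, the function name is illustrative, and aligned_ptr is assumed to be 32-byte aligned so the load cannot cross into an unmapped page):

#include <immintrin.h>

// Aligned load of 8 ints, with the first `skip` elements (which lie before the
// real start of the array) replaced by zeroes - the identity for addition.
__m256i load_ignore_first(const int* aligned_ptr, int skip) {
    __m256i idxs = _mm256_set_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i keep = _mm256_cmpgt_epi32(idxs, _mm256_set1_epi32(skip - 1));  // idx >= skip
    __m256i v    = _mm256_load_si256(reinterpret_cast<const __m256i*>(aligned_ptr));
    return _mm256_and_si256(v, keep);  // zero out the ignored elements
}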

Denis Yaroshevskiy
  • Does current clang work around the uop-cache performance problem [introduced by Intel's microcode update to fix the JCC erratum](https://stackoverflow.com/questions/61016077/32-byte-aligned-routine-does-not-fit-the-uops-cache/61016915#61016915)? If not, that could explain a lot of the effect of code alignment or unrolling differences, if we're talking about alignment relative to a 32-byte boundary. – Peter Cordes Jun 20 '20 at 23:55
  • @PeterCordes - very under-qualified to reply. I know 2 things: a) I believe that the LSD is disabled (you showed me that at some point) b) perf goes from min to max every other no-op (0 - bad, 1 - good, 2 - bad, 3 - good ... to 64) https://pasteboard.co/Je2F2RE.png – Denis Yaroshevskiy Jun 21 '20 at 01:32

There are several different ways of handling data sizes that are not a multiple of whole SIMD vectors. Here are three possibilities:

  1. Scalar clean-up

    • process whole vectors using SIMD
    • process partial vector at end using scalar code
    • pro: simple to implement
    • con: inefficient unless no of SIMD iterations >> no of scalar iterations
  2. Masked final SIMD iteration

    • process whole vectors using SIMD
    • process partial vector using SIMD and a mask to merge (blend) new output values with original output values which are out of bounds
    • pro: more efficient than scalar clean-up
    • con: more complex, some code duplication
    • con with load/blend/store: non-atomic read-modify-write of data outside the array isn't thread safe, if other threads might be touching it. If your vectors are unaligned then touching an unmapped page would also be possible. Proper masked stores with fault suppression like AVX512 or _mm_maskstore_epi32 avoid both these problems.
  3. Overlap final vector

    • process whole vectors using SIMD
    • for final SIMD vector use overlap such that vector starts at n - vector_size (i.e. there will be an overlap of the last two vectors)
    • pro: simple to implement, never accesses elements outside bounds
    • con: only works for n >= vector_size

Choice of method will depend on a number of factors, but mainly the typical size and range of n.
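
For example, a minimal sketch of option 3 for an in-place element-wise operation (the doubling operation and the function name are purely illustrative):

#include <immintrin.h>
#include <cstddef>

// In place: double every int in [p, p + n), assuming n >= 8 (at least one full __m256i).
void double_all(int* p, std::size_t n) {
    // Load the final, possibly overlapping vector before the main loop stores anything.
    __m256i tail = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + n - 8));

    for (std::size_t i = 0; i + 8 <= n; i += 8) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(p + i));
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + i), _mm256_add_epi32(v, v));
    }

    // Storing the doubled tail last makes the overlap harmless: any elements the main
    // loop already doubled in the overlap region get overwritten with the correct value.
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(p + n - 8), _mm256_add_epi32(tail, tail));
}

Note that for an in-place operation the overlapping vector has to be loaded before the main loop stores over it (as above); for an out-of-place copy the ordering does not matter.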

Peter Cordes
Paul R
  • (1) seems to work semi-OK - I'm not processing everything using scalar code, only the mask_store, and it's the best I have so far. My question is essentially: how to do it better than just memcpy. (2) I can't really do that - I don't know what's outside of my array. It might be an unallocated page, there might be some atomics involved, who knows. (3) I really don't want to do that, since I'd still need (1) anyway for the case when n < vector_size. – Denis Yaroshevskiy Jun 04 '20 at 09:19
  • Hmm, if you’re using 1, and the main loop is 256 bit SIMD, then you can do an optional single 128 bit SIMD iteration after the main SIMD loop to reduce the no of scalar iterations when you have more than half a vector left. That reduces the average no of scalar iterations significantly. Still not optimal though if n is small. – Paul R Jun 04 '20 at 11:27
  • Turns out, this is actually what memcpy does - I posted assembly in my extremely long answer, if you are interested. – Denis Yaroshevskiy Jun 06 '20 at 17:18
  • Did measurements for all approaches - see https://stackoverflow.com/a/62492369/5021064 if you are interested. – Denis Yaroshevskiy Jun 20 '20 at 23:44