
If the array size is not divisible by 8 (for 32-bit integers), what is the best way to write the loop for it? The best approach I've figured out so far is to split it into 2 separate loops: one main loop for almost all of the elements, and one tail loop with maskload/maskstore for the remaining 1-7 elements. But it doesn't look like the best way.

// main loop: process 8 ints per iteration
for (auto i = 0; i < vec.size() - 8; i += 8) {
    __m256i va = _mm256_loadu_si256((__m256i*) & vec[i]);
    //do some work
    _mm256_storeu_si256((__m256i*) & vec[i], va);
}
// tail loop (runs at most once): build a mask whose sign bit is set
// for each of the remaining 1-7 elements, then maskload/maskstore them
for (auto i = vec.size() - vec.size() % 8; i < vec.size(); i += 8) {
    auto tmp = (vec.size() % 8) + 1;
    char chArr[8] = {};
    for (auto j = 0; j < 8; ++j) {
        chArr[j] -= --tmp;
    }
    __m256i mask = _mm256_setr_epi32(chArr[0],
        chArr[1], chArr[2], chArr[3], chArr[4], chArr[5], chArr[6], chArr[7]);
    __m256i va = _mm256_maskload_epi32(&vec[i], mask);
    //do some work
    _mm256_maskstore_epi32(&vec[i], mask, va);
}

Could it be made to look better without hurting performance? Just removing the second for-loop in favour of a single load doesn't help much, because that only saves 1 line out of a dozen.

If I put maskload/maskstore in the main loop, it slows it down significantly. There is also no maskloadu/maskstoreu, so I can't use this for unaligned arrays.

Vladislav Kogan
  • Why would you use a loop for a single load ??? –  Nov 21 '22 at 08:12
  • Yes, it can be removed, but there are still a dozen additional lines remaining (a few times more than in the main loop). – Vladislav Kogan Nov 21 '22 at 08:46
  • Pre-build all possible masks. – Yves Daoust Nov 21 '22 at 08:49
  • `vec.size() - 8` is dangerous. If this is an `std::vector`, size is unsigned. So if size is 7 or lower, you get wrap-around. Prefer `i + 8 <= vec.size()`. Also `auto i = 0` will be int. So your loop counter may be too small. – Homer512 Nov 21 '22 at 10:17
  • Related: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933). (@YvesDaoust: you don't need each mask separately, you just need a sliding window into an array of -1 / 0, which won't be a cache-line split if you align that array.) – Peter Cordes Nov 21 '22 at 12:41
  • Another option if your update is idempotent is to do a final vector that ends at the end of the array. Avoiding a store-forwarding stall on that final iteration may take some doing, like starting the loop with a store, ending with a load for the next iteration. – Peter Cordes Nov 21 '22 at 12:41 (sketched below these comments)
  • @PeterCordes: right, but this is acceptable if the mask is not reloaded on every iteration. – Yves Daoust Nov 21 '22 at 12:47
  • @YvesDaoust: There's no reason to be loading a mask like that inside a loop anyway, especially since maskstore is quite slow on AMD CPUs, and not free on Intel. Even if you were, it doesn't cost any extra; at least on Intel CPUs (and recent AMD) a 32-byte vector load has zero extra cost as long as it comes entirely from within one 64-byte cache line. Like I said you should do, 64 bytes of constant data is cheaper than 8x 32 = 256 bytes. You'd only want a LUT if you needed arbitrary masks, like 4-bit masks for `vpmaskmovq` based on some compare result that might not be all-1 then all-0. – Peter Cordes Nov 21 '22 at 12:57
  • @VladislavKogan: You have it backwards: `_mm256_maskstore_epi32` is inherently unaligned. There is no alignment-required version of the asm instruction. https://www.felixcloutier.com/x86/vpmaskmov – Peter Cordes Nov 21 '22 at 12:59
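
For concreteness, here is a minimal sketch of the overlapping-final-vector idea from the comments above. Assumptions: the per-element work is idempotent (running it twice on the same element gives the same result) and the array holds at least 8 ints; do_work is a hypothetical placeholder, not anything from the question:

#include <vector>
#include <immintrin.h>

// Hypothetical placeholder for the real per-element work; it must be
// idempotent for the overlap trick below to be valid.
static inline __m256i do_work(__m256i v) { return v; }

void process_overlap(std::vector<int>& vec)   // assumes vec.size() >= 8
{
    std::size_t size = vec.size();
    int* data = vec.data();
    std::size_t i;
    for (i = 0; i + 8 <= size; i += 8) {
        __m256i va = _mm256_loadu_si256((__m256i*)(data + i));
        _mm256_storeu_si256((__m256i*)(data + i), do_work(va));
    }
    if (i < size) {
        // Final vector ends exactly at the end of the array. It overlaps the
        // last full iteration, so the elements in [size - 8, i) get processed
        // twice; that is only safe because the work is idempotent.
        __m256i va = _mm256_loadu_si256((__m256i*)(data + size - 8));
        _mm256_storeu_si256((__m256i*)(data + size - 8), do_work(va));
    }
}

As the comment notes, that final unaligned load can hit a store-forwarding stall against the previous iteration's store; reordering the loop body (store first, load for the next iteration last) is the suggested workaround.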

1 Answer


To expand on Yves' idea of prebuilding masks, here is one way to structure it:


#include <vector>
#include <immintrin.h>

void foo(std::vector<int>& vec)
{
    std::size_t size = vec.size();
    int* data = vec.data();
    std::size_t i;
    for(i = 0; i + 8 <= size; i += 8) {
        __m256i va = _mm256_loadu_si256((__m256i*) (data + i));
        asm volatile ("" : : : "memory"); // more work here
        _mm256_storeu_si256((__m256i*) (data + i), va);
    }
    static const int maskarr[] = {
        -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0
    };
    if(i < size) {
        __m256i mask = _mm256_loadu_si256((const __m256i*)(
                maskarr + (i + 8 - size)));
        __m256i va = _mm256_maskload_epi32(data + i, mask);
        asm volatile ("" : : : "memory"); // more work here
        _mm256_maskstore_epi32(data + i, mask, va);
    }
}

A few notes:

  • As mentioned in my comment, i + 8 <= vec.size() is safer as it avoids a possible wrap-around if vec.size() is 7 or lower
  • Use size_t or ptrdiff_t instead of int for such loop counters
  • The if to skip over the last part is important. Masked memory operations with an all-zero mask are very slow
  • The static mask array can be slimmed by two elements, since we know we never access an all-ones or an all-zeros mask (see the sketch right after these notes)
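
To illustrate that last note, a sketch of the slimmed tail (same logic as the function above, just a 14-entry table and the window index shifted down by one):

    static const int maskarr[] = {   // 14 entries: only 1..7 leading -1s are ever needed
        -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0
    };
    if (i < size) {
        // i + 7 - size is in [0, 6], so the 8-element window stays inside maskarr
        __m256i mask = _mm256_loadu_si256((const __m256i*)(maskarr + (i + 7 - size)));
        __m256i va = _mm256_maskload_epi32(data + i, mask);
        // more work here
        _mm256_maskstore_epi32(data + i, mask, va);
    }
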
Homer512
  • [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) shows an asm version of this idea. And some commentary about other things you can do, like a final unaligned vector if it's ok to process some elements more than once with a maybe-overlapping final vector. – Peter Cordes Nov 21 '22 at 12:42
  • @PeterCordes I never really understood that idea about doing an overlapped load. What do you do if the array is less than a full hardware vector start to end? Do you have a separate code branch just for that? – Homer512 Nov 21 '22 at 13:46
  • Yes, you'd need a separate case for that. In some programs, that code path will never execute because arrays are never tiny; in other cases (if a compiler auto-vectorized a loop that's actually always short), it will be the normal path. If the loop body isn't too big, this can work ok. I was playing around with how this would look in asm the other day, e.g. https://godbolt.org/z/4qbj5dhG9 is NASM syntax for adding two arrays into a separate destination, non-overlapping. Might be useful to teach GCC and/or clang to vectorize that way, maybe after some performance experiments. – Peter Cordes Nov 21 '22 at 13:52 (a sketch of that separate case follows these comments)
  • Why is `asm volatile ("" : : : "memory");` added here? Why is it necessary/preferred? – Vladislav Kogan Nov 23 '22 at 15:49
  • @VladislavKogan I only included that so that the compiler does not remove the memory loads and stores. Replace it with your actual data processing. – Homer512 Nov 23 '22 at 16:02
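
To make the "separate case" discussed above concrete, a rough sketch of how the dispatch could look; do_work_scalar is a hypothetical scalar version of the per-element work, not part of the original post:

#include <vector>

// Hypothetical scalar version of the per-element work.
static inline int do_work_scalar(int x) { return x; }

void process_any_size(std::vector<int>& vec)
{
    std::size_t size = vec.size();
    if (size < 8) {
        // Array smaller than one 8-int vector: take a separate path,
        // e.g. a plain scalar loop (or the masked tail from the answer above).
        for (std::size_t i = 0; i < size; ++i)
            vec[i] = do_work_scalar(vec[i]);
        return;
    }
    // size >= 8: run the vectorized loop, finishing either with the masked
    // tail from the answer or with a final, possibly overlapping vector as
    // discussed in the comments.
}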