
If the array size is not divisible by 8 (for 32-bit integers), what is the best way to write the loop for it? The best approach I've figured out so far is to split it into 2 separate loops: one main loop for almost all of the elements, and one tail loop with maskload/maskstore for the remaining 1-7 elements. But it doesn't look like the best way.

// main loop: process 8 ints per iteration
for (auto i = 0; i < vec.size() - 8; i += 8) {
    __m256i va = _mm256_loadu_si256((__m256i*) & vec[i]);
    //do some work
    _mm256_storeu_si256((__m256i*) & vec[i], va);
}
// tail loop (runs at most once): build a mask whose sign bit is set
// for each of the remaining 1-7 elements, then maskload/maskstore them
for (auto i = vec.size() - vec.size() % 8; i < vec.size(); i += 8) {
    auto tmp = (vec.size() % 8) + 1;
    char chArr[8] = {};
    for (auto j = 0; j < 8; ++j) {
        chArr[j] -= --tmp;
    }
    __m256i mask = _mm256_setr_epi32(chArr[0],
        chArr[1], chArr[2], chArr[3], chArr[4], chArr[5], chArr[6], chArr[7]);
    __m256i va = _mm256_maskload_epi32(&vec[i], mask);
    //do some work
    _mm256_maskstore_epi32(&vec[i], mask, va);
}

Could it be made to look better without hurting performance? Just removing the second for-loop in favour of a single load doesn't help much, because that only saves 1 line out of a dozen.

If I put maskload/maskstore in the main loop, it slows it down significantly. There is also no maskloadu/maskstoreu, so I can't use this for unaligned arrays.

Vladislav Kogan
  • Why would you use a loop for a single load ??? –  Nov 21 '22 at 08:12
  • Yes, it can be removed, but there are still a dozen additional lines remaining (a few times more than in the main loop). – Vladislav Kogan Nov 21 '22 at 08:46
  • Pre-build all possible masks. – Yves Daoust Nov 21 '22 at 08:49
  • `vec.size() - 8` is dangerous. If this is an `std::vector`, size is unsigned. So if size is 7 or lower, you get wrap-around. Prefer `i + 8 <= vec.size()`. Also `auto i = 0` will be int. So your loop counter may be too small. – Homer512 Nov 21 '22 at 10:17
  • Related: [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933). (@YvesDaoust: you don't need each mask separately, you just need a sliding window into an array of -1 / 0, which won't be a cache-line split if you align that array.) – Peter Cordes Nov 21 '22 at 12:41
  • Another option if your update is idempotent is to do a final vector that ends at the end of the array. Avoiding a store-forwarding stall on that final iteration may take some doing, like starting the loop with a store, ending with a load for the next iteration. – Peter Cordes Nov 21 '22 at 12:41 (sketched below these comments)
  • @PeterCordes: right, but this is acceptable if the mask is not reloaded on every iteration. – Yves Daoust Nov 21 '22 at 12:47
  • @YvesDaoust: There's no reason to be loading a mask like that inside a loop anyway, especially since maskstore is quite slow on AMD CPUs, and not free on Intel. Even if you were, it doesn't cost any extra; at least on Intel CPUs (and recent AMD) a 32-byte vector load has zero extra cost as long as it comes entirely from within one 64-byte cache line. Like I said you should do, 64 bytes of constant data is cheaper than 8x 32 = 256 bytes. You'd only want a LUT if you needed arbitrary masks, like 4-bit masks for `vpmaskmovq` based on some compare result that might not be all-1 then all-0. – Peter Cordes Nov 21 '22 at 12:57
  • @VladislavKogan: You have it backwards: `_mm256_maskstore_epi32` is inherently unaligned. There is no alignment-required version of the asm instruction. https://www.felixcloutier.com/x86/vpmaskmov – Peter Cordes Nov 21 '22 at 12:59
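
For concreteness, here is a minimal sketch of the overlapping-final-vector idea from the comments above. Assumptions: the per-element work is idempotent (running it twice on the same element gives the same result) and the array holds at least 8 ints; do_work is a hypothetical placeholder, not anything from the question:

#include <vector>
#include <immintrin.h>

// Hypothetical placeholder for the real per-element work; it must be
// idempotent for the overlap trick below to be valid.
static inline __m256i do_work(__m256i v) { return v; }

void process_overlap(std::vector<int>& vec)   // assumes vec.size() >= 8
{
    std::size_t size = vec.size();
    int* data = vec.data();
    std::size_t i;
    for (i = 0; i + 8 <= size; i += 8) {
        __m256i va = _mm256_loadu_si256((__m256i*)(data + i));
        _mm256_storeu_si256((__m256i*)(data + i), do_work(va));
    }
    if (i < size) {
        // Final vector ends exactly at the end of the array. It overlaps the
        // last full iteration, so the elements in [size - 8, i) get processed
        // twice; that is only safe because the work is idempotent.
        __m256i va = _mm256_loadu_si256((__m256i*)(data + size - 8));
        _mm256_storeu_si256((__m256i*)(data + size - 8), do_work(va));
    }
}

As the comment notes, that final unaligned load can hit a store-forwarding stall against the previous iteration's store; reordering the loop body (store first, load for the next iteration last) is the suggested workaround.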

1 Answer


To expand on Yves' idea of prebuilding masks, here is one way to structure it:


#include <vector>
#include <immintrin.h>

void foo(std::vector<int>& vec)
{
    std::size_t size = vec.size();
    int* data = vec.data();
    std::size_t i;
    for(i = 0; i + 8 <= size; i += 8) {
        __m256i va = _mm256_loadu_si256((__m256i*) (data + i));
        asm volatile ("" : : : "memory"); // more work here
        _mm256_storeu_si256((__m256i*) (data + i), va);
    }
    static const int maskarr[] = {
        -1, -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0,  0
    };
    if(i < size) {
        __m256i mask = _mm256_loadu_si256((const __m256i*)(
                maskarr + (i + 8 - size)));
        __m256i va = _mm256_maskload_epi32(data + i, mask);
        asm volatile ("" : : : "memory"); // more work here
        _mm256_maskstore_epi32(data + i, mask, va);
    }
}

A few notes:

  • As mentioned in my comment, i + 8 <= vec.size() is safer as it avoids a possible wrap-around if vec.size() is 7 or lower
  • Use size_t or ptrdiff_t instead of int for such loop counters
  • The if to skip over the last part is important. Masked memory operations with an all-zero mask are very slow
  • The static mask array can be slimmed by two elements, since we know we never access an all-ones or an all-zeros mask (see the sketch right after these notes)
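
To illustrate that last note, a sketch of the slimmed tail (same logic as the function above, just a 14-entry table and the window index shifted down by one):

    static const int maskarr[] = {   // 14 entries: only 1..7 leading -1s are ever needed
        -1, -1, -1, -1, -1, -1, -1,
         0,  0,  0,  0,  0,  0,  0
    };
    if (i < size) {
        // i + 7 - size is in [0, 6], so the 8-element window stays inside maskarr
        __m256i mask = _mm256_loadu_si256((const __m256i*)(maskarr + (i + 7 - size)));
        __m256i va = _mm256_maskload_epi32(data + i, mask);
        // more work here
        _mm256_maskstore_epi32(data + i, mask, va);
    }
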
Homer512
  • [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](https://stackoverflow.com/q/34306933) shows an asm version of this idea. And some commentary about other things you can do, like a final unaligned vector if it's ok to process some elements more than once with a maybe-overlapping final vector. – Peter Cordes Nov 21 '22 at 12:42
  • @PeterCordes I never really understood that idea about doing an overlapped load. What do you do if the array is less than a full hardware vector start to end? Do you have a separate code branch just for that? – Homer512 Nov 21 '22 at 13:46
  • Yes, you'd need a separate case for that. In some programs, that code path will never execute because arrays are never tiny; in other cases (if a compiler auto-vectorized a loop that's actually always short), it will be the normal path. If the loop body isn't too big, this can work ok. I was playing around with how this would look in asm the other day, e.g. https://godbolt.org/z/4qbj5dhG9 is NASM syntax for adding two arrays into a separate destination, non-overlapping. Might be useful to teach GCC and/or clang to vectorize that way, maybe after some performance experiments. – Peter Cordes Nov 21 '22 at 13:52 (a sketch of that separate case follows these comments)
  • Why is `asm volatile ("" : : : "memory");` added here? Why is it necessary/preferred? – Vladislav Kogan Nov 23 '22 at 15:49
  • @VladislavKogan I only included that so that the compiler does not remove the memory loads and stores. Replace it with your actual data processing. – Homer512 Nov 23 '22 at 16:02
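
To make the "separate case" discussed above concrete, a rough sketch of how the dispatch could look; do_work_scalar is a hypothetical scalar version of the per-element work, not part of the original post:

#include <vector>

// Hypothetical scalar version of the per-element work.
static inline int do_work_scalar(int x) { return x; }

void process_any_size(std::vector<int>& vec)
{
    std::size_t size = vec.size();
    if (size < 8) {
        // Array smaller than one 8-int vector: take a separate path,
        // e.g. a plain scalar loop (or the masked tail from the answer above).
        for (std::size_t i = 0; i < size; ++i)
            vec[i] = do_work_scalar(vec[i]);
        return;
    }
    // size >= 8: run the vectorized loop, finishing either with the masked
    // tail from the answer or with a final, possibly overlapping vector as
    // discussed in the comments.
}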