If the array length is not a multiple of 8 (for 32-bit integers), what is the best way to write the loop for it? The approach I have figured out so far is to split it into two separate loops: one main loop for almost all elements, and one tail loop with maskload/maskstore for the remaining 1-7 elements. But it doesn't look like the best way.
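// Needs <immintrin.h>; vec is assumed to be a std::vector<int>.
// Main loop: full blocks of 8 ints. The bound i + 8 <= size avoids unsigned
// underflow when size < 8 and does not skip the last full block.
for (std::size_t i = 0; i + 8 <= vec.size(); i += 8) {
    __m256i va = _mm256_loadu_si256((__m256i*)&vec[i]);
    // do some work
    _mm256_storeu_si256((__m256i*)&vec[i], va);
}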
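// Tail loop: runs at most once, for the remaining 1-7 elements.
for (std::size_t i = vec.size() - vec.size() % 8; i < vec.size(); i += 8) {
    int tmp = static_cast<int>(vec.size() % 8) + 1;
    signed char chArr[8] = {};
    for (int j = 0; j < 8; ++j) {
        chArr[j] -= --tmp;  // negative (sign bit set) only for j < size % 8
    }
    // maskload/maskstore touch only the lanes whose mask element has its sign bit set.
    __m256i mask = _mm256_setr_epi32(chArr[0], chArr[1], chArr[2], chArr[3],
                                     chArr[4], chArr[5], chArr[6], chArr[7]);
    __m256i va = _mm256_maskload_epi32(&vec[i], mask);
    // do some work
    _mm256_maskstore_epi32(&vec[i], mask, va);
}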
Could this be made to look better without hurting performance? Just dropping the second for-loop in favor of a single masked load doesn't help much, because it saves only 1 line out of a dozen.
If I put maskload/maskstore in the main loop, it will slow it down significantly. There is also no separate maskloadu/maskstoreu, so I'm not sure whether I can use this on an unaligned array.
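For reference, here is a minimal sketch of the same main-loop-plus-tail structure, with the tail mask read from a sliding-window lookup table instead of being built element by element. This is a swapped-in technique, not the code above. It assumes `vec` is a `std::vector<std::int32_t>` and that "do some work" means adding 1 to each element. The names `kMaskTable`, `tail_mask_lanes` and `add_one` are hypothetical, and a scalar fallback is included only so the sketch still builds when AVX2 is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef __AVX2__
#include <immintrin.h>
#endif

// Sliding-window table: eight all-ones lanes followed by eight zero lanes.
// Reading 8 lanes starting at index (8 - rem) gives rem active lanes, then zeros.
static const std::int32_t kMaskTable[16] = {
    -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0};

// Hypothetical helper: pointer to the 8 mask lanes for a tail of `rem` elements.
inline const std::int32_t* tail_mask_lanes(std::size_t rem) {
    assert(rem <= 8);
    return kMaskTable + (8 - rem);
}

// Hypothetical stand-in for "do some work": adds 1 to every element.
inline void add_one(std::vector<std::int32_t>& vec) {
    const std::size_t n = vec.size();
    const std::size_t full = n - n % 8;  // number of elements in full blocks
#ifdef __AVX2__
    const __m256i one = _mm256_set1_epi32(1);
    // Main loop: unmasked loads and stores on full blocks of 8.
    for (std::size_t i = 0; i < full; i += 8) {
        __m256i va = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&vec[i]));
        va = _mm256_add_epi32(va, one);
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(&vec[i]), va);
    }
    // Tail: at most one masked load/store for the remaining 1-7 elements.
    if (full < n) {
        const __m256i mask = _mm256_loadu_si256(
            reinterpret_cast<const __m256i*>(tail_mask_lanes(n - full)));
        __m256i va = _mm256_maskload_epi32(&vec[full], mask);
        va = _mm256_add_epi32(va, one);
        _mm256_maskstore_epi32(&vec[full], mask, va);
    }
#else
    // Scalar fallback (assumption: only here so the sketch builds without AVX2).
    for (std::size_t i = 0; i < n; ++i) vec[i] += 1;
#endif
}
```

Calling `add_one(vec)` on a vector of any length, including one shorter than 8 elements, should update every element exactly once. Whether the table lookup is actually faster than building the mask by hand would need to be measured.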