I'm currently trying to implement an AVX2 version (for a Haswell CPU) of some existing scalar code of mine, which implements a step like this:
struct entry {
    uint32_t low, high;
};

// both filled with "random" data in previous loops
std::vector<entry> table;
std::vector<int> queue; // strictly increasing, but without a constant delta

for (auto index : queue) {
    auto v = table[index];
    uint32_t rank = v.high + __builtin_popcount(_bzhi_u32(v.low, index % 32));
    use_rank(rank); // contains a lot of integer operations which nicely map to AVX2
}
I've implemented this with two gather instructions that each load an int32, like this:
__m256i v_low  = _mm256_i32gather_epi32(reinterpret_cast<int *>(table.data()) + 0, index, 8);
__m256i v_high = _mm256_i32gather_epi32(reinterpret_cast<int *>(table.data()) + 1, index, 8);
Is there a faster way to load those values? I have thought about using two 64-bit gathers instead (which issue only half the number of reads, so less traffic on the load ports) and then shuffling the resulting vectors to get v_low and v_high, but sadly, as far as I can tell, most shuffle instructions only operate within the two 128-bit lanes separately.
Edit for Paul R:
This code is part of a substring enumeration routine using the Burrows-Wheeler Transform that I use in my compression algorithm. table contains rank data on a bit vector: the high part holds the number of ones in all previous entries, and the low part gets masked out and popcounted, then added, to obtain the final number of set bits in front of the given index. Afterwards a lot more computation happens that is, luckily, nicely parallelizable.
The deltas in the queue are very high at the beginning and the end (due to the nature of the algorithm). This caused a lot of cache misses and is the reason why I switched from SoA to AoS, using shifts to reduce the pressure on the load ports in the scalar code.
Using SoA would also result in the same independent gather instructions, but would double the number of accessed cache lines.
Edit (partial answer):
I tried using two _mm256_i32gather_epi64 to halve the number of memory accesses (and therefore the cycles, see here).
__m256i index; // contains the indices
__m128i low  = _mm256_castsi256_si128(index);      // low lane: a free cast, no shuffle needed
__m128i high = _mm256_extracti128_si256(index, 1); // high lane
__m256i v_part1 = _mm256_i32gather_epi64(reinterpret_cast<long long int *>(table.data()), low,  8);
__m256i v_part2 = _mm256_i32gather_epi64(reinterpret_cast<long long int *>(table.data()), high, 8);
which loads my data into two ymm registers in this format (pseudo-notation, not C++):
register v_part1:
[v[0].low][v[0].high][v[1].low][v[1].high][v[2].low][v[2].high][v[3].low][v[3].high]
register v_part2:
[v[4].low][v[4].high][v[5].low][v[5].high][v[6].low][v[6].high][v[7].low][v[7].high]
Is there an efficient way to interleave them in order to obtain the original format:
register v_low:
[v[0].low][v[1].low][v[2].low][v[3].low][v[4].low][v[5].low][v[6].low][v[7].low]
register v_high:
[v[0].high][v[1].high][v[2].high][v[3].high][v[4].high][v[5].high][v[6].high][v[7].high]