
I'm currently trying to implement an AVX2 version (Haswell CPU) of some existing scalar code of mine, which implements a step like this:

#include <cstdint>
#include <vector>
#include <immintrin.h> // _bzhi_u32 (BMI2)

struct entry {
  uint32_t low, high;
};

// both filled with "random" data in previous loops
std::vector<entry> table;
std::vector<int>   queue;  // this is strictly increasing but
                           // without a constant delta

for (auto index : queue) {
  auto v = table[index];
  uint32_t rank = v.high + __builtin_popcount(_bzhi_u32(v.low, index % 32));
  use_rank(rank); // contains a lot of integer operations which nicely map to avx2
}

I've implemented this with 2 gather instructions that each load an int32, like this:

__m256i v_low  = _mm256_i32gather_epi32(reinterpret_cast<int *>(table.data()) + 0, index, 8);
__m256i v_high = _mm256_i32gather_epi32(reinterpret_cast<int *>(table.data()) + 1, index, 8);

Is there a faster way to load those values? I have thought about using 2 64-bit loads (which issue only half the number of reads => less traffic for the execution ports) and then shuffling the resulting vectors to get v_low and v_high, but sadly, as far as I can tell, most shuffle functions only allow shuffling within each 128-bit lane separately.

Edit for Paul R: This code is part of a substring enumeration routine using the Burrows-Wheeler Transform that I use in my compression algorithm. table contains rank data on a bit vector: the high part contains the number of ones in the previous entries, and the low part gets masked out and popcounted, then added, to get the final number of set bits in front of the given index. Afterwards a lot more computation happens that is luckily nicely parallelizable.

The deltas in the queue are very high at the beginning and the end (due to the nature of the algorithm). This caused a lot of cache misses and is the reason why I switched from SoA to AoS, using shifts to reduce the pressure on the load ports in the scalar code.

Using SoA would also result in the same independent gather instructions but would double the number of accessed cache lines.

Edit (partial answer): I tried using two _mm256_i32gather_epi64 to halve the number of memory accesses (and therefore the cycles; see here).

__m256i index; // contains the indices
__m128i index_low  = _mm256_extractf128_si256(index, 0);
__m128i index_high = _mm256_extractf128_si256(index, 1);
__m256i v_part1 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), index_low , 8);
__m256i v_part2 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), index_high, 8);

which loads my data into two ymm registers in this format (notation, not C++):

register v_part1:
[v[0].low][v[0].high][v[1].low][v[1].high][v[2].low][v[2].high][v[3].low][v[3].high]
register v_part2:
[v[4].low][v[4].high][v[5].low][v[5].high][v[6].low][v[6].high][v[7].low][v[7].high]

Is there an efficient way to interleave them in order to obtain the original format:

register v_low:
[v[0].low][v[1].low][v[2].low][v[3].low][v[4].low][v[5].low][v[6].low][v[7].low]
register v_high:
[v[0].high][v[1].high][v[2].high][v[3].high][v[4].high][v[5].high][v[6].high][v[7].high]
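
One direction that might work (untested sketch, added for illustration; p1, p2 and deint are names made up here, everything else is as above): _mm256_permutevar8x32_epi32 (vpermd) is one of the few AVX2 shuffles that moves 32-bit elements across the 128-bit lane boundary, so each gathered register could be de-interleaved on its own and the 128-bit halves recombined afterwards:

// Untested sketch: de-interleave each gather result with a cross-lane permute,
// then stitch the 128-bit halves back together.
const __m256i deint = _mm256_setr_epi32(0, 2, 4, 6, 1, 3, 5, 7);

// p1 = [v[0].low v[1].low v[2].low v[3].low | v[0].high v[1].high v[2].high v[3].high]
__m256i p1 = _mm256_permutevar8x32_epi32(v_part1, deint);
// p2 = [v[4].low v[5].low v[6].low v[7].low | v[4].high v[5].high v[6].high v[7].high]
__m256i p2 = _mm256_permutevar8x32_epi32(v_part2, deint);

// combine the low 128-bit halves and the high 128-bit halves
__m256i v_low  = _mm256_permute2x128_si256(p1, p2, 0x20);
__m256i v_high = _mm256_permute2x128_si256(p1, p2, 0x31);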
  • That code is nonsensical and not valid C++. – John Zwinck Jan 27 '17 at 08:11
  • @JohnZwinck: It's AVX intrinsics. – MSalters Jan 27 '17 at 08:17
  • @MSalters: I was referring to the code like `uint64_t v; v.low`. – John Zwinck Jan 27 '17 at 08:18
  • @JohnZwinck whoops, sorry. The original code loads 64 bits and does a shift (which was faster for me). I decided halfway through to make it a struct to make my intent clearer. I'll fix it. – Christoph Diegelmann Jan 27 '17 at 08:20
  • You might want to rethink your `table` data structure - make it SoA instead of AoS? Of course this decision depends on what else you are doing with this data, which you haven't told us. – Paul R Jan 27 '17 at 08:49
  • @Christoph: thanks for the update - that's helpful. Note that gather instructions are very slow/inefficient - a solution using shuffle/permute/unpack/whatever, even if it takes several instructions, would be preferable, but your non-contiguous indices will most likely scupper this idea. – Paul R Jan 27 '17 at 09:21
  • Yeah, that's why I tried to turn those two `_mm256_i32gather_epi32` into two `_mm256_i32gather_epi64` (which issue only half the number of reads and take around half the cycles). But sadly I can't come up with the code to shuffle it back to the same vector format. – Christoph Diegelmann Jan 27 '17 at 09:29
  • @Christoph: I think you can use `_mm256_srli_epi64`/`_mm256_slli_epi64` to get the required elements to "line up", and then use `_mm256_blend_epi32` to merge them. Looks like four instructions to me? – Paul R Jan 27 '17 at 15:17
  • @Christoph: d'oh - I just realised this will interleave the elements in the wrong order - please ignore above suggestion - I will go back to sleep now... – Paul R Jan 27 '17 at 15:39

1 Answer


I've found a way to reorder the values myself, using 5 instructions:

// this results in [01][45][23][67] when gathering
index = _mm256_permute4x64_epi64(index, _MM_SHUFFLE(3,1,2,0));

// gather the values
__m256i v_part1 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), _mm256_extractf128_si256(index, 0), 8);
__m256i v_part2 = _mm256_i32gather_epi64(reinterpret_cast<long long int*>(table.data()), _mm256_extractf128_si256(index, 1), 8);

// separates low and high values within each 128-bit lane
v_part1 = _mm256_shuffle_epi32(v_part1, _MM_SHUFFLE(3,1,2,0));
v_part2 = _mm256_shuffle_epi32(v_part2, _MM_SHUFFLE(3,1,2,0));

// unpack merges lows and highs back into index order: [01][23][45][67]
__m256i v_high = _mm256_unpackhi_epi64(v_part1, v_part2);
__m256i v_low  = _mm256_unpacklo_epi64(v_part1, v_part2);
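
For completeness, here is a rough, untested sketch of how v_low and v_high could then feed the rank computation from the question. AVX2 has no per-element popcount instruction, so this uses the common nibble-lookup trick via _mm256_shuffle_epi8; vindex is just a made-up name for a __m256i holding the eight queue indices in their original order (i.e. a copy taken before the permute4x64 above), so that it lines up with v_low/v_high:

// Untested sketch: vectorized rank = v_high + popcount(bzhi(v_low, index % 32)).

// mask = (1 << (index % 32)) - 1, the vector counterpart of _bzhi_u32(x, index % 32)
__m256i n    = _mm256_and_si256(vindex, _mm256_set1_epi32(31));
__m256i mask = _mm256_sub_epi32(_mm256_sllv_epi32(_mm256_set1_epi32(1), n),
                                _mm256_set1_epi32(1));
__m256i bits = _mm256_and_si256(v_low, mask);

// per-32-bit popcount via the nibble lookup trick
const __m256i lut = _mm256_setr_epi8(0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
                                     0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4);
const __m256i nib = _mm256_set1_epi8(0x0f);
__m256i lo  = _mm256_and_si256(bits, nib);
__m256i hi  = _mm256_and_si256(_mm256_srli_epi32(bits, 4), nib);
__m256i cnt = _mm256_add_epi8(_mm256_shuffle_epi8(lut, lo),
                              _mm256_shuffle_epi8(lut, hi));      // popcount per byte
// sum the four byte counts of each 32-bit element
cnt = _mm256_madd_epi16(_mm256_maddubs_epi16(cnt, _mm256_set1_epi8(1)),
                        _mm256_set1_epi16(1));

__m256i rank = _mm256_add_epi32(v_high, cnt);                     // input to use_rank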