
I'm trying to speed up an algorithm which performs a series of table lookups. I'd like to use SSE2 or AVX2. I've tried using the _mm256_i32gather_epi32 intrinsic, but it is 31% slower than my scalar C code. Does anyone have suggestions for improvements or a different approach?

Timings: C code = 234, gathers = 340

static const int32_t g_tables[2][64];  // values between 0 and 63

template <int8_t which, class T>
static void lookup_data(int16_t * dst, T * src)
{
    const int32_t * lut = g_tables[which];

    // Disabled for now: 31% slower than the C code on Haswell.  Revisit on
    // Broadwell or Skylake (gather reciprocal throughput: 12 on Haswell,
    // 7 on Broadwell, 5 on Skylake).

#if 0
    if (sizeof(T) == sizeof(int16_t)) {
        __m256i avx0, avx1, avx2, avx3, avx4, avx5, avx6, avx7;
        __m128i sse0, sse1, sse2, sse3, sse4, sse5, sse6, sse7;
        __m256i mask = _mm256_set1_epi32(0xffff);

        avx0 = _mm256_loadu_si256((__m256i *)(lut));
        avx1 = _mm256_loadu_si256((__m256i *)(lut + 8));
        avx2 = _mm256_loadu_si256((__m256i *)(lut + 16));
        avx3 = _mm256_loadu_si256((__m256i *)(lut + 24));
        avx4 = _mm256_loadu_si256((__m256i *)(lut + 32));
        avx5 = _mm256_loadu_si256((__m256i *)(lut + 40));
        avx6 = _mm256_loadu_si256((__m256i *)(lut + 48));
        avx7 = _mm256_loadu_si256((__m256i *)(lut + 56));
        avx0 = _mm256_i32gather_epi32((int32_t *)(src), avx0, 2);
        avx1 = _mm256_i32gather_epi32((int32_t *)(src), avx1, 2);
        avx2 = _mm256_i32gather_epi32((int32_t *)(src), avx2, 2);
        avx3 = _mm256_i32gather_epi32((int32_t *)(src), avx3, 2);
        avx4 = _mm256_i32gather_epi32((int32_t *)(src), avx4, 2);
        avx5 = _mm256_i32gather_epi32((int32_t *)(src), avx5, 2);
        avx6 = _mm256_i32gather_epi32((int32_t *)(src), avx6, 2);
        avx7 = _mm256_i32gather_epi32((int32_t *)(src), avx7, 2);
        avx0 = _mm256_and_si256(avx0, mask);
        avx1 = _mm256_and_si256(avx1, mask);
        avx2 = _mm256_and_si256(avx2, mask);
        avx3 = _mm256_and_si256(avx3, mask);
        avx4 = _mm256_and_si256(avx4, mask);
        avx5 = _mm256_and_si256(avx5, mask);
        avx6 = _mm256_and_si256(avx6, mask);
        avx7 = _mm256_and_si256(avx7, mask);
        sse0 = _mm_packus_epi32(_mm256_castsi256_si128(avx0), _mm256_extracti128_si256(avx0, 1));
        sse1 = _mm_packus_epi32(_mm256_castsi256_si128(avx1), _mm256_extracti128_si256(avx1, 1));
        sse2 = _mm_packus_epi32(_mm256_castsi256_si128(avx2), _mm256_extracti128_si256(avx2, 1));
        sse3 = _mm_packus_epi32(_mm256_castsi256_si128(avx3), _mm256_extracti128_si256(avx3, 1));
        sse4 = _mm_packus_epi32(_mm256_castsi256_si128(avx4), _mm256_extracti128_si256(avx4, 1));
        sse5 = _mm_packus_epi32(_mm256_castsi256_si128(avx5), _mm256_extracti128_si256(avx5, 1));
        sse6 = _mm_packus_epi32(_mm256_castsi256_si128(avx6), _mm256_extracti128_si256(avx6, 1));
        sse7 = _mm_packus_epi32(_mm256_castsi256_si128(avx7), _mm256_extracti128_si256(avx7, 1));
        _mm_storeu_si128((__m128i *)(dst),      sse0);
        _mm_storeu_si128((__m128i *)(dst + 8),  sse1);
        _mm_storeu_si128((__m128i *)(dst + 16), sse2);
        _mm_storeu_si128((__m128i *)(dst + 24), sse3);
        _mm_storeu_si128((__m128i *)(dst + 32), sse4);
        _mm_storeu_si128((__m128i *)(dst + 40), sse5);
        _mm_storeu_si128((__m128i *)(dst + 48), sse6);
        _mm_storeu_si128((__m128i *)(dst + 56), sse7);
    }
    else
#endif
    {
        for (int32_t i = 0; i < 64; i += 4)
        {
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
            *dst++ = src[*lut++];
        }
    }
}
  • See also: [In what situation would the AVX2 gather instructions be faster than individually loading the data?](http://stackoverflow.com/questions/24756534/in-what-situation-would-the-avx2-gather-instructions-be-faster-than-individually). – Paul R Mar 04 '16 at 07:53
  • Updated my answer some more. I think your best hope is to specialize the code for a specific shuffle (contents of `g_tables`). With some `shufps` to move data between vectors and shuffle at the same time, and `pshufb`, you might be able to set up for some vector stores. – Peter Cordes Mar 04 '16 at 08:16

1 Answer


You're right that gather is slower than a PINSRD loop on Haswell. It's probably nearly break-even on Broadwell. (See also the x86 tag wiki for perf links, especially Agner Fog's insn tables, microarch pdf, and optimization guide.)


If your indices are small, or you can slice them up, pshufb can be used as a parallel LUT with 4-bit indices. It gives you sixteen 8-bit table entries, but you can use stuff like punpcklbw to combine two vectors of byte results into one vector of 16-bit results. (Separate tables for the high and low halves of the LUT entries, with the same 4-bit indices.)
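For a concrete picture of the nibble-LUT trick, here's a minimal sketch (the table contents are made up; only the mechanism matters). pshufb looks up each index byte in a 16-entry byte table, and zeroes any lane whose index byte has its high bit set:

#include <immintrin.h>

// 16 parallel lookups of 8-bit entries using 4-bit indices (illustrative table).
static inline __m128i lut16_nibble(__m128i indices)   // each index byte must be 0..15
{
    const __m128i table = _mm_setr_epi8(0, 1, 3, 7, 15, 31, 63, 127,
                                        -1, 126, 124, 120, 112, 96, 64, 0);
    // pshufb: out[i] = (indices[i] & 0x80) ? 0 : table[indices[i] & 0x0f]
    return _mm_shuffle_epi8(table, indices);
}

To get 16-bit results, run this twice with a "low half" table and a "high half" table and interleave the two byte-result vectors with _mm_unpacklo_epi8 / _mm_unpackhi_epi8.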

This kind of technique gets used for Galois Field multiplies, when you want to multiply every element of a big buffer of GF16 values by the same value (e.g. for Reed-Solomon error correction codes). Like I said, taking advantage of this requires exploiting special properties of your use-case.


AVX2 can do two 128b pshufbs in parallel, one in each lane of a 256b vector. There is nothing better until AVX512F: __m512i _mm512_permutex2var_epi32 (__m512i a, __m512i idx, __m512i b). There are byte (vpermi2b in AVX512VBMI), word (vpermi2w in AVX512BW), dword (this one, vpermi2d in AVX512F), and qword (vpermi2q in AVX512F) element-size versions. It is a full cross-lane shuffle that indexes into two concatenated source registers (like AMD XOP's vpperm).

The two different instructions behind the one intrinsic (vpermt2d / vpermi2d) give you a choice of overwriting the table with the result, or overwriting the index vector. The compiler will pick based on which inputs are reused.


Your specific case:

*dst++ = src[*lut++];

The lookup table is actually src, not the variable you've called lut. lut is walking through an array that is used as a shuffle-control mask for src.

You should make g_tables an array of uint8_t for best performance. The entries are only 0..63, so they fit. Zero-extending loads into full registers are as cheap as normal loads, so it just reduces the cache footprint. To use it with AVX2 gathers, use vpmovzxbd. The intrinsic is frustratingly difficult to use as a load, because there's no form that takes an int64_t *, only __m256i _mm256_cvtepu8_epi32 (__m128i a) which takes a __m128i. This is one of the major design flaws with intrinsics, IMO.
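A hedged sketch of that load pattern (assuming you've converted g_tables to uint8_t as suggested; the function and names here are mine): load 8 index bytes with _mm_loadl_epi64, widen with vpmovzxbd, then gather with scale 2 for 16-bit source elements.

#include <immintrin.h>
#include <stdint.h>

// Gather 8 int16_t values from src using 8 byte-sized indices (each 0..63).
static inline __m256i gather8_epi16(const int16_t *src, const uint8_t *idx)
{
    __m128i idx8  = _mm_loadl_epi64((const __m128i *)idx);  // load 8 index bytes
    __m256i idx32 = _mm256_cvtepu8_epi32(idx8);             // vpmovzxbd: 8 x u8 -> 8 x i32
    // scale = 2 because the source elements are 16-bit.  Each dword gather
    // reads 2 bytes past the selected element, and the high halves still
    // need masking before packing, exactly as in the question's code.
    return _mm256_i32gather_epi32((const int32_t *)src, idx32, 2);
}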

I don't have any great ideas for speeding up your loop. Scalar code is probably the way to go here. The SIMD code shuffles 64 int16_t values into a new destination, I guess. It took me a while to figure that out, because I didn't find the if (sizeof...) line right away, and there are no comments. :( It would be easier to read if you used sane variable names, not avx0... Using x86 gather instructions for elements smaller than 4B certainly requires annoying masking. However, instead of pack, you could use a shift and OR.
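To illustrate the shift-and-OR alternative to AND + packus (a sketch, not drop-in code: it assumes you load the index vectors interleaved so one gather feeds the even dst elements and the other feeds the odd ones, which is a change from the contiguous loads in the question):

// Combine two gather results, each holding a 16-bit value in the low half
// of every dword, into one vector of 16 int16_t (e0,o0,e1,o1,...).
static inline __m256i combine_shift_or(__m256i gathered_even, __m256i gathered_odd)
{
    __m256i lo = _mm256_and_si256(gathered_even, _mm256_set1_epi32(0xffff));
    __m256i hi = _mm256_slli_epi32(gathered_odd, 16);   // junk high halves shift out
    return _mm256_or_si256(lo, hi);
}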

You could make an AVX512 version for sizeof(T) == sizeof(int8_t) or sizeof(T) == sizeof(int16_t), because all of src will fit into one or two zmm registers.
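Here's roughly what that could look like for sizeof(T) == sizeof(int16_t) with AVX512BW (an untested sketch; function and variable names are mine). All 64 source words fit in two zmm registers, and vpermt2w/vpermi2w uses 6-bit indices into the concatenation of the two:

#include <immintrin.h>
#include <stdint.h>

// dst[i] = src[lut[i]] for i = 0..63, with lut stored as uint8_t (indices 0..63).
static void lookup64_epi16_avx512(int16_t *dst, const int16_t *src, const uint8_t *lut)
{
    __m512i src_lo = _mm512_loadu_si512(src);        // src[0..31]
    __m512i src_hi = _mm512_loadu_si512(src + 32);   // src[32..63]

    // Widen the byte indices to 16-bit so the word-granularity shuffle can use them.
    __m512i idx_lo = _mm512_cvtepu8_epi16(_mm256_loadu_si256((const __m256i *)lut));
    __m512i idx_hi = _mm512_cvtepu8_epi16(_mm256_loadu_si256((const __m256i *)(lut + 32)));

    // Each index selects one of the 64 words in the concatenation src_lo:src_hi.
    _mm512_storeu_si512(dst,      _mm512_permutex2var_epi16(src_lo, idx_lo, src_hi));
    _mm512_storeu_si512(dst + 32, _mm512_permutex2var_epi16(src_lo, idx_hi, src_hi));
}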


If g_tables was being used as a LUT, AVX512 could do it easily, with vpermi2b. You'd have a hard time without AVX512, though, because a 64-byte table is too big for pshufb. Using four lanes (16B) of pshufb for each input lane could work: mask off indices outside 0..15, then indices outside 16..31, etc., with pcmpgtb or something. Then you have to OR all four lanes together. So this sucks a lot.
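One way to spell out that masking (a sketch using byte compares on the high nibble of each index instead of pcmpgtb; same idea, and it only produces byte results, so it fits the case where g_tables itself is the LUT):

#include <immintrin.h>
#include <stdint.h>

// out[i] = table[idx[i]] for 16 byte indices in 0..63, via 4x pshufb + masking.
static inline __m128i lut64_bytes(const uint8_t table[64], __m128i idx)
{
    __m128i lo_nibble = _mm_and_si128(idx, _mm_set1_epi8(0x0f));
    __m128i hi_nibble = _mm_and_si128(_mm_srli_epi16(idx, 4), _mm_set1_epi8(0x0f));
    __m128i result = _mm_setzero_si128();
    for (int k = 0; k < 4; k++) {
        __m128i chunk     = _mm_loadu_si128((const __m128i *)(table + 16 * k));
        __m128i in_range  = _mm_cmpeq_epi8(hi_nibble, _mm_set1_epi8((char)k)); // idx in [16k, 16k+15]
        __m128i looked_up = _mm_shuffle_epi8(chunk, lo_nibble);
        result = _mm_or_si128(result, _mm_and_si128(looked_up, in_range));
    }
    return result;
}

That's a lot of work per 16 results, which is why it compares so badly with a single vpermi2b.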


possible speedups: design the shuffle by hand

If you're willing to design a shuffle by hand for a specific value of g_tables, there are potential speedups that way. Load a vector from src, shuffle it with a compile-time constant pshufb or pshufd, then store any contiguous blocks in one go. (Maybe with pextrd or pextrq, or even better movq from the bottom of the vector. Or even a full-vector movdqu).
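As a toy example of what "design the shuffle by hand" means (the g_tables pattern here is purely hypothetical): if some 8-element run of the output happened to come from one 16-byte chunk of src, a single load, a compile-time-constant pshufb and a store would cover it:

__m128i v = _mm_loadu_si128((const __m128i *)src);            // src[0..7] as int16_t
// Suppose dst[0..7] = src[3,2,1,0,7,6,5,4] for this particular g_tables.
const __m128i ctrl = _mm_setr_epi8(6,7, 4,5, 2,3, 0,1,        // words 3,2,1,0
                                   14,15, 12,13, 10,11, 8,9); // words 7,6,5,4
_mm_storeu_si128((__m128i *)dst, _mm_shuffle_epi8(v, ctrl));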

Actually, loading multiple src vectors and shuffling between them is possible with shufps. It works fine on integer data, with no slowdowns except on Nehalem (and maybe also on Core2). punpcklwd / dq / qdq (and the corresponding punpckhwd etc) can interleave elements of vectors, and give different choices for data movement than shufps.
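With intrinsics, shufps on integer data just needs casts; e.g. picking two dwords from each of two source vectors (the element choice here is arbitrary):

// result = { v0[0], v0[2], v1[1], v1[3] } as 32-bit elements
__m128i mixed = _mm_castps_si128(
        _mm_shuffle_ps(_mm_castsi128_ps(v0), _mm_castsi128_ps(v1),
                       _MM_SHUFFLE(3, 1, 2, 0)));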

If it doesn't take too many instructions to construct a few full 16B vectors, you're in good shape.

If g_tables can take on too many possible values, it might be possible to JIT-compile a custom shuffle function. This is probably really hard to do well, though.

Peter Cordes
  • I was hoping to avoid re-coding every time the tables changed. I had considered _mm256_shuffle_epi8 or some variation, but I was worried it would NOT save any time in the end. I'm curious to see whether the gather instruction actually saves time on Broadwell or Skylake. – ChipK Mar 04 '16 at 18:03
  • I coded up a solution that used SSE and a series of shuffles (and other operations), and unfortunately it was slower (time = 616); it might not be optimal either. – ChipK Mar 05 '16 at 00:26
  • @ChipK: Unfortunately, until AVX512, or maybe Skylake gather, I don't think there's much hope other than a manually-coded shuffle. Did you do it with 128b vectors, or 256b? You probably need a lot less shuffling to make contiguous 128b vectors. And I forgot to mention that immediate blends are fast. `_mm_blend_epi16` uses the shuffle port (of which Haswell only has one), but AVX2 `_mm_blend_epi32` can run on all three vector execution ports in Haswell to Skylake. There's also `_mm_alignr_epi8` for combining data from two vectors. – Peter Cordes Mar 05 '16 at 01:19
  • @ChipK: Also, did you try ordering your instructions so the `pack` operations overlap with the gathers? Gather takes so many uops that out-of-order execution might not even be getting started on packing the results of the first couple gathers until most of the gathers are done. This kind of insn schedule is sometimes done for you by the compiler, and sometimes attempts at doing it in C are defeated by the compiler. It probably won't help, and you're probably just totally bottlenecked by the gathers no matter what. – Peter Cordes Mar 05 '16 at 01:21
  • Agner said when he tested Skylake that Intel improved gather a bit more. I wonder how much? I just read the summary of his test but probably he provides more details in tests or a manual. – Z boson Mar 05 '16 at 14:44
  • @Zboson: `VPGATHERDD ymm, ymm` uops / recip throughput from Agner's tables: Haswell: 34/12. BDW: 14/7. SKL: 4/5. So it looks like SKL improved gather throughput some, and also significantly improved how much it can overlap with other work. The 128b xmm version is 20/9, 10/6, 4/4. So perhaps even Broadwell ymm gather is worth using for this, even though you have to unpack and repack. – Peter Cordes Mar 05 '16 at 21:13
  • Unfortunately, [Intel has patented this whole technique](https://www.google.com/patents/US20040054879) of using PSHUFB as a table lookup, including the "trick" of splitting it up into multiple shuffles if there are too many elements. How the patent office let through a method that people have been using forever (no doubt since way before Intel had any SIMD at all) is one thing, but why Intel would want to patent something that greatly discourages anyone who knows about it from using key instructions in their instruction set for such a common and powerful technique is beyond me. – BeeOnRope Jan 24 '18 at 07:31