
I'm on the Intel Intrinsics site and I can't figure out what combination of instructions I want. What I'd like to do is

result = high_table[i8>>4] & low_table[i8&15]

Where both tables are 16 bits (or more). Shuffle seems like what I want (_mm_shuffle_epi8), however getting an 8-bit value doesn't work for me. There doesn't seem to be a 16-bit version, and the non-byte version seems to need the second param as an immediate value.

How am I supposed to implement this? Do I call _mm_shuffle_epi8 twice for each table, cast it to 16 bits, and shift the value by 8? If so, which cast and shift instructions do I want to look at?

Eric Stotch
  • Unless your question is truly language-agnostic, please include a tag for the language. – Some programmer dude Apr 26 '20 at 05:47
  • AVX2 has `vpermd` : select dwords from a vector of 8x 32-bit elements. But other than that, yeah `pshufb` (`_mm_shuffle_epi8`) is the way to go *if* your lookup table is separable. (i.e. the lookup based on the upper nibble doesn't depend on the lower nibble). Then sure, do that twice and AND them together. – Peter Cordes Apr 26 '20 at 05:48
  • You're talking about concatenating the bits, but `&` in C is bitwise AND. I think you actually mean `(high_table[u8>>4] << 8) | low_table[u8 & 0xf]`. – Peter Cordes Apr 26 '20 at 05:50
  • @PeterCordes I'm not good at reading Intel's pseudocode, but vpermd could be a solution. No, I mean if I were to use two 8-bit lookups for high table I'd do `(highA<<8 | highB) & (lowA<<8 | lowB)` – Eric Stotch Apr 26 '20 at 05:53

1 Answer


To split your incoming indices into two vectors of nibbles, you want the usual bit-shift and AND. SSE doesn't have 8-bit shifts, so you have to emulate one with a wider shift and an AND to mask away the bits that shifted into the top of each byte. (Unfortunately for this use-case, _mm_shuffle_epi8 doesn't ignore those high bits: if the top bit of a selector byte is set, it zeros that output element.)
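
A minimal sketch of just that split (the same idiom shows up again in the full code below; split_nibbles is only an illustrative name, not something you need):

#include <immintrin.h>

// There's no _mm_srli_epi8, so shift wider lanes and mask off the bits that
// crossed over from the neighbouring byte (which also keeps selector bit 7 clear).
static inline void split_nibbles(__m128i bytes, __m128i *lo_idx, __m128i *hi_idx)
{
    const __m128i low_nibbles = _mm_set1_epi8(0x0f);
    *lo_idx = _mm_and_si128(bytes, low_nibbles);
    *hi_idx = _mm_and_si128(_mm_srli_epi16(bytes, 4), low_nibbles);
}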

You definitely do not want to widen your incoming i8 vector to 16-bit elements; that would not be usable with _mm_shuffle_epi8.


AVX2 has vpermd: select dwords from a vector of 8x 32-bit elements. (Only 3-bit indices, so it's not good for your use-case unless your nibbles are only 0..7.) AVX512BW has wider shuffles, including vpermi2w to index into a table of the concatenation of two vectors, or just vpermw to index words.
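
If you do have AVX512BW (+VL) available, a rough, untested sketch of the vpermw route might look like the following; lookup16_avx512 is a made-up name, and the two 16-entry uint16_t tables stand in for your real data.

#include <immintrin.h>
#include <stdint.h>

// 16 input bytes -> 16 words of high_table[i8>>4] & low_table[i8&15]
__m256i lookup16_avx512(__m128i indices,
                        const uint16_t high_table[16],
                        const uint16_t low_table[16])
{
    __m256i high_lut = _mm256_loadu_si256((const __m256i *)high_table); // 16x u16
    __m256i low_lut  = _mm256_loadu_si256((const __m256i *)low_table);

    __m256i words  = _mm256_cvtepu8_epi16(indices);               // zero-extend bytes to words
    __m256i lo_idx = _mm256_and_si256(words, _mm256_set1_epi16(0x0f));
    __m256i hi_idx = _mm256_srli_epi16(words, 4);                 // bytes were 0..255, no mask needed

    __m256i hi_val = _mm256_permutexvar_epi16(hi_idx, high_lut);  // vpermw
    __m256i lo_val = _mm256_permutexvar_epi16(lo_idx, low_lut);   // vpermw
    return _mm256_and_si256(hi_val, lo_val);
}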

But for 128-bit vectors with just SSSE3, yeah pshufb (_mm_shuffle_epi8) is the way to go. You'll need two separate vectors for high_table, one for the upper byte and one for the lower byte of each word entry. And another two vectors for the halves of low_table.

Use _mm_unpacklo_epi8 and _mm_unpackhi_epi8 to interleave the low 8 bytes of two vectors, or the high 8 bytes of two vectors. That will give you the 16-bit LUT results you want, with the upper half in each word coming from the high-half vector.

i.e. you're building a 16-bit LUT out of two 8-bit LUTs with this interleave. And you're repeating the process twice for two different LUTs.
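
If the interleave is hard to picture: read back as little-endian 16-bit words, word j of _mm_unpacklo_epi8(lo, hi) is (hi[j]<<8) | lo[j], and _mm_unpackhi_epi8 does the same for bytes 8..15. A scalar model of that (unpack_words_ref is just an illustrative name):

#include <stdint.h>

// lo[] holds the pshufb results from the low-byte LUT half, hi[] from the high-byte half.
static void unpack_words_ref(const uint8_t lo[16], const uint8_t hi[16],
                             uint16_t first[8], uint16_t second[8])
{
    for (int j = 0; j < 8; j++) {
        first[j]  = (uint16_t)((hi[j]     << 8) | lo[j]);      // _mm_unpacklo_epi8
        second[j] = (uint16_t)((hi[j + 8] << 8) | lo[j + 8]);  // _mm_unpackhi_epi8
    }
}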


The code would look something like

// UNTESTED, haven't tried even compiling this.
#include <immintrin.h>   // SSE2 / SSSE3 intrinsics (pshufb, punpck, pand)
#include <stdint.h>

// Produces 2 output vectors; you might want to just put this in a loop instead
// of making a helper function for 1 vector, so I'll omit actually returning them.
void foo(__m128i indices)
{
   // these optimize away, only used at compile time for the vector initializers
   static const uint16_t high_table[16] = {...};
   static const uint16_t low_table[16]  = {...};

   // each LUT needs a separate vector of high-byte and low-byte parts
   // Don't use SIMD intrinsics to load from the uint16_t tables and deinterleave at runtime;
   // just get the same 16x 2 x 2 bytes of data into vector constants at compile time.
   __m128i high_LUT_lobyte = _mm_setr_epi8(high_table[0]&0xff, high_table[1]&0xff, high_table[2]&0xff, ... );
   __m128i high_LUT_hibyte = _mm_setr_epi8(high_table[0]>>8, high_table[1]>>8, high_table[2]>>8, ... );

   __m128i low_LUT_lobyte = _mm_setr_epi8(low_table[0]&0xff, low_table[1]&0xff, low_table[2]&0xff, ... );
   __m128i low_LUT_hibyte = _mm_setr_epi8(low_table[0]>>8, low_table[1]>>8, low_table[2]>>8, ... );


    // split the input indexes: emulate the missing byte shift with a wider shift + AND
    __m128i lo_idx = _mm_and_si128(indices, _mm_set1_epi8(0x0f));
    __m128i hi_idx = _mm_and_si128(_mm_srli_epi32(indices, 4), _mm_set1_epi8(0x0f));

    __m128i lolo = _mm_shuffle_epi8(low_LUT_lobyte, lo_idx);
    __m128i lohi = _mm_shuffle_epi8(low_LUT_hibyte, lo_idx);

    __m128i hilo = _mm_shuffle_epi8(high_LUT_lobyte, hi_idx);
    __m128i hihi = _mm_shuffle_epi8(high_LUT_hibyte, hi_idx);

    // interleave results of the LUT lookups into vectors of 16-bit elements
    __m128i low_result_first  = _mm_unpacklo_epi8(lolo, lohi);
    __m128i low_result_second = _mm_unpackhi_epi8(lolo, lohi);
    __m128i high_result_first  = _mm_unpacklo_epi8(hilo, hihi);
    __m128i high_result_second = _mm_unpackhi_epi8(hilo, hihi);

    // first 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_first = _mm_and_si128(low_result_first, high_result_first);
    // second 8x 16-bit high_table[i8>>4] & low_table[i8&15] results
    __m128i and_second = _mm_and_si128(low_result_second, high_result_second);

    // TODO: do something with the results.
}

You could AND before interleaving: high halves against high halves and low against low, as in the sketch below. That might be somewhat better for instruction-level parallelism, letting execution of the ANDs overlap with the shuffles. (Intel Haswell through Skylake has only 1/clock throughput for shuffles.)
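
That reordering would look like this, reusing the variable names from the code above (equally untested):

    // AND the byte halves first, then interleave once; AND is bytewise, so the
    // results are identical to the version above.
    __m128i and_lo = _mm_and_si128(lolo, hilo);   // low bytes of both LUT results
    __m128i and_hi = _mm_and_si128(lohi, hihi);   // high bytes of both LUT results
    __m128i and_first  = _mm_unpacklo_epi8(and_lo, and_hi);
    __m128i and_second = _mm_unpackhi_epi8(and_lo, and_hi);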

Choosing variable names is a struggle with stuff like this. Some people just give up and use non-meaningful names for some intermediate steps.

Peter Cordes
  • Sounds good. Just to confirm I understand: I need to split my 16-bit high_table into highA and highB, both being 8 bits. I call _mm_shuffle_epi8 for both of them. Then I call _mm_unpacklo_epi8(A, B), which is essentially `(B<<8)|A`. But that only gets me 8x 16-bit; calling _mm_unpackhi_epi8 gets me the other `8x16`. Then I repeat this for my second table and AND them like the original code does. I think I understood correctly. Sounds like a good plan (and writing this out helps me think) – Eric Stotch Apr 26 '20 at 06:05
  • @EricStotch: after seeing your comment under the question I revised my answer based on what you're actually doing: two separate 16-entry LUTs of 16-bit words, separately indexed by input nibbles that just happen to be packed together. Not trying to emulate a 256-entry LUT of 16-bit words with an 8-bit index, which is what I was answering before (works if they're separable, like in some Galois Field and other use cases). – Peter Cordes Apr 26 '20 at 06:12
  • Anyway yes, your overall thing will handle 16 bytes of input nibble-pairs by splitting them up with right-shift / 2x AND, then 4x pshufb on 4 different vectors of LUT data. Then 2x punpcklbw and 2x punpckhbw, and 2x PAND to actually do the final AND, giving you 2 vectors (32 bytes) of 16x 16-bit results. – Peter Cordes Apr 26 '20 at 06:15
  • I'm awful at SIMD and need to find a book in the morning. It took me longer than it should have to figure out that the AND is _mm_and_si128, and it seems like I need to use 16 bits for the right shift (_mm_sra_epi16). I guess I'll need to use _mm_unpacklo_epi8 with 0's. At least I'm starting to get the hang of this – Eric Stotch Apr 26 '20 at 07:41
  • @EricStotch: x86 SIMD is notoriously non-orthogonal, with different element widths for the same operation added piecemeal in different SSE versions, and e.g. shifts only available in some operand sizes, not including 8-bit. But yeah, bitwise stuff doesn't have an element size so it's just `si128`. – Peter Cordes Apr 26 '20 at 07:52
  • But no, you don't need to unpacklo with zeros; the usual way is to emulate a byte shift by doing a wider shift and ANDing with `0xf`. You already need that vector constant for the low nibbles anyway. **See [Fast counting the number of set bits in \_\_m128i register](https://stackoverflow.com/q/17354971) for an example of unpacking bytes to low / high nibbles to feed a pshufb LUT** – Peter Cordes Apr 26 '20 at 07:52
  • I couldn't figure out the widening shift so I wrote a question. Hopefully someone else can answer it before you see this :) https://stackoverflow.com/questions/61446596/how-do-i-efficiently-lookup-16bits-in-a-128bit-simd-vector – Eric Stotch Apr 26 '20 at 19:13
  • @EricStotch: updated with an untested version. IDK if it even compiles, but it's the right idea at least. – Peter Cordes Apr 27 '20 at 00:22
  • Thanks but I already wrote the code! It's the same except for variable names. – Eric Stotch Apr 27 '20 at 02:13
  • @EricStotch: Hopefully it will help future readers, then. I thought there should be some actual code since I closed your other question as a duplicate of this, and you hadn't posted an answer here, so I wasn't sure you weren't still stuck on some part of it. Or that you'd made some efficiency mistake like loading from the 16-bit LUT arrays at runtime and using `_mm_pack_epi16` or something to make the LUT vectors. It probably only took 10 minutes or so to type out the code I'd already been picturing. – Peter Cordes Apr 27 '20 at 02:27