
I have an input vector of 16384 signed four-bit integers, packed into 8192 bytes. I need to deinterleave the values and unpack them into signed 8-bit integers in two separate arrays.

a,b,c,d are 4 bit values.
A,B,C,D are 8 bit values.

Input = [ab,cd,...]
Out_1 = [A,C, ...]
Out_2 = [B,D, ...]

I can do this quite easily in C++.

constexpr size_t size = 8192; // number of packed input bytes (16384 4-bit values, two per byte)
int8_t input[size]; // raw packed 4-bit integers
int8_t out_1[size]; // low nibbles, sign-extended
int8_t out_2[size]; // high nibbles, sign-extended

for (size_t i = 0; i < size; i++) {
    out_1[i] = input[i] << 4;  // move the low nibble to the top of the byte...
    out_1[i] = out_1[i] >> 4;  // ...then arithmetic-shift back down to sign-extend it
    out_2[i] = input[i] >> 4;  // arithmetic shift sign-extends the high nibble
}

I would like to implement this to run as fast as possible on general-purpose processors. Good SIMD implementations of deinterleaving 8-bit values into 16-bit integers exist, such as in VOLK, but I cannot find even basic byte-wise SIMD shift operations.

https://github.com/gnuradio/volk/blob/master/kernels/volk/volk_8ic_deinterleave_16i_x2.h#L63

Thanks!

Derek
  • This may be interesting: https://stackoverflow.com/questions/44011366/avx-4-bit-integers – Vlad Feinstein Jul 31 '20 at 23:14
  • x86 SIMD doesn't have byte shifts, you have to emulate them via 16-bit or wider shifts and mask away bits that came into the top of each byte ([SSE/SIMD shift with one-byte element size / granularity?](https://stackoverflow.com/q/35002937)). This kind of sucks when you want arithmetic shifts to sign-extend; perhaps a different trick to get the high bits set could work for a fixed shift count. Like maybe `xor` with `0xf8` to set the high bits and flip the 4th bit, then `paddb` with `0x08` will correct bit 4 and either carry-out and clear the high bits, or leave them set. – Peter Cordes Aug 01 '20 at 00:02
  • Wait, your question says you need *signed*, but your C++ uses `uint8_t` for everything, not `int8_t`. Unsigned is much easier, just shift and mask. (Shifting twice for the low half is inefficient even if you have byte shifts; AND with `_mm_set1_epi8(0x0f)`) – Peter Cordes Aug 01 '20 at 00:05
  • Updated [SSE/SIMD shift with one-byte element size / granularity?](https://stackoverflow.com/q/35002937) with that idea: 4 uops (for Intel) is better than previous emulations of the non-existent `psrab` (`_mm_srai_epi8`). – Peter Cordes Aug 01 '20 at 00:42
  • With `uint8_t input`, your `out_2` results are still broken. (zero-extended not sign-extended.) You could make it an `int8_t*`, or cast it like `((int8_t)input[i]) >> 4`. That does actually auto-vectorize, fairly well with clang, fairly poorly with GCC: https://godbolt.org/z/zYhff7 – Peter Cordes Aug 01 '20 at 18:05
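
For reference, a minimal sketch of the byte-shift emulation described in the comments above (assumptions: SSE2 intrinsics only and a hypothetical helper name; the fixed shift count of 4 is the one this question needs):

#include <emmintrin.h> // SSE2 intrinsics

// Hypothetical helper: logical right shift by 4 within each byte, emulated with a
// 16-bit shift plus a mask, since x86 SIMD has no per-byte shift instruction.
static inline __m128i srli_epi8_by4(__m128i v)
{
    return _mm_and_si128(_mm_srli_epi16(v, 4), _mm_set1_epi8(0x0F));
}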

1 Answer


Here is an example. Your question contained code that used unsigned operations, but the text asked about signed values, so I was not sure which you wanted. If unsigned is what you want, just remove the part that implements sign extension.

#include <emmintrin.h> // SSE2 intrinsics

const __m128i mm_mask = _mm_set1_epi32(0x0F0F0F0F);
const __m128i mm_signed_max = _mm_set1_epi32(0x07070707);

for (size_t i = 0u, n = size / 16u; i < n; ++i)
{
    // Load and deinterleave input half-bytes
    __m128i mm_input_even = _mm_loadu_si128(reinterpret_cast< const __m128i* >(input) + i);
    __m128i mm_input_odd = _mm_srli_epi32(mm_input_even, 4);

    mm_input_even = _mm_and_si128(mm_input_even, mm_mask);
    mm_input_odd = _mm_and_si128(mm_input_odd, mm_mask);

    // If you need sign extension, you need the following
    // Get the sign bits
    __m128i mm_sign_even = _mm_cmpgt_epi8(mm_input_even, mm_signed_max);
    __m128i mm_sign_odd = _mm_cmpgt_epi8(mm_input_odd, mm_signed_max);

    // Combine sign bits with deinterleaved input
    mm_input_even = _mm_or_si128(mm_input_even, _mm_andnot_si128(mm_mask, mm_sign_even));
    mm_input_odd = _mm_or_si128(mm_input_odd, _mm_andnot_si128(mm_mask, mm_sign_odd));

    // Store the results
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_1) + i, mm_input_even);
    _mm_storeu_si128(reinterpret_cast< __m128i* >(out_2) + i, mm_input_odd);
}

If your size is not a multiple of 16, you also need to handle the tail bytes. You could use your non-vectorized code for that.
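
For example, a tail loop reusing the scalar code from the question (a sketch; it assumes `size` counts the packed input bytes, as above) could look like this:

for (size_t i = size / 16u * 16u; i < size; ++i)
{
    out_1[i] = input[i] << 4;  // low nibble, sign-extended by the shift pair
    out_1[i] = out_1[i] >> 4;
    out_2[i] = input[i] >> 4;  // high nibble, arithmetic shift sign-extends
}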

Note that in the code above you don't need byte-granular shifts, since you have to apply the mask anyway; any coarser-grained shift will do here.
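
For instance (a small illustration only; `v` stands for the 16 loaded input bytes and `mm_mask` is the 0x0F mask from above), a 16-bit and a 32-bit shift give identical results once the mask is applied:

__m128i odd16 = _mm_and_si128(_mm_srli_epi16(v, 4), mm_mask);
__m128i odd32 = _mm_and_si128(_mm_srli_epi32(v, 4), mm_mask);
// Both clear whatever bits crossed a byte boundary, so the shift's element width does not matter.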

Andrey Semashev
  • style note: I find `mm_` in your var names makes it harder to read, too many `mm`s everywhere. When I want to distinguish between a scalar and vector with similar names, I often put a `v` as the first letter of a var name, like `vsign`. But here I'd just use `__m128i input_even = ...` – Peter Cordes Aug 01 '20 at 16:43
  • See [SSE/SIMD shift with one-byte element size / granularity?](https://stackoverflow.com/q/35002937) for a more efficient way to sign-extend the low 4 bits into a byte in 2 instructions, using `pxor` / `paddb` with `set1(0x08)`. (With the upper 4 already zeroed). Should be more efficient than `pcmpgtb` / `pandn` / `por`. Hmm, especially if we do the `pxor` *before* separating into high/low halves! We can `pxor` with `set1_epi8(0x88)` before shifting. – Peter Cordes Aug 01 '20 at 16:45
  • Some versions using that trick, or vpshufb, including one which auto-vectorizes nicely that way with clang: https://godbolt.org/z/sGafhr. Unfortunate GCC missed optimizations make the pshufb version less attractive (doing an extra load of the same data) if you care about GCC. Will post as an answer at some point. – Peter Cordes Aug 02 '20 at 00:38
  • @PeterCordes The `pxor`/`paddb` trick is nice. I don't think moving `pxor` before deinterleaving will be faster because it will just move the same latency in the dependency chain. The two `pxor`s at a later point would be executed in parallel, so they are effectively equivalent to the one `pxor` before deinterleave. But the one `pxor` will probably consume slightly less power and definitely less memory. – Andrey Semashev Aug 02 '20 at 12:54
  • Re code on godbolt, note that you have loop unrolling enabled for clang but not gcc. Also, if you find inefficiencies with gcc, please report them upstream (https://gcc.gnu.org/bugzilla/). – Andrey Semashev Aug 02 '20 at 12:56
  • This bottlenecks on front-end throughput, not latency! Saving front-end uops is how you optimize this! Each 32-byte chunk of input is independent so the only loop-carried dependency chain is the loop counter. Out-of-order exec can overlap across iterations. – Peter Cordes Aug 02 '20 at 12:58
  • re: loop unrolling: no unrolling is the default for gcc (without PGO), unrolling is the default for clang. I had the option there because I had been looking at the clang `-fno-unroll-loops` output to simplify it for easy reading. – Peter Cordes Aug 02 '20 at 13:02
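
For completeness, here is a possible sketch of the variant discussed in the comments above: one `pxor` with `_mm_set1_epi8(0x88)` flips bit 3 of both nibbles in every byte before deinterleaving, so a single subtraction of 8 per half finishes the sign extension. Assumptions: SSE2 only, the same `input`, `out_1`, `out_2` and `size` as in the question, and `size` a multiple of 16.

#include <emmintrin.h> // SSE2 intrinsics

const __m128i mask = _mm_set1_epi8(0x0F);
const __m128i k8 = _mm_set1_epi8(0x08);
const __m128i flip = _mm_set1_epi8(static_cast<char>(0x88));

for (size_t i = 0, n = size / 16; i < n; ++i)
{
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(input) + i);
    v = _mm_xor_si128(v, flip);                              // flip bit 3 of each nibble

    __m128i lo = _mm_and_si128(v, mask);                     // low nibbles, now holding (x ^ 8)
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);  // high nibbles, also (x ^ 8)

    lo = _mm_sub_epi8(lo, k8);                               // (x ^ 8) - 8 == x sign-extended to 8 bits
    hi = _mm_sub_epi8(hi, k8);

    _mm_storeu_si128(reinterpret_cast<__m128i*>(out_1) + i, lo);
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out_2) + i, hi);
}

As the comments note, the potential win here is fewer front-end uops per 16 bytes of input, not a shorter dependency chain.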