
In SSSE3, the PALIGNR instruction performs the following:

PALIGNR concatenates the destination operand (the first operand) and the source operand (the second operand) into an intermediate composite, shifts the composite at byte granularity to the right by a constant immediate, and extracts the right-aligned result into the destination.

I'm currently in the midst of porting my SSE4 code to use AVX2 instructions, working on 256-bit registers instead of 128-bit ones. Naively, I believed that the intrinsic _mm256_alignr_epi8 (VPALIGNR) performs the same operation as _mm_alignr_epi8, only on 256-bit registers. Sadly, however, that is not exactly the case. In fact, _mm256_alignr_epi8 treats the 256-bit register as two 128-bit registers and performs two "align" operations on the neighboring 128-bit halves, effectively performing the same operation as _mm_alignr_epi8 but on two registers at once. It's most clearly illustrated here: _mm256_alignr_epi8
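
For concreteness, here is a minimal standalone demo (not part of my original code; compile with AVX2 enabled, e.g. gcc -mavx2) that makes the per-lane behaviour visible:

#include <immintrin.h>
#include <stdio.h>

/* _mm256_alignr_epi8 performs two independent 16-byte alignr operations,
 * one per 128-bit lane, rather than one 32-byte shift. */
int main(void)
{
    unsigned char a[32], b[32], r[32];
    for (int i = 0; i < 32; i++) {
        a[i] = (unsigned char)i;          /* 0..31    */
        b[i] = (unsigned char)(100 + i);  /* 100..131 */
    }

    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vr = _mm256_alignr_epi8(va, vb, 1);
    _mm256_storeu_si256((__m256i *)r, vr);

    /* Prints 101..115 0 (lane 0) and then 117..131 16 (lane 1):
     * each lane is shifted on its own, so byte 16 of b never reaches lane 0. */
    for (int i = 0; i < 32; i++)
        printf("%d ", r[i]);
    printf("\n");
    return 0;
}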

Currently my solution is to keep using _mm_alignr_epi8 by splitting the ymm (256-bit) registers into two xmm (128-bit) registers (high and low), like so:

__m128i xmm_ymm1_lo = _mm256_extractf128_si256(ymm1, 0);  // low 128 bits of ymm1
__m128i xmm_ymm1_hi = _mm256_extractf128_si256(ymm1, 1);  // high 128 bits of ymm1
__m128i xmm_ymm2_lo = _mm256_extractf128_si256(ymm2, 0);  // low 128 bits of ymm2
__m128i xmm_ymm_aligned_lo = _mm_alignr_epi8(xmm_ymm1_hi, xmm_ymm1_lo, 1);
__m128i xmm_ymm_aligned_hi = _mm_alignr_epi8(xmm_ymm2_lo, xmm_ymm1_hi, 1);
__m256i xmm_ymm_aligned = _mm256_set_m128i(xmm_ymm_aligned_hi, xmm_ymm_aligned_lo);

This works, but there has to be a better way, right? Is there perhaps a more "general" AVX2 instruction that I should be using to get the same result?

eladidan

3 Answers


What are you using palignr for? If it's only to handle data misalignment, simply use misaligned loads instead; they are generally "fast enough" on modern Intel µ-architectures (and will save you a lot of code size).
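
For example, rather than loading two aligned vectors and stitching them together with palignr, just load the window you actually want; a minimal sketch, with a hypothetical load_window helper:

#include <immintrin.h>
#include <stddef.h>

/* Load the 32-byte window starting `offset` bytes into `buf`.
 * One misaligned load replaces an aligned-load + palignr pair. */
static inline __m256i load_window(const unsigned char *buf, size_t offset)
{
    return _mm256_loadu_si256((const __m256i *)(buf + offset));
}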

If you need palignr-like behavior for some other reason, you can simply take advantage of the unaligned load support to do it in a branch-free manner. Unless you're totally load-store bound, this is probably the preferred idiom.

static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
    // Do whatever your compiler needs to make this buffer 64-byte aligned
    // (alignas(64) is the C11/C++11 spelling). You want to avoid the
    // possibility of a page-boundary crossing load.
    alignas(64) char buffer[64];

    // Two aligned stores to fill the buffer: v0 becomes the low 32 bytes
    // of the 64-byte window, v1 the high 32 bytes.
    _mm256_store_si256((__m256i *)&buffer[0], v0);
    _mm256_store_si256((__m256i *)&buffer[32], v1);

    // Misaligned load to get the data we want (0 <= n <= 32; n need not
    // be a compile-time constant).
    return _mm256_loadu_si256((__m256i *)&buffer[n]);
}
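
A possible usage sketch (src and shift are hypothetical names): the first argument supplies the low 32 bytes of the window and the second the high 32 bytes, and, unlike the palignr immediate, the byte count may be a runtime variable:

__m256i v_low  = _mm256_loadu_si256((const __m256i *)src);        // bytes 0..31
__m256i v_high = _mm256_loadu_si256((const __m256i *)(src + 32)); // bytes 32..63
__m256i window = _mm256_alignr_epi8(v_low, v_high, shift);        // bytes shift..shift+31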

If you can provide more information about how exactly you're using palignr, I can probably be more helpful.

Stephen Canon
  • The latency won't be very good, because the load will have an extra ~10 cycles of latency from a store-forwarding stall on Intel CPUs. IDK if store-forwarding stalls are a throughput problem, though. They may not be. – Peter Cordes Aug 07 '17 at 08:24
  • @PeterCordes: There's no throughput hazard, only latency. The approach sketched out here makes sense in situations where the store can be hoisted to hide the latency or the stored data can be re-used to extract a variety of different alignments. Of course, we have two-source shuffles in AVX-512, which are usually a better solution. – Stephen Canon Aug 25 '17 at 18:15
  • Oh good point, this is excellent for generating different windows onto the same two vectors. It's also good for a runtime-variable shift count. – Peter Cordes Aug 25 '17 at 18:19
  • I’ve dealt with this by just using palignr ymm,ymm,imm as is and deal with the crazed data scramble fallout, by just leaving it in the crazed order and either sort it out in the LSU with 16b memory accesses or with a ton of comments to document where everything ends up and patch it up at a convenient moment. Sometimes the crazy behavior in a pack instruction undoes the crazy behavior in an earlier instruction, and it all works out. Just not solving the problem in this way certainly convolutes your code, but vector lanes do not necessarily need to be contiguous data, especially for AVX2. – Ian Ollmann Aug 31 '23 at 07:20
  • In certain cases, the right answer is to just unroll your SSE loop x2 and do all your vector loads and stores high/low 16b at a time into ymm, and have the even loop iterations in the low half of the vector and the odd loop iterations in the high half. In my opinion, AVX2 is best thought of as a vector of vectors and not a bigger vector. – Ian Ollmann Aug 31 '23 at 07:20

We need two instructions, "vperm2i128" and "vpalignr", to extend "palignr" to 256 bits.

See: https://software.intel.com/en-us/blogs/2015/01/13/programming-using-avx2-permutations
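
A minimal sketch of that combination (the macro name and the restriction to a compile-time constant 0 < N < 16 are assumptions; counts of 16 and above need the same trick one lane higher, as in the answer below): vperm2i128 first builds a vector whose low lane is the high half of lo and whose high lane is the low half of hi, and the per-lane vpalignr then picks up exactly the bytes a true 256-bit alignr of hi:lo would:

#include <immintrin.h>

/* Bytes N..N+31 of the 64-byte composite hi:lo (hi = upper 32 bytes),
 * for a compile-time constant 0 < N < 16. */
#define ALIGNR256(hi, lo, N) \
    _mm256_alignr_epi8(_mm256_permute2x128_si256((lo), (hi), 0x21), (lo), (N))

For example, ALIGNR256(ymm2, ymm1, 1) yields bytes 1..32 of the composite ymm2:ymm1, which is the 1-byte shift computed in the question.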

user1649948

The only solution I was able to come up with for this is:

// v0 supplies the low 32 bytes of the 64-byte window, v1 the high 32 bytes;
// n must be a compile-time constant in [0, 32].
static inline __m256i _mm256_alignr_epi8(const __m256i v0, const __m256i v1, const int n)
{
  if (n < 16)
  {
    __m128i v0_lo = _mm256_extractf128_si256(v0, 0);   // low 128 bits of v0
    __m128i v0_hi = _mm256_extractf128_si256(v0, 1);   // high 128 bits of v0
    __m128i v1_lo = _mm256_extractf128_si256(v1, 0);   // low 128 bits of v1
    __m128i vout_lo = _mm_alignr_epi8(v0_hi, v0_lo, n);
    __m128i vout_hi = _mm_alignr_epi8(v1_lo, v0_hi, n);
    __m256i vout = _mm256_set_m128i(vout_hi, vout_lo);
    return vout;
  }
  else
  {
    __m128i v0_hi = _mm256_extractf128_si256(v0, 1);   // high 128 bits of v0
    __m128i v1_lo = _mm256_extractf128_si256(v1, 0);   // low 128 bits of v1
    __m128i v1_hi = _mm256_extractf128_si256(v1, 1);   // high 128 bits of v1
    __m128i vout_lo = _mm_alignr_epi8(v1_lo, v0_hi, n - 16);
    __m128i vout_hi = _mm_alignr_epi8(v1_hi, v1_lo, n - 16);
    __m256i vout = _mm256_set_m128i(vout_hi, vout_lo);
    return vout;
  }
}

which I think is pretty much identical to your solution except it also handles shifts of >= 16 bytes.

Paul R
  • yup, it's the same solution. but if this is the only way then it seems like a big oversight by the designers of AVX2 instructions – eladidan Dec 15 '11 at 10:07
  • I couldn't get this to compile... I get the compilation error: "catastrophic error: Intrinsic parameter must be an immediate value" on the following line: "__m128i vout_lo = _mm_alignr_epi8(v0_hi, v0_lo, n);". Supposedly, because n is not an immediate. How were you able to bypass this? I'm using the Intel C++ Compiler – eladidan Dec 15 '11 at 11:10
  • It works for me, so long as n is a compile-time constant - I'm using the Intel ICC compiler too, but compiling as C rather than C++ if that makes any difference, and it also works for me with gcc. – Paul R Dec 15 '11 at 14:51
  • I haven't looked closely at AVX2 yet - is there anything there which might help (albeit not until 2013)? – Paul R Dec 15 '11 at 14:52
  • It's not really an oversight; it's just the way that AVX was implemented. Most instructions treat the 256-bit register like two independent 128-bit registers. I think it allowed migration and backward compatibility with SSE to be implemented more easily. – Jason R Dec 16 '11 at 03:47
  • Continuing what Jason wrote, `palignr` was a really half-baked approach to handling misaligned data (because the shift amount was an immediate, not supplied from a register). Intel seems to have realized that and simply made misaligned data access fast enough that it's (mostly) no longer an issue. – Stephen Canon Dec 26 '11 at 16:18
  • @Stephen: even with an immediate shift amount palignr is still very useful for neighbourhood operations such as filters - having to perform multiple misaligned loads just to get shifted versions of registers is not very efficient, even if misaligned loads have zero overhead. – Paul R Dec 27 '11 at 15:41
  • @PaulR: For that, you can still use `palignr` independently on the two halves of an AVX register, and use the same algorithm as you would with SSE -- just do two independent batches of work in the two halves of the register. I agree that it would be nice to have the full 32B shift, but it apparently is not justifiable in terms of area/power/complexity. It's entirely possible that Intel *could* have added the operation, but it would have been *less* efficient than the misaligned load solution. Given that the load bandwidth was doubled in Sandy Bridge, it's a very reasonable workaround. – Stephen Canon Dec 27 '11 at 15:48