
I have implemented an inline function (_mm256_concat_epi16). It concatenates two AVX2 vectors containing 16-bit values. It works fine for the first 8 shift counts, but if I want to use it for the rest of the vector I have to change the implementation. It would be better to have a single inline function in my main program.

The question is: is there any better solution than mine, or any suggestion to make this inline function more general so that it works for all 16 shift counts instead of just 8? My solution concatenates 2 vectors, but only 8 of the 16 possible cases are handled.

**EDIT:** My current solution for this question is to use an unaligned load, which can read from any offset in memory. But when the data is already in registers it might be better to reuse it. However, that might cause a bottleneck on port 5, which issues shuffles, permutes, etc. Still, the throughput might be enough (I haven't tested yet).

#include <stdio.h>
#include <x86intrin.h>

inline void _mm256_print_epi16(__m256i a, char* name){
    short temp[16], i;
    _mm256_storeu_si256((__m256i *) &temp[0], a);
    for(i=0; i<16; i++)
        printf("%s[%d]=%4d , ",name,i+1,temp[i]);
    printf("\n");
}

inline __m256i _mm256_concat_epi16(__m256i a, __m256i  b, const int indx){
    return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,indx*2);
}

int main()
{
    __m256i a = _mm256_setr_epi16(101,102,103,104,105,106,107,108,109,1010,1011,1012,1013,1014,1015,1016);_mm256_print_epi16(a, "a");
    __m256i b = _mm256_setr_epi16(201,202,203,204,205,206,207,208,209,2010,2011,2012,2013,2014,2015,2016);_mm256_print_epi16(b, "b");

    _mm256_print_epi16(_mm256_concat_epi16(a,b,8), "c");//numbers: 0-8
    return 0;
}

The output is:

// icc  -march=native -O3 -D _GNU_SOURCE -o "concat" "concat.c"
[fedora@localhost concatination]$ "./concat"
a[1]= 101 , a[2]= 102 , a[3]= 103 , a[4]= 104 , a[5]= 105 , a[6]= 106 , a[7]= 107 , a[8]= 108 , a[9]= 109 , a[10]=1010 , a[11]=1011 , a[12]=1012 , a[13]=1013 , a[14]=1014 , a[15]=1015 , a[16]=1016 , 
b[1]= 201 , b[2]= 202 , b[3]= 203 , b[4]= 204 , b[5]= 205 , b[6]= 206 , b[7]= 207 , b[8]= 208 , b[9]= 209 , b[10]=2010 , b[11]=2011 , b[12]=2012 , b[13]=2013 , b[14]=2014 , b[15]=2015 , b[16]=2016 , 
c[1]= 109 , c[2]=1010 , c[3]=1011 , c[4]=1012 , c[5]=1013 , c[6]=1014 , c[7]=1015 , c[8]=1016 , c[9]= 201 , c[10]= 202 , c[11]= 203 , c[12]= 204 , c[13]= 205 , c[14]= 206 , c[15]= 207 , c[16]= 208 , 
  • I looked at your code, and since you need your function to be efficient and you want to make it inline, I don't think you can do any better. I'm pretty sure you already know that by changing that `0x21` you could go higher than `8`, but as you already said you can't use the same code and it needs to be slightly modified. I think you should let the preprocessor decide which one to choose and let the compiler inline that one. – m0h4mm4d Jul 26 '17 at 19:58
  • @MohammadArabzadeh, I'm sure there is a much better solution than mine, but thanks. Unfortunately the preprocessor can not help – Amiri Jul 26 '17 at 21:13
  • @MohammadArabzadeh: I don't think you can easily use the preprocessor to select different code based on a macro parameter. You could do something that defined different versions of the macro for different counts, with the count being part of the macro name instead of a parameter. This would be clunky, but might be required for clang (see my answer). – Peter Cordes Jul 27 '17 at 08:07

1 Answer


It's impossible to give a general answer to this question. It's such a short fragment that the best strategy depends on the surrounding code and what CPU you're running on.

Sometimes we can rule out things that have no advantages on any CPU and just consume more of the same resources, but that's not the case when considering a tradeoff between unaligned loads vs. shuffles.


In a loop over a possibly-misaligned input array, you're probably best off using unaligned loads, especially since your input array will likely be aligned at runtime most of the time anyway. If it isn't, and that's a problem, then if possible do an unaligned first vector and then use aligned loads from the first alignment boundary onward, i.e. the usual tricks for a prologue that gets the main loop to an alignment boundary. But with multiple pointers that are misaligned relative to each other, it's usually best to align your store pointer and do unaligned loads (according to Intel's optimization manual). (See Agner Fog's optimization guides and other links in the x86 tag wiki.)
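
As a concrete (untested) sketch of that last point, here is what a trivial 16-bit copy loop might look like with the store pointer aligned and the loads left unaligned. The function name and the n >= 16 assumption are mine, not from the question:

#include <stdint.h>
#include <stddef.h>
#include <x86intrin.h>

// Sketch of the "align the store pointer, load unaligned" pattern.
// Untested; assumes n >= 16, and relies on re-storing the overlapping
// elements being harmless (true for a plain copy).
void copy_epi16(int16_t *dst, const int16_t *src, size_t n)
{
    // First vector: possibly-unaligned load and store.
    _mm256_storeu_si256((__m256i *)dst,
                        _mm256_loadu_si256((const __m256i *)src));

    // Skip forward so dst+i is 32-byte aligned for the main loop.
    size_t i = (32 - ((uintptr_t)dst & 31)) / sizeof(int16_t);
    for (; i + 16 <= n; i += 16)          // aligned stores, unaligned loads
        _mm256_store_si256((__m256i *)(dst + i),
                           _mm256_loadu_si256((const __m256i *)(src + i)));

    if (i < n)                            // overlapping final vector for the tail
        _mm256_storeu_si256((__m256i *)(dst + n - 16),
                            _mm256_loadu_si256((const __m256i *)(src + n - 16)));
}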

On recent Intel CPUs, vector loads that cross a cache-line boundary still have pretty good throughput, but this is one reason why you might consider an ALU strategy, or a mix of shuffles and overlapping loads (in an unrolled loop you might alternate strategies so you don't bottleneck on either one).


As Stephen Canon points out in _mm_alignr_epi8 (PALIGNR) equivalent in AVX2 (a possible duplicate of this), if you need several different offset windows into the same concatenation of two vectors, then two stores + repeated unaligned loads is excellent. On Intel CPUs, you get 2-per-clock throughput for 256b unaligned loads as long as they don't cross a cache-line boundary (so alignas(64) your buffer).

Store/reload is not great for the single-use case, though, because of store-forwarding failure for a load that isn't fully contained within either store. It's still cheap for throughput, but expensive for latency. Another huge advantage of store/reload is that it's efficient with a runtime-variable offset.
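
A minimal sketch of that store + reload idea, assuming a 64-byte-aligned scratch buffer (the function and buffer names are mine, untested):

#include <stdint.h>
#include <stdalign.h>
#include <x86intrin.h>

// Store a and b once, then take as many differently-offset windows as you
// like with cheap unaligned reloads.  alignas(64) keeps every 32-byte reload
// inside a single cache line.
void windows_example(__m256i a, __m256i b)
{
    alignas(64) int16_t buf[32];
    _mm256_store_si256((__m256i *)&buf[0],  a);
    _mm256_store_si256((__m256i *)&buf[16], b);

    // count can be a runtime variable, 0..16.  For 1..15 the reload spans both
    // stores, so store-forwarding fails (latency hit), but throughput stays good.
    unsigned count = 11;
    __m256i c = _mm256_loadu_si256((const __m256i *)&buf[count]);
    (void)c;
}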

If latency is an issue, using ALU shuffles can be good (especially on Intel where lane-crossing shuffles aren't a lot more expensive than in-lane). Again, think about / measure what your loop bottlenecks on, or just try store/reload vs. ALU.


The shuffle strategy:

Your current function can only compile if indx is known at compile time (because palignr needs the byte-shift-count as an immediate).

As @Mohammad suggested, you could pick from different shuffles at compile time, depending on the indx value. He seemed to be suggesting a CPP macro, but that would be ugly.

Much easier to simply use if(indx>=16) or something like that, which will optimize away. (You could make indx a template parameter if a compiler refused to compile your code with an apparently "variable" shift count.) Agner Fog uses this in his Vector Class Library (license=GPL), for functions like template <uint32_t d> static inline Vec8ui divide_by_ui(Vec8ui const & x).

Related: Emulating shifts on 32 bytes with AVX has an answer with different shuffle strategies depending on shift count. But it's only trying to emulate a shift, not a concat / lane-crossing palignr.

vperm2i128 is fast on Intel mainstream CPUs (but still a lane-crossing shuffle so 3c latency), but slow on Ryzen (8 uops with 3c latency/3c throughput). If you were tuning for Ryzen, you'd want to use an if() to figure out a combination of vextracti128 to get a high lane and/or vinserti128 on a low lane. You might also want to use separate shifts and then vpblendd the results together.
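
For instance, a hedged sketch of just the exact-half case (count == 8) tuned for Ryzen, avoiding vperm2i128 (my own illustration, untested):

#include <x86intrin.h>

// Build the "lo(b) over hi(a)" result without vperm2i128: move a's high
// 128-bit lane down with vextracti128, then put b's low lane on top with
// vinserti128.
static inline __m256i concat_half_ryzen(__m256i a, __m256i b)
{
    __m128i hi_a = _mm256_extracti128_si256(a, 1);            // hi(a)
    __m256i lo   = _mm256_castsi128_si256(hi_a);              // hi(a) into the result's low lane
    return _mm256_inserti128_si256(lo, _mm256_castsi256_si128(b), 1);  // lo(b) into the high lane
}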


Designing the right shuffles:

The indx determines where the new bytes for each lane need to come from. Let's simplify by considering 64-bit elements:

 hi |  lo
D C | B A    # a
H G | F E    # b

palignr(b, a, i) forms (H G D C) >> i | (F E B A) >> i
But what we want is

D C | B A    # concatq(b,a,0): no-op.  return a;

E D | C B    # concatq(b,a,1):  applies to 16-bit element counts from 1..7
          low lane needs  hi(a).lo(a)
          high lane needs lo(b).hi(a)
        return palignr(swapmerge(a,b), a, 2*i).  (Where we use vperm2i128 to lane-swap+merge hi(a) and lo(b))
F E | D C    # concatq(b,a,2)
        special case of exactly half reg width: Just use vperm2i128.
        Or on Ryzen, `vextracti128` + `vinserti128`
G F | E D    # concatq(b,a,3): applies to 16-bit element counts from 9..15
        low  lane needs lo(b).hi(a)
        high lane needs hi(b).lo(b).  vperm2i128 -> palignr looks good
        return palignr(b, swapmerge(a,b), 2*i-16).

H G | F E    # concatq(b,a,4): no op: return b;

Interestingly, lo(b) | hi(a) is used in both palignr cases. We never need lo(a) | hi(b) as a palignr input.
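
In intrinsics, that shared swapmerge(a,b) building block is just the vperm2i128 with the same 0x21 immediate the question already uses (helper name taken from the pseudo-notation above):

// result's low lane = hi(a), result's high lane = lo(b)
static inline __m256i swapmerge(__m256i a, __m256i b) {
    return _mm256_permute2x128_si256(a, b, 0x21);
}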

These design notes lead directly to this implementation:

// UNTESTED
// clang refuses to compile this, but gcc works.

// in many cases won't be faster than simply using unaligned loads.
#include <assert.h>      // for assert()
#include <immintrin.h>   // AVX2 intrinsics

static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b, unsigned int count) {
   if (count == 0)
     return a;
   else if (count <= 7)
     return _mm256_alignr_epi8(_mm256_permute2x128_si256(a,b,0x21),a,count*2);
   else if (count == 8)
      return _mm256_permute2x128_si256(a,b,0x21);
   else if (count > 8 && count <= 15)
     // clang chokes on the negative shift count even when this branch is not taken
     return _mm256_alignr_epi8(b,_mm256_permute2x128_si256(a,b,0x21),count*2 - 16);
   else if (count == 16)
     return b;
   else
     assert(0 && "out-of-bounds shift count");

// can't get this to work without C++ constexpr :/
//   else
//     static_assert(count <= 16, "out-of-bounds shift count");
}

I put it on the Godbolt compiler explorer with some test functions that inline it with different constant shift counts. gcc6.3 compiles it to

test_alignr0:
    ret            # a was already in ymm0
test_alignr3:
    vperm2i128      ymm1, ymm0, ymm1, 33   # replaces b
    vpalignr        ymm0, ymm1, ymm0, 6
    ret
test_alignr8:
    vperm2i128      ymm0, ymm0, ymm1, 33
    ret
test_alignr11:
    vperm2i128      ymm0, ymm0, ymm1, 33   # replaces a
    vpalignr        ymm0, ymm1, ymm0, 6
    ret
test_alignr16:
    vmovdqa ymm0, ymm1
    ret

clang chokes on it. First, it says error: argument should be a value from 0 to 255 about the count*2 - 16, even for counts that don't use that branch of the if/else chain.

Also, it can't wait and see that the alignr() count ends up being a compile-time constant: error: argument to '__builtin_ia32_palignr256' must be a constant integer, even though it is constant after inlining. You can solve that in C++ by making count a template parameter:

template<unsigned int count>
static inline __m256i lanecrossing_alignr_epi16(__m256i a, __m256i  b) {
   static_assert(count<=16, "out-of-bounds shift count");
   ...

In C, you could make it a CPP macro instead of a function to deal with that.

The count*2 - 16 problem is harder to solve for clang. You could make the shift count part of the macro name, like CONCAT256_EPI16_7. There's probably some CPP trickery you could use to do the 1..7 versions and the 9..15 versions separately. (Boost has some crazy CPP hacks.)
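
A rough sketch of that per-count macro idea (untested; only the CONCAT256_EPI16_7 name comes from the text above, the SWAPMERGE256 helper and the other names are made up the same way):

// One macro per shift count, so the palignr immediate is always a literal
// constant expression by the time clang sees it.
#define SWAPMERGE256(a,b)        _mm256_permute2x128_si256((a), (b), 0x21)

#define CONCAT256_EPI16_7(a,b)   _mm256_alignr_epi8(SWAPMERGE256(a,b), (a), 7*2)
#define CONCAT256_EPI16_8(a,b)   SWAPMERGE256(a,b)
#define CONCAT256_EPI16_11(a,b)  _mm256_alignr_epi8((b), SWAPMERGE256(a,b), 11*2 - 16)
// ...and so on for the other counts you actually need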


BTW, your print function is weird. It calls the first element c[1] instead of c[0]. Vector indices start at 0 for shuffles, so it's really confusing.

  • yeah, i found the print function kinda weird too. And great answer btw. I'd up-vote it more than once if I could. – m0h4mm4d Jul 27 '17 at 08:25
  • @Peter Cordes, Love your answers so much. The print function is nothing special. It is called for c[0] but shows it as c[1], just something lazy... The print function prints from the first element c[0], doesn't it? Only the printing is weird: instead of 0-15 I print 1-16 – Amiri Jul 27 '17 at 08:44
  • @Martin: Yeah, the print function passes `i+1` to printf instead of `i`, because whoever wrote it disagrees with me (and Intel) that vectors count from 0. – Peter Cordes Jul 27 '17 at 08:47
  • Let us check: `_mm256_storeu_si256((__m256i *) &temp[0], a);` stores vector `a` to the `temp` array, elements 0-15. 1st step OK? Then `printf("%s[%d]=%4d , ",name,i+1,temp[i]);` prints temp[0] to temp[15]. 2nd OK? Everything is alright and no data is lost? – Amiri Jul 27 '17 at 12:12
  • The only difference is that it prints `temp[0]` and shows it as `a[1]` ... I use it to be compatible with some of my documents – Amiri Jul 27 '17 at 12:14
  • @Martin: Yeah it prints everything. My only point is that your function (and documents) should number vectors from 0 (that's why I said "confusing" rather than "buggy"). It's obviously doing it that way on purpose. – Peter Cordes Jul 27 '17 at 12:21
  • @PeterCordes, Thanks. Yeah, I'll take that into account – Amiri Jul 27 '17 at 12:23
  • 2
  • Back in January, I was doing this exact task for run-time shifts. For 32-bit granularity with AVX2, my solution was to implement a rotation using `_mm256_permutevar8x32_epi32()` followed by `_mm256_blendv_epi8()`. For byte and bit granularity, I combined the permute vector with a bit-rotation and used a bit-granular blend. On AVX512, word-aligned rotations are still only 1 instruction. But there are now single instructions for rotation and bit-wise blend (`vpternlog`). But in all cases, the amount of setup overhead needed is high since you need to build the blend masks and permute vectors. – Mysticial Jul 31 '17 at 19:10