
I want to merge the elements of two AVX-512 vectors into two other vectors in as few clock cycles as possible.

The problem specifics are as follows:

// inputs
__m512i a = {a0, a1, ..., a31}; // 32x 16-bit int16_t integers
__m512i b = {b0, b1, ..., b31}; // 32x 16-bit int16_t integers

// desired output
__m512i A = {a0 , b0 , a1 , b1 , ..., a15, b15};
__m512i B = {a16, b16, a17, b17, ..., a31, b31};

The naive way is to copy the vectors (a and b) to memory and create vectors (A and B) by direct indexing as below:

union U512i {
    __m512i vec;
    alignas(64) int16_t vals[32];
};

U512i ta = { a };
U512i tb = { b };

U512i A = { _mm512_set_epi16( tb.vals[15], ta.vals[15], ..., tb.vals[0], ta.vals[0] ) };
U512i B = { _mm512_set_epi16( tb.vals[31], ta.vals[31], ..., tb.vals[16], ta.vals[16] ) };

I would also need to do similar merges but with different strides, for example:

// inputs
__m512i a = {a0, a1, ..., a31}; // 32x 16-bit int16_t integers
__m512i b = {b0, b1, ..., b31}; // 32x 16-bit int16_t integers

// desired output
__m512i A = {a0 , a1 , b0 , b1 , ..., a14, a15, b14, b15};
__m512i B = {a16, a17, b16, b17, ..., a30, a31, b30, b31};

What are the most suitable AVX-512 intrinsics to solve this problem? Some explanation would be greatly appreciated as I am a newbie to AVX-512 intrinsics.

Thank you for your help!

caesar
  • Have you looked at `_mm512_mask_blend_epi16` combined with some shuffles? – paddy Sep 10 '20 at 05:18
  • `vpermt2w` can do this in one instruction per output. Or on some CPUs where that costs 3 uops, `vpunpcklwd` + `vpunpckhwd` and then fix up that in-lane interleave with 2x single-uop `vpermt2d` on those results should work for a total of 4 shuffle uops instead of 6. – Peter Cordes Sep 10 '20 at 07:53
  • The version that keeps pairs adjacent is equivalent to 32-bit element granularity, so you can just use single-uop `vpermt2d`. – Peter Cordes Sep 10 '20 at 07:56
  • @PeterCordes, thank you for your suggestions. My CPU (Skylake) supports vpermt2w. While _mm512_mask_permutex2var_epi16 solves my problem, it is a bit slow (7 cycles). In fact, compared with the naive way (by transferring to memory), the performance stayed almost the same. – caesar Sep 10 '20 at 14:11
  • `vpermt2w` is 3 uops and has a *throughput* of one per 2 cycles on SKX. Yes it's not ideal, but the latency of the two independent shuffles to produce A and B can overlap. https://www.uops.info/table.html?search=vpermt2w&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_measurements=on&cb_avx512=on Are you sure a compiler like clang isn't already compiling what you're doing into a shuffle like that? Unless your benchmark is poorly designed, or your real use-case bottlenecks elsewhere, or your compiler already optimized your naive way nicely, there should be room to gain here. – Peter Cordes Sep 10 '20 at 14:37
  • With an actual back-end uop cost of 1p05 + 2p5 for one `vpermt2w`, the back-end bottleneck should be the 4 uops for port 5, same as you'd get with my 2x `vpunpck` + 2x `vpermt2d` idea. Possible there's something even better you could do with a merge-mask to blend and shuffle at the same time with a cheaper shuffle. Ice Lake could also give speedups with its extra shuffle unit, and fast `vpermb`, but you have Skylake-X. – Peter Cordes Sep 10 '20 at 14:40
  • Can you link a version of your naive merge function on https://godbolt.org/? I'm curious to see how it compiles with GCC and clang, but not curious enough to type out all the indices for `__m512i get_A(__m512i a, __m512i b)`. – Peter Cordes Sep 10 '20 at 14:43
  • (Using memory transfer) https://godbolt.org/z/4c8E3e (Using vpermt2w) https://godbolt.org/z/ezrPP6 Compiler: g++ (GCC) 10.1.0 Options: -c -march=skylake-avx512 -mtune=skylake-avx512 -O3 CPU: Intel(R) Xeon(R) Platinum 8170 CPU @ 2.10GHz Microarchitecture: -march = skylake-avx512 OS: ArchLinux (5.7.12-arch1-1) – caesar Sep 11 '20 at 02:14
  • @PeterCordes Did you manage to look at the implementations linked in my previous comment? – caesar Sep 12 '20 at 04:13
  • Only seeing your comments now because you only @user notified me in the 2nd one. "memory transfer" is faster because the input was a compile-time constant. Look at the asm, it just loads the constant result. Unfortunately GCC failed to do constant-propagation through `_mm512_mask_permutex2var_epi16` in that version, so the real work (for the one result that's used) is still there. – Peter Cordes Sep 12 '20 at 09:57
  • Write a function that takes 2 vectors as input and returns A or B; see [How to remove "noise" from GCC/clang assembly output?](https://stackoverflow.com/q/38552116) for an explanation of how to see optimized asm without constant-propagation being a problem. Or show a benchmark loop if you want to see what you're actually timing. – Peter Cordes Sep 12 '20 at 09:58
  • @PeterCordes: I updated the source as you suggested. Now I have a better understanding of what is going on. If you add your answer, I will accept it. – caesar Sep 12 '20 at 15:22
  • Feel free to answer your own question, including whatever intrinsics you came up with based on my initial comment. – Peter Cordes Sep 12 '20 at 15:24

1 Answer


Thanks to the comments above, one way to solve this problem is `vpermt2w`, available as the intrinsic `_mm512_permutex2var_epi16` (or its merge-masked form `_mm512_mask_permutex2var_epi16`).

On Skylake-AVX512 and Ice Lake CPUs (see https://uops.info/), `vpermt2w` decodes to 3 uops, 2 of which can only run on port 5. It has 7-cycle latency and a throughput of one per 2 cycles, but the two independent shuffles that produce A and B can overlap.

The optimized code using `vpermt2w` is as follows:

#include <immintrin.h>

// Results are written through pointers so the compiler can't optimize the work away.
void foo(__m512i *A, __m512i *B, __m512i a, __m512i b) {

    // Index entries 0..31 select 16-bit elements from a, 32..63 from b.
    // _mm512_set_epi16 lists the 32 elements from highest (31) down to lowest (0).
    __m512i idx1 = _mm512_set_epi16( 47, 15, 46, 14, 45, 13, 44, 12, 43, 11, 42, 10, 41, 9, 40, 8,
                                     39, 7, 38, 6, 37, 5, 36, 4, 35, 3, 34, 2, 33, 1, 32, 0 );
    __m512i idx2 = _mm512_set_epi16(
        47 + 16, 15 + 16, 46 + 16, 14 + 16, 45 + 16, 13 + 16, 44 + 16, 12 + 16, 43 + 16, 11 + 16, 42 + 16, 10 + 16, 41 + 16, 9 + 16, 40 + 16, 8 + 16,
        39 + 16, 7 + 16, 38 + 16, 6 + 16, 37 + 16, 5 + 16, 36 + 16, 4 + 16, 35 + 16, 3 + 16, 34 + 16, 2 + 16, 33 + 16, 1 + 16, 32 + 16, 0 + 16 );

    // With an all-ones mask this is equivalent to plain _mm512_permutex2var_epi16.
    *A = _mm512_mask_permutex2var_epi16( a, 0xFFFFFFFF, idx1, b );
    *B = _mm512_mask_permutex2var_epi16( a, 0xFFFFFFFF, idx2, b );
}
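
The second pattern in the question, where adjacent pairs of elements stay together, keeps each 32-bit chunk intact. As Peter Cordes pointed out in the comments, that makes it an interleave at 32-bit element granularity, so single-uop `vpermt2d` (intrinsic `_mm512_permutex2var_epi32`) can do it directly. The following is a sketch based on that comment; the function name and the index constants are my own, derived the same way as above (index entries 0..15 select dwords from a, 16..31 from b):

#include <immintrin.h>

// Sketch: produce A = {a0,a1,b0,b1, ..., a14,a15,b14,b15} and
// B = {a16,a17,b16,b17, ..., a30,a31,b30,b31} by treating each
// adjacent pair of int16_t as one 32-bit element.
void interleave_pairs(__m512i *A, __m512i *B, __m512i a, __m512i b) {
    // _mm512_set_epi32 lists the 16 dword elements from highest (15) down to lowest (0).
    const __m512i idx1 = _mm512_set_epi32(23, 7, 22, 6, 21, 5, 20, 4,
                                          19, 3, 18, 2, 17, 1, 16, 0);
    const __m512i idx2 = _mm512_set_epi32(31, 15, 30, 14, 29, 13, 28, 12,
                                          27, 11, 26, 10, 25, 9, 24, 8);
    // vpermt2d is a single uop on Skylake-AVX512, unlike the 3-uop vpermt2w.
    *A = _mm512_permutex2var_epi32(a, idx1, b);
    *B = _mm512_permutex2var_epi32(a, idx2, b);
}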

The naive way is shown here for reference; it compiles very inefficiently with GCC when the input vectors aren't compile-time constants.

#include <immintrin.h>

union U512i {
    __m512i vec;
    alignas(64) int16_t vals[32];
};

// Same pointer-output signature as above, so the work isn't optimized away.
void foo(__m512i *A, __m512i *B, __m512i a, __m512i b) {

    U512i u_a = { a };
    U512i u_b = { b };
    *A = _mm512_set_epi16 (
            u_b.vals[15], u_a.vals[15], u_b.vals[14], u_a.vals[14],
            u_b.vals[13], u_a.vals[13], u_b.vals[12], u_a.vals[12],
            u_b.vals[11], u_a.vals[11], u_b.vals[10], u_a.vals[10],
            u_b.vals[9], u_a.vals[9], u_b.vals[8], u_a.vals[8],
            u_b.vals[7], u_a.vals[7], u_b.vals[6], u_a.vals[6],
            u_b.vals[5], u_a.vals[5], u_b.vals[4], u_a.vals[4],
            u_b.vals[3], u_a.vals[3], u_b.vals[2], u_a.vals[2],
            u_b.vals[1], u_a.vals[1], u_b.vals[0], u_a.vals[0]
            );

    *B = _mm512_set_epi16 (
            u_b.vals[31], u_a.vals[31], u_b.vals[30], u_a.vals[30],
            u_b.vals[29], u_a.vals[29], u_b.vals[28], u_a.vals[28],
            u_b.vals[27], u_a.vals[27], u_b.vals[26], u_a.vals[26],
            u_b.vals[25], u_a.vals[25], u_b.vals[24], u_a.vals[24],
            u_b.vals[23], u_a.vals[23], u_b.vals[22], u_a.vals[22],
            u_b.vals[21], u_a.vals[21], u_b.vals[20], u_a.vals[20],
            u_b.vals[19], u_a.vals[19], u_b.vals[18], u_a.vals[18],
            u_b.vals[17], u_a.vals[17], u_b.vals[16], u_a.vals[16]
            );

}
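
For the 16-bit interleave, Peter Cordes's comments also suggested an alternative for CPUs where `vpermt2w` costs 3 uops: `vpunpcklwd` + `vpunpckhwd` interleave a and b within each 128-bit lane, and two single-uop `vpermt2d` shuffles then put the lanes into the right order, for 4 total shuffle uops instead of 6. Here is a sketch of that idea; the function name and index constants are mine, derived from the lane layout of the unpack results:

#include <immintrin.h>

// Alternative: 2x vpunpck + 2x vpermt2d = 4 shuffle uops total.
// The unpacks interleave within each 128-bit lane:
//   lo = {a0,b0..a3,b3 | a8,b8..a11,b11 | a16,b16..a19,b19 | a24,b24..a27,b27}
//   hi = {a4,b4..a7,b7 | a12,b12..a15,b15 | a20,b20..a23,b23 | a28,b28..a31,b31}
// Each (aN,bN) pair is now one aligned dword, so vpermt2d can fix the lane order.
void interleave_words_alt(__m512i *A, __m512i *B, __m512i a, __m512i b) {
    __m512i lo = _mm512_unpacklo_epi16(a, b);
    __m512i hi = _mm512_unpackhi_epi16(a, b);
    // Dword index entries 0..15 select from lo, 16..31 from hi.
    const __m512i idxA = _mm512_set_epi32(23, 22, 21, 20, 7, 6, 5, 4,
                                          19, 18, 17, 16, 3, 2, 1, 0);
    const __m512i idxB = _mm512_set_epi32(31, 30, 29, 28, 15, 14, 13, 12,
                                          27, 26, 25, 24, 11, 10, 9, 8);
    *A = _mm512_permutex2var_epi32(lo, idxA, hi);
    *B = _mm512_permutex2var_epi32(lo, idxB, hi);
}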
Peter Cordes
caesar
  • `vpermt2w` has a latency of 7 cycles, but your use-case has instruction-level parallelism. Instructions don't have a single cost in cycles that you can add up, that's not how performance works on out-of-order execution CPUs. Also, it's *not* faster on Ice Lake, still 3 uops, 7 cycle latency. – Peter Cordes Sep 12 '20 at 15:50
  • @PeterCordes Agreed on the ILP and CPI points. According to [this](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=3891,4185,4223&techs=AVX_512&text=vpermt2w), the latency is shown as "-"; do you know what that means? – caesar Sep 12 '20 at 16:02
  • It means that Intel's intrinsics guide isn't sufficiently detailed for real performance analysis, as usual. It mostly only has real info for single-uop instructions. Or maybe it's because that intrinsic can compile to `vpermi2w` or `vpermt2w`; the intrinsics guide also doesn't try to show perf info for intrinsics that don't have an exact 1:1 mapping with asm. This one is always one or the other (unless constant-propagation removes it or it optimizes to something else) but maybe that's part of why Intel left their table incomplete. TL:DR: That Intel guide is not a good source for perf analysis – Peter Cordes Sep 12 '20 at 16:09