I want to merge (interleave) the elements of two AVX-512 vectors into two other vectors in as few clock cycles as possible.
The specifics of the problem are as follows:
// inputs
__m512i a = {a0, a1, ..., a31}; // 32x 16-bit int16_t integers
__m512i b = {b0, b1, ..., b31}; // 32x 16-bit int16_t integers
// desired output
__m512i A = {a0 , b0 , a1 , b1 , ..., a15, b15};
__m512i B = {a16, b16, a17, b17, ..., a31, b31};
The naive way is to copy the vectors (a and b) to memory and build the vectors (A and B) by direct indexing, as below:
#include <immintrin.h> // __m512i, _mm512_set_epi16
#include <cstdint>     // int16_t
union U512i {
__m512i vec;
alignas(64) int16_t vals[32];
};
U512i ta = { a };
U512i tb = { b };
// _mm512_set_epi16 takes its arguments from the highest element (31) down to element 0
__m512i A = _mm512_set_epi16( tb.vals[15], ta.vals[15], ... tb.vals[0], ta.vals[0] );
__m512i B = _mm512_set_epi16( tb.vals[31], ta.vals[31], ... tb.vals[16], ta.vals[16] );
I also need to do similar merges with different strides, for example alternating groups of two elements:
// inputs
__m512i a = {a0, a1, ..., a31}; // 32x 16-bit int16_t integers
__m512i b = {b0, b1, ..., b31}; // 32x 16-bit int16_t integers
// desired output
__m512i A = {a0 , a1 , b0 , b1 , ..., a14, a15, b14, b15};
__m512i B = {a16, a17, b16, b17, ..., a30, a31, b30, b31};
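To pin down exactly what I mean, both layouts can be written as one scalar loop. This is only a sketch of the desired result, not a fast implementation. `Lanes`, `Merged`, `merge_reference` and the `group` parameter are names I made up for illustration: `group = 1` should give the first layout and `group = 2` the second.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// One 512-bit vector viewed as 32 x int16_t (illustrative stand-in for __m512i).
using Lanes = std::array<std::int16_t, 32>;

struct Merged {
    Lanes A;  // built from the low halves  a[0..15],  b[0..15]
    Lanes B;  // built from the high halves a[16..31], b[16..31]
};

// Hypothetical scalar reference (not an intrinsic): alternate groups of
// `group` consecutive elements taken from a and then from b.
Merged merge_reference(const Lanes& a, const Lanes& b, std::size_t group) {
    assert(group > 0 && 16 % group == 0);  // groups must tile a 16-element half
    Merged out{};
    for (std::size_t i = 0; i < 32; ++i) {
        const std::size_t block  = i / (2 * group);  // which (a-group, b-group) pair
        const std::size_t within = i % (2 * group);  // position inside that pair
        const bool from_b = within >= group;         // second half of the pair comes from b
        const std::size_t src = block * group + within % group;  // source index, 0..15
        out.A[i] = from_b ? b[src]      : a[src];
        out.B[i] = from_b ? b[src + 16] : a[src + 16];
    }
    return out;
}
```

A vectorized version should produce the same `A` and `B` as `merge_reference(a, b, 1)` and `merge_reference(a, b, 2)` for any inputs, so this can also serve as a correctness check.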
What are the most suitable AVX-512 intrinsics to solve this problem? Some explanation would be greatly appreciated as I am a newbie to AVX-512 intrinsics.
Thank you for your help!