I suspect you meant that your `a` and `b` vectors had a 0.0 as the high element, not the low element. So probably you meant `_mm256_setr_pd`, not `set`. I'm going to assume that for the rest of the question.
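(For reference, that assumption looks like the sketch below; the function and variable names are just placeholders.)

#include <immintrin.h>

// _mm256_set_pd takes arguments from high element to low; _mm256_setr_pd from low to high.
__m256d make_vec3(double x0, double x1, double x2) {
    return _mm256_setr_pd(x0, x1, x2, 0.0);   // element 0 = x0, element 3 (the high one) = 0.0
}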
You said your data comes from loads of 3-element arrays written by an external library. (You actually said "3-dimensional" arrays, but that would imply `arr[][][]`.) So maybe you used masked loads to get the 4th element zeroed without reading past the end. Ideally you could make an output buffer for your external library like `alignas(64) double out[8]` and pass `out+0` and `out+3` as pointers so the data you want would already be contiguous, no shuffle needed. Just at worst a store-forwarding stall (extra latency but minimal throughput cost) if you reload it very soon after a narrower store.
But if that won't work, you can still do a masked load to get the `b` elements lined up where they belong, setting up for just a blend or merge-masking, using the unaligned-load hardware to essentially shuffle as part of the load, as mentioned in this answer. This has to be a 64-byte load, so it will definitely be a cache-line split unless the first element happens to start 3 doubles into a cache line.

Since you need to load from before the start of the actual array `b` comes from, you can use masking to make sure that's safe (fault suppression), although if it would have faulted without masking it can be quite slow. Instead of actually blending, it can be a merge-masked load into `a`.
#include <immintrin.h>

__m512d merge_load(const double *array_A, const double *array_B, double c){
    //alignas(32) double arraya[3];
    //double arrayb[3];
    __m256d a = _mm256_maskz_loadu_pd(0b0111, array_A);  // zero-masking load of 3 doubles at the bottom of a YMM.
                // If your array actually has a 0.0 after, you don't need a mask; that's better.

    // zero-extending to 512 is free because we just loaded, not like a function arg.
    __m512d ab = _mm512_mask_loadu_pd(_mm512_zextpd256_pd512(a), 0b00111000, array_B-3);

    // relies on the compiler to optimize away the useless zero-extension in _mm_set_sd
    // clang is fine, GCC isn't. GCC insanely does vmovsd xmm1, xmm0, xmm0 merging into itself, not even a movq zero-extension
    __m512d d = _mm512_mask_broadcastsd_pd(ab, 0b01000000, _mm_set_sd(c));
    return d;
}
This might be slower than narrower loads that don't stall, followed by a couple shuffles. A cache-line split and a store-forwarding stall, if the data was stored recently, might be too much latency for out-of-order exec to hide, depending on surrounding code. And store-forwarding stalls don't pipeline with other SF stalls on Intel, although they do with other loads including fast-path store forwarding. A cache-line split load also has some throughput cost in terms of load-port cycles and split buffers.
Another option would be a masked 256-bit load to merge `bc` and set up for a `vpermt2pd` shuffle. Or an AVX2 `vblendpd` if you can load from `array_B - 1` without masking for fault-suppression.
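That could look something like this (untested like the rest; it assumes `a` came from a zero-masking load so its top element is 0.0, and it uses the merge-masked version of the `array_B - 1` load):

#include <immintrin.h>

__m512d merge_bc_then_permute(__m256d a, const double *array_B, double c)
{
    // {c, b0, b1, b2}: merge-masked load over a broadcast of c,
    // with masking to suppress a possible fault at array_B - 1
    __m256d bc = _mm256_mask_loadu_pd(_mm256_set1_pd(c), 0b1110, array_B - 1);

    // a0 a1 a2 from the first source, b0 b1 b2 c from the second,
    // and a[3] (known 0.0) for the top element
    const __m512i idx = _mm512_setr_epi64(0, 1, 2, 9, 10, 11, 8, 3);
    return _mm512_permutex2var_pd(_mm512_castpd256_pd512(a), idx,
                                  _mm512_castpd256_pd512(bc));
}

This still needs one mask constant for the load plus one vector constant for the shuffle, so the constant-setup cost discussed below still applies.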
Clang compiles the `merge_load` version as-written, with no instructions for the `_mm_set_sd`: only the low element of that `__m128d` is read, so it's pointless to actually zero-extend. (Unlike GCC, which wastes an instruction to do nothing useful, not actually even zero-extending, just merging the same value into garbage with `vmovsd xmm1, xmm0, xmm0`. Unlike `movq`, and unlike the load form, reg-reg `movsd` is a merge.)
This needs 3 separate mask constants, which take multiple uops each to set up. (e.g. `mov`-immediate + `kmov`. Or on Zen 4, GCC chooses to load from memory. On Intel CPUs, `kmov k, mem` is still 2 uops, just like loading into an integer register. Hopefully it's cheaper on AMD.) A shuffle like `vpermt2pd` would work with a vector constant instead of a mask for the final step (see the sketch after the asm below).
# clang15 -O3 -march=skylake-avx512
merge_load(double const*, double const*, double):
mov al, 7
kmovd k1, eax
vmovupd ymm1 {k1} {z}, ymmword ptr [rdi] # maskz_loadu
mov al, 56
kmovd k1, eax
vmovupd zmm1 {k1}, zmmword ptr [rsi - 24] # mask_loadu (merge-masking)
mov al, 64
kmovd k1, eax
vbroadcastsd zmm1 {k1}, xmm0 # mask_broadcastsd merge
vmovapd zmm0, zmm1
ret
If it picked 3 separate registers from `k1..k7`, the masks could stick around for reuse, except across function calls that don't inline; ideally you'd set that work up separately, outside any loop. But if you're calling an external function, there are no call-preserved `k` mask or YMM or ZMM regs, so probably a 64-byte vector constant is best if you have to reload every time; it can stay hot in cache, so just the uop count and port pressure matters.
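As a sketch of that vector-constant idea for the final step (untested; the helper name is just for illustration), the last statement of `merge_load` could instead be a `vpermt2pd` that pulls `c` in from a second source register:

#include <immintrin.h>

// ab = {a0, a1, a2, b0, b1, b2, 0, 0}, as produced by the two masked loads in merge_load.
__m512d insert_c(__m512d ab, double c)
{
    // 8 = element 0 of the 2nd source = c; 7 keeps ab[7] (still 0.0) in the top element
    const __m512i idx = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 8, 7);
    return _mm512_permutex2var_pd(ab, idx, _mm512_castpd128_pd512(_mm_set_sd(c)));
}

That trades a `mov`-immediate + `kmov` pair for a 64-byte constant load that can stay hot in cache, as discussed above.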
Another strategy, if your `a` and `b` vectors are zero-extended to 512-bit, is one shuffle and one cheap `vorpd` that can run on any port. Unfortunately there's no `vblendpd z,z,z, imm`, which would be ideal, using an immediate control instead of a mask register; there are only the AVX1 forms for XMM and YMM (and the legacy SSE4.1 form).
Fully avoiding vector and mask constants would be more expensive. You can `vbroadcastsd y,x` on `c` + AVX1 `vblendpd ymm..., imm` to create `bc = {b0, b1, b2, c}`, but then what? `vinsertf64x4` can get everything into one register, but `vpermpd zmm, zmm, imm8` only does the same 4-element shuffle within each 256-bit half, and there aren't enough immediate bits in an imm8 for an 8-element shuffle. So maybe `valignq` (with itself) to rotate `bc` into place and `vorpd` instead of a blend to combine.
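If I haven't miscounted the rotate, that route would look something like this (untested; it assumes `a` has 0.0 in its top element, as in the question, so the final OR doesn't corrupt anything):

#include <immintrin.h>

// No mask or vector constants: an immediate blend to build bc = {b0, b1, b2, c},
// valignq to rotate it up into the right lanes, then vorpd to combine with a.
__m512d merge_low3_no_constants(__m256d a, __m256d b, double c)
{
    __m256d bc = _mm256_blend_pd(b, _mm256_set1_pd(c), 0b1000);   // {b0, b1, b2, c}

    // {b0,b1,b2,c, 0,0,0,0} rotated by 5 qwords = {0,0,0, b0,b1,b2, c, 0}
    __m512i bc512 = _mm512_castpd_si512(_mm512_zextpd256_pd512(bc));
    __m512d bc_hi = _mm512_castsi512_pd(_mm512_alignr_epi64(bc512, bc512, 5));

    return _mm512_or_pd(_mm512_zextpd256_pd512(a), bc_hi);
}

That's `vbroadcastsd` + `vblendpd` + `valignq` + `vorpd`, four instructions instead of two, which is why the version below looks like a better tradeoff.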
But a good tradeoff appears to be:
// Efficient if a is already zero-extended to 512, and if b's top element is 0.0 (as in the question) so the vorpd doesn't corrupt anything
__m512d merge_low3_manual(__m256d a, __m256d b, double c)
{
const __m512i d_perm_idx = _mm512_setr_epi64(3, 3, 3, 0, 1, 2, 8, 3);
// 0 0 0 b0 b1 b2 c 0 (low element on the left)
__m512d bc = _mm512_permutex2var_pd(_mm512_castpd256_pd512(b), d_perm_idx, _mm512_castpd128_pd512(_mm_set_sd(c)));
//return _mm512_blend_pd(_mm512_castpd256_pd512(a), bc, 0b11111000); // there is no vblendpd z,z,z, imm8
return _mm512_or_pd(_mm512_zextpd256_pd512(a), bc); // a zero-extends for free if the compiler can see where it was created.
}
Silly clang, defeating mov-elimination by picking the same register for `vmovapd ymm0, ymm0` when it can't optimize away the zero-extension.
merge_low3_manual(double __vector(4), double __vector(4), double):
vmovapd zmm3, zmmword ptr [rip + .LCPI0_0] # zmm3 = [3,3,3,0,1,2,8,3]
vpermi2pd zmm3, zmm1, zmm2 # produce bc
vmovapd ymm0, ymm0 # zero extension hopefully goes away on inlining
vorpd zmm0, zmm3, zmm0
ret
So for real, this is probably just the 2 instructions (assuming your vector inputs have already been loaded as `__m256d` without bothering to do any combining as part of those loads). Which is probably fair: merge-masking would take extra instructions to set up masks. But if you need masks for fault-suppression anyway, use merge-masking.
These are all untested, so I may have the wrong shuffle constants.