I suspect you meant that your `a` and `b` vectors had a 0.0 as the high element, not the low element. So probably you meant `_mm256_setr_pd`, not `set`. I'm going to assume that for the rest of the question.
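(For reference, that assumption looks like the sketch below; the function and variable names are just placeholders.)

#include <immintrin.h>

// _mm256_set_pd takes arguments from high element to low; _mm256_setr_pd from low to high.
__m256d make_vec3(double x0, double x1, double x2) {
    return _mm256_setr_pd(x0, x1, x2, 0.0);   // element 0 = x0, element 3 (the high one) = 0.0
}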
You said your data comes from loads of 3-element arrays written by an external library. (You actually said "3-dimensional" arrays, but that would imply `arr[][][]`.) So maybe you used masked loads to get the 4th element zeroed without reading past the end. Ideally you could make an output buffer for your external library like `alignas(64) double out[8]` and pass `out+0` and `out+3` as pointers so the data you want would already be contiguous, no shuffle needed. Just at worst a store-forwarding stall (extra latency but minimal throughput cost) if you reload it very soon after a narrower store.
But if that won't work, you can still do a masked load to get the `b` elements lined up where they belong, setting up for just a blend or merge-masking, using the unaligned-load hardware to essentially shuffle as part of the load, as mentioned in this answer. This has to be a 64-byte load, so it will definitely be a cache-line split unless the first element happens to start 3 doubles into a cache line.

Since you need to load from before the start of the actual array `b` comes from, you can use masking to make sure that's safe (fault suppression), although if it would have faulted without masking it can be quite slow. Instead of actually blending, it can be a merge-masked load into `a`.
#include <immintrin.h>

__m512d merge_load(const double *array_A, const double *array_B, double c){
    //alignas(32) double arraya[3];
    //double arrayb[3];
    __m256d a = _mm256_maskz_loadu_pd(0b0111, array_A);  // zero-masking load of 3 doubles at the bottom of a YMM.
                // If your array actually has a 0.0 after, you don't need a mask; that's better.

    // zero-extending to 512 is free because we just loaded, not like a function arg.
    __m512d ab = _mm512_mask_loadu_pd(_mm512_zextpd256_pd512(a), 0b00111000, array_B-3);

    // relies on the compiler to optimize away the useless zero-extension in _mm_set_sd
    // clang is fine, GCC isn't. GCC insanely does vmovsd xmm1, xmm0, xmm0 merging into itself, not even a movq zero-extension
    __m512d d = _mm512_mask_broadcastsd_pd(ab, 0b01000000, _mm_set_sd(c));
    return d;
}
This might be slower than narrower loads that don't stall, followed by a couple shuffles. A cache-line split and a store-forwarding stall, if the data was stored recently, might be too much latency for out-of-order exec to hide, depending on surrounding code. And store-forwarding stalls don't pipeline with other SF stalls on Intel, although they do with other loads including fast-path store forwarding. A cache-line split load also has some throughput cost in terms of load-port cycles and split buffers.
Another option would be a masked 256-bit load to merge `bc` and set up for a `vpermt2pd` shuffle. Or an AVX2 `vblendpd` if you can load from `array_B - 1` without masking for fault-suppression.
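That could look something like this (untested like the rest; it assumes `a` came from a zero-masking load so its top element is 0.0, and it uses the merge-masked version of the `array_B - 1` load):

#include <immintrin.h>

__m512d merge_bc_then_permute(__m256d a, const double *array_B, double c)
{
    // {c, b0, b1, b2}: merge-masked load over a broadcast of c,
    // with masking to suppress a possible fault at array_B - 1
    __m256d bc = _mm256_mask_loadu_pd(_mm256_set1_pd(c), 0b1110, array_B - 1);

    // a0 a1 a2 from the first source, b0 b1 b2 c from the second,
    // and a[3] (known 0.0) for the top element
    const __m512i idx = _mm512_setr_epi64(0, 1, 2, 9, 10, 11, 8, 3);
    return _mm512_permutex2var_pd(_mm512_castpd256_pd512(a), idx,
                                  _mm512_castpd256_pd512(bc));
}

This still needs one mask constant for the load plus one vector constant for the shuffle, so the constant-setup cost discussed below still applies.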
Clang compiles the `merge_load` version as-written, with no instructions for the `_mm_set_sd`: only the low element of that `__m128d` is read, so it's pointless to actually zero-extend. (Unlike GCC, which wastes an instruction to do nothing useful, not actually even zero-extending, just merging the same value into garbage with `vmovsd xmm1, xmm0, xmm0`. Unlike `movq`, and unlike the load form, reg-reg `movsd` is a merge.)
This needs 3 separate mask constants, which take multiple uops each to set up. (e.g. `mov`-immediate + `kmov`. Or on Zen 4, GCC chooses to load from memory. On Intel CPUs, `kmov k, mem` is still 2 uops, just like loading into an integer register. Hopefully it's cheaper on AMD.) A shuffle like `vpermt2pd` would work with a vector constant instead of a mask for the final step (see the sketch after the asm below).
# clang15 -O3 -march=skylake-avx512
merge_load(double const*, double const*, double):
mov al, 7
kmovd k1, eax
vmovupd ymm1 {k1} {z}, ymmword ptr [rdi] # maskz_loadu
mov al, 56
kmovd k1, eax
vmovupd zmm1 {k1}, zmmword ptr [rsi - 24] # mask_loadu (merge-masking)
mov al, 64
kmovd k1, eax
vbroadcastsd zmm1 {k1}, xmm0 # mask_broadcastsd merge
vmovapd zmm0, zmm1
ret
If it picked 3 separate registers from `k1..k7`, the masks could stick around for reuse, except across function calls that don't inline; ideally you'd set that work up separately, outside any loop. But if you're calling an external function, there are no call-preserved `k` mask or YMM or ZMM regs, so probably a 64-byte vector constant is best if you have to reload every time; it can stay hot in cache, so just the uop count and port pressure matters.
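As a sketch of that vector-constant idea for the final step (untested; the helper name is just for illustration), the last statement of `merge_load` could instead be a `vpermt2pd` that pulls `c` in from a second source register:

#include <immintrin.h>

// ab = {a0, a1, a2, b0, b1, b2, 0, 0}, as produced by the two masked loads in merge_load.
__m512d insert_c(__m512d ab, double c)
{
    // 8 = element 0 of the 2nd source = c; 7 keeps ab[7] (still 0.0) in the top element
    const __m512i idx = _mm512_setr_epi64(0, 1, 2, 3, 4, 5, 8, 7);
    return _mm512_permutex2var_pd(ab, idx, _mm512_castpd128_pd512(_mm_set_sd(c)));
}

That trades a `mov`-immediate + `kmov` pair for a 64-byte constant load that can stay hot in cache, as discussed above.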
Another strategy, if your `a` and `b` vectors are zero-extended to 512-bit, is one shuffle and one cheap `vorpd` that can run on any port. Unfortunately there's no `vblendpd z,z,z, imm`, which would be ideal, using an immediate control instead of a mask register; there are only the AVX1 forms for XMM and YMM (and the legacy SSE4.1 form).
Fully avoiding vector and mask constants would be more expensive. You can `vbroadcastsd y,x` on `c` + AVX1 `vblendpd ymm..., imm` to create `bc = {b0, b1, b2, c}`, but then what? `vinsertf64x4` can get everything into one register, but `vpermpd zmm, zmm, imm8` only does the same 4-element shuffle within each 256-bit half, and there aren't enough immediate bits in an imm8 for an 8-element shuffle. So maybe `valignq` (with itself) to rotate `bc` into place and `vorpd` instead of a blend to combine.
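If I haven't miscounted the rotate, that route would look something like this (untested; it assumes `a` has 0.0 in its top element, as in the question, so the final OR doesn't corrupt anything):

#include <immintrin.h>

// No mask or vector constants: an immediate blend to build bc = {b0, b1, b2, c},
// valignq to rotate it up into the right lanes, then vorpd to combine with a.
__m512d merge_low3_no_constants(__m256d a, __m256d b, double c)
{
    __m256d bc = _mm256_blend_pd(b, _mm256_set1_pd(c), 0b1000);   // {b0, b1, b2, c}

    // {b0,b1,b2,c, 0,0,0,0} rotated by 5 qwords = {0,0,0, b0,b1,b2, c, 0}
    __m512i bc512 = _mm512_castpd_si512(_mm512_zextpd256_pd512(bc));
    __m512d bc_hi = _mm512_castsi512_pd(_mm512_alignr_epi64(bc512, bc512, 5));

    return _mm512_or_pd(_mm512_zextpd256_pd512(a), bc_hi);
}

That's `vbroadcastsd` + `vblendpd` + `valignq` + `vorpd`, four instructions instead of two, which is why the version below looks like a better tradeoff.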
But a good tradeoff appears to be:
// Efficient if a is already zero-extended to 512, and if b's top element is 0.0 (as in the question) so the vorpd doesn't corrupt anything
__m512d merge_low3_manual(__m256d a, __m256d b, double c)
{
const __m512i d_perm_idx = _mm512_setr_epi64(3, 3, 3, 0, 1, 2, 8, 3);
// 0 0 0 b0 b1 b2 c 0 (low element on the left)
__m512d bc = _mm512_permutex2var_pd(_mm512_castpd256_pd512(b), d_perm_idx, _mm512_castpd128_pd512(_mm_set_sd(c)));
//return _mm512_blend_pd(_mm512_castpd256_pd512(a), bc, 0b11111000); // there is no vblendpd z,z,z, imm8
return _mm512_or_pd(_mm512_zextpd256_pd512(a), bc); // a zero-extends for free if the compiler can see where it was created.
}
Silly clang, defeating mov-elimination by picking the same register for `vmovapd ymm0, ymm0` when it can't optimize away the zero-extension.
merge_low3_manual(double __vector(4), double __vector(4), double):
vmovapd zmm3, zmmword ptr [rip + .LCPI0_0] # zmm3 = [3,3,3,0,1,2,8,3]
vpermi2pd zmm3, zmm1, zmm2 # produce bc
vmovapd ymm0, ymm0 # zero extension hopefully goes away on inlining
vorpd zmm0, zmm3, zmm0
ret
So for real, this is probably just the 2 instructions (assuming your vector inputs have already been loaded as `__m256d` without bothering to do any combining as part of those loads). Which is probably fair: merge-masking would take extra instructions to set up masks. But if you need masks for fault-suppression anyway, use merge-masking.
These are all untested, so I may have the wrong shuffle constants.