That's a weird shuffle initializer; use `_mm512_set_epi64` since you're using it with a `_pd` shuffle that interprets the elements as 64-bit integers, not `epi32`.
1 single-uop shuffle is fine, and isn't a problem for back-end port pressure on port 5 unless your surrounding code has lots of shuffles. (And if it does, you're out of luck since Intel CPUs only run 512-bit shuffles on port 5).
Out-of-order exec can hide the latency, and you can't do any better anyway.
Even if `r` came from a load (instead of another calculation), I don't think there's any scope for using only in-lane shuffles. e.g. starting with a 128-bit broadcast-load doesn't work because you need elements 2 and 3, not just 0 and 1.
Some shuffles like `vshufpd ymm` can run on p1/p5 on Ice Lake and newer, but that doesn't help for 512-bit shuffles; the vector ALUs on port 1 are shut down while 512-bit uops are in flight. So any shuffle will be at best 1c throughput (which is fine: with 2 multiplies per shuffle, you aren't bottlenecked on shuffle ports in the back-end).
You need a lane-crossing shuffle (since you can't do a 128-bit broadcast and then `vshufpd` or `vpermilps/pd`), so it has to be 3c latency, but out-of-order exec can hide that latency unless it's on the critical path of a long (loop-carried) dependency chain.
Semi-related in general: Do 128bit cross lane operations in AVX512 give better performance?
If you had lots of spare front-end bandwidth but fully bottleneck on back-end ALU execution ports, you could maybe get the shuffle done with two `vbroadcastsd` loads, the 2nd merge-masking. The first load could be just 256-bit, from `ptr+16` where `char *ptr` points at the start of `r` (or where you would have loaded it from), so broadcasting `r[2]`.
Except this plan doesn't work at all because masked broadcast-loads need an ALU uop, so that would take a port-5 uop as well as two p2/p3 uops on Ice Lake for example. https://uops.info/ shows Intel and AMD (Zen 4) both work this way, so we can't relieve back-end ALU port pressure with broadcast-loads of separate scalar elements + merge-masking instead of shuffling.
`vinsertf64x4` doesn't help either.
If the elements you want are adjacent like here, a 128-bit broadcast load can get the element you want into each 128-bit lane. That would set up for `vpermilpd` (https://www.felixcloutier.com/x86/vpermilpd), which can use a shuffle constant like `0xf0` to get the higher element in each of the upper 4 doubles, and the lower element in each of the lower 4 doubles.
That's 1c latency instead of 3c, but is only viable if you already had the source in memory. A store/reload would introduce more latency.
If you're ever taking elements from two separate vectors, there's `vpermt2pd` (or `vpermt2d` for 32-bit elements), which has two 512-bit inputs and one 512-bit output; the control vector can make each output element pull data from any element of either input.