That's a weird shuffle initializer; use `_mm512_set_epi64` since you're using it with a `_pd` shuffle that interprets the elements as 64-bit integers, not `epi32`.
1 single-uop shuffle is fine, and isn't a problem for back-end port pressure on port 5 unless your surrounding code has lots of shuffles. (And if it does, you're out of luck since Intel CPUs only run 512-bit shuffles on port 5).
Out-of-order exec can hide the latency, and you can't do any better anyway.
Even if `r` came from a load (instead of another calculation), I don't think there's any scope for using only in-lane shuffles. e.g. starting with a 128-bit broadcast-load doesn't work because you need elements 2 and 3, not just 0 and 1.
Some shuffles like `vshufpd ymm` can run on p1/p5 on Ice Lake and newer, but that doesn't help for 512-bit shuffles; the vector ALUs on port 1 are shut down while 512-bit uops are in flight. So any shuffle will be at best 1c throughput (which is fine: with 2 multiplies per shuffle, you aren't bottlenecked on shuffle ports in the back-end).
You need a lane-crossing shuffle (since you can't do a 128-bit broadcast and then `vshufpd` or `vpermilps/pd`), so it has to be 3c latency, but out-of-order exec can hide that latency unless it's on the critical path of a long (loop-carried) dependency chain.
Semi-related in general: Do 128bit cross lane operations in AVX512 give better performance?
If you had lots of spare front-end bandwidth but fully bottleneck on back-end ALU execution ports, you could maybe get the shuffle done with two `vbroadcastsd` loads, the 2nd merge-masking. The first load could be just 256-bit, from `ptr+16` where `char *ptr` points at the start of `r` (or where you would have loaded it from), so broadcasting `r[2]`.
Except this plan doesn't work at all because masked broadcast-loads need an ALU uop, so that would take a port-5 uop as well as two p2/p3 uops on Ice Lake for example. https://uops.info/ shows Intel and AMD (Zen 4) both work this way, so we can't relieve back-end ALU port pressure with broadcast-loads of separate scalar elements + merge-masking instead of shuffling.
`vinsertf64x4` doesn't help either.
If the elements you want are adjacent like here, a 128-bit broadcast load can get the element you want into each 128-bit lane. That would set up for `vpermilpd` (https://www.felixcloutier.com/x86/vpermilpd), which can use a shuffle constant like `0xf0` to get the higher element in each of the upper 4 doubles, and the lower element in each of the lower 4 doubles.
That's 1c latency instead of 3c, but is only viable if you already had the source in memory. A store/reload would introduce more latency.
If you're ever taking elements from two separate vectors, there's `vpermt2pd` (or `vpermt2d` for 32-bit elements), which has two 512-bit inputs and one 512-bit output; the control vector can make each output element pull data from any element of either input.