
Is there an intrinsic or another efficient way to repack the high/low 32-bit components of the 64-bit elements of an AVX register into an SSE register? A solution using AVX2 is OK.

So far I'm using the following code, but the profiler says it's slow on a Ryzen 1800X:

// Global constant
const __m256i gHigh32Permute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);

// ...

// function code
__m256i x = /* computed here */;
const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(x,
  gHigh32Permute)); // This seems to take 3 cycles
  • So you want to extract the odd or even-numbered 32-bit elements? i.e. like AVX512 `_mm256_cvtepi64_epi32` (`vpmovqd`)? I don't think you're going to beat 1 shuffle instruction with 3-cycle latency, because lane-crossing shuffles always have 3c latency on Intel CPUs. Your `vpermd` solution has single-cycle throughput. – Peter Cordes Aug 24 '17 at 17:24
  • If you need it to be faster, you're going to have to make the surrounding code use it less, or not require lane-crossing or something! Or maybe somehow pack two sources into a 256b result with `shufps` (except it's not lane-crossing so it doesn't solve your problem, and there's no `vpackqd` instruction and pack instructions aren't lane-crossing either.) – Peter Cordes Aug 24 '17 at 17:27
  • @PeterCordes, yes, I want to extract odd- or even-numbered 32-bit elements from a 256-bit register to a 128-bit register. Thanks for the reference to AVX512! I don't have it on Ryzen 1800X, but I'm looking forward to migrating to it once I can... These 32-bit elements are the high and low parts of 64-bit doubles, so I don't see a way to change the surrounding code. – Serge Rogatch Aug 24 '17 at 18:08
  • Well do they have to be in a `__m128i`, or can you use an in-lane shuffle to put the low and high halves into the bottom 2 elements of each lane of a `__m256i`? If you're tuning for Ryzen, it probably does make sense to get it down to 128b, though. But maybe `vextractf128` and then use a 2-source shuffle (like `shufps`) will be better on Ryzen, where lane-crossing shuffles are very slow. – Peter Cordes Aug 24 '17 at 18:15

1 Answer


That shuffle+cast with _mm256_permutevar8x32_ps is optimal for one vector on Intel and Zen 2 or later. One single-uop instruction is the best you can get. (It's two uops on AMD Zen 2 and Zen 3, and back to one uop on Zen 4; see https://uops.info/.)

Use vpermps instead of vpermd to avoid any risk of an int / FP bypass delay if your input vector was created by a pd instruction rather than a load or something. Using the result of an FP shuffle as an input to an integer instruction is usually fine on Intel (I'm less sure about feeding the result of an FP instruction to an integer shuffle).
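As a concrete sketch (the helper name and the assumption that the input is a __m256d are mine, not from the question), the FP-domain version looks like this:

#include <immintrin.h>

// Sketch: the question's permute, kept in the FP domain so it assembles to vpermps.
// Assumes the source data is a __m256d produced by double-precision math.
static inline __m128 high_halves_vpermps(__m256d x)
{
    const __m256i highPermute = _mm256_set_epi32(0, 0, 0, 0, 7, 5, 3, 1);
    __m256 shuffled = _mm256_permutevar8x32_ps(_mm256_castpd_ps(x), highPermute); // vpermps
    return _mm256_castps256_ps128(shuffled);  // the cast costs no instructions
}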

If tuning for Intel, you can change the surrounding code so that you only need to shuffle into the bottom 64 bits of each 128-bit lane; that avoids a lane-crossing shuffle. (Then you can just use vshufps ymm, or, if tuning for KNL, vpermilps since 2-input vshufps is slower.)
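For example, a minimal sketch of that in-lane option, assuming the surrounding code can consume results that stay grouped per 128-bit lane (the two-input helper is my illustration, not code from the answer):

#include <immintrin.h>

// One in-lane vshufps ymm packs the high 32-bit halves of the 64-bit elements
// of two source vectors.  Per 128-bit lane the result is [a.hi0, a.hi1, b.hi0, b.hi1],
// so nothing has to cross a lane boundary.
static inline __m256 high_halves_inlane(__m256d a, __m256d b)
{
    return _mm256_shuffle_ps(_mm256_castpd_ps(a), _mm256_castpd_ps(b),
                             _MM_SHUFFLE(3, 1, 3, 1));  // vshufps ymm, ymm, ymm, imm8
}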

With AVX512, there's _mm256_cvtepi64_epi32 (vpmovqd) which packs elements across lanes, with truncation.
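A hedged sketch of that, assuming AVX-512F + AVX-512VL are available (they aren't on the asker's Ryzen 1800X); the high-halves variant with a preceding shift is my addition:

#include <immintrin.h>

// vpmovqd truncates each 64-bit element to 32 bits, i.e. it extracts the low
// halves across lanes in a single instruction.
static inline __m128i low_halves_avx512(__m256i x)
{
    return _mm256_cvtepi64_epi32(x);                         // vpmovqd
}

// For the high halves, shift them down into the low 32 bits first.
static inline __m128i high_halves_avx512(__m256i x)
{
    return _mm256_cvtepi64_epi32(_mm256_srli_epi64(x, 32));  // vpsrlq + vpmovqd
}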


Lane-crossing shuffles are slow on Zen 1. Agner Fog doesn't have numbers for vpermd, but lists vpermps (which probably uses the same hardware internally) at three uops, five cycles of latency, one per four cycles of throughput. https://uops.info/ confirms those numbers for Zen 1.

Zen 2 and Zen 3 have 256-bit-wide vector execution units for the most part, but their lane-crossing shuffles with elements smaller than 128 bits sometimes take multiple uops. Zen 4 improves things, e.g. vpermps with 0.5-cycle throughput and four cycles of latency.

vextractf128 xmm, ymm, 1 is very efficient on Zen 1 (1c latency, 0.33c throughput), which is not surprising since it tracks 256-bit registers as two 128-bit halves. shufps is also efficient (1c latency, 0.5c throughput), and will let you shuffle the two 128b registers into the result you want.

This also saves you a register for the vpermps shuffle mask, which you don't need anymore. (The alternative would be one vpermps to get the elements you want grouped into the high and low lanes for vextractf128, or, if latency is important, two control vectors for 2x vpermps on CPUs where it's single-uop.) So for CPUs with multi-uop vpermps, especially Zen 1, I'd suggest:

__m256d x = /* computed here */;

// Tuned for Zen 1 through Zen 3.  Probably sub-optimal everywhere else.
__m128 hi = _mm_castpd_ps(_mm256_extractf128_pd(x, 1));  // vextractf128
__m128 lo = _mm_castpd_ps(_mm256_castpd256_pd128(x));    // no instructions
__m128 odd  = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(3,1,3,1));  // high 32-bit halves of each double
__m128 even = _mm_shuffle_ps(lo, hi, _MM_SHUFFLE(2,0,2,0));  // low 32-bit halves of each double

On Intel, using three shuffles instead of two reaches two thirds of the optimal throughput, with one cycle extra latency for the first result.

On Zen 2 and Zen 3 where vpermps is two uops vs. one for vextractf128, extract + 2x vshufps is better than 2x vpermps.

Also, the E-cores on Alder Lake have two-uop vpermps but one-uop vextractf128 and vshufps xmm.

  • I've measured that `const __m128i high32 = _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(_mm256_castpd_si256(x), gHigh32Permute));` is faster than `const __m128i high32 = _mm_castps_si128( _mm256_castps256_ps128(_mm256_permutevar8x32_ps(_mm256_castpd_ps(x), gHigh32Permute) ));` . So perhaps there is also a penalty for `double` to `float` bypass? – Serge Rogatch Aug 26 '17 at 21:22
  • @SergeRogatch: Unlikely for shuffles. More likely, `vpermd` just performs differently from `vpermps`. (Agner didn't list them both so I had to guess). Or that whatever you're consuming the result with does better when it's coming from an integer shuffle? AMD has had float vs. double differences for actual FP math instructions, though, according to Agner. (Almost always irrelevant of course, but it's a clue about the internal implementation, like maybe there's some extra tag bits stored with a vector.) – Peter Cordes Aug 26 '17 at 21:25
  • Shouldn't `hi` and `lo` be swapped in `__m128 odd = _mm_shuffle_ps(hi, lo, _MM_SHUFFLE(3,1,3,1));` ? – Serge Rogatch Aug 26 '17 at 22:17
  • @SergeRogatch: good catch, yeah the low 2 elements of the result come from the first source operand. – Peter Cordes Aug 26 '17 at 22:25
  • Confirmed in debug: `(lo, hi, ...)` is the right order. – Serge Rogatch Aug 26 '17 at 22:54
  • @SergeRogatch: you said something about confusing documentation... See http://felixcloutier.com/x86/SHUFPS.html (or the original Intel vol.2 PDF it was extracted from, for instructions where the diagrams get messed up). The "Operation" section has detailed pseudocode for everything, and often there are good diagrams and tables. (e.g. for cmpps, look at cmppd because it's alphabetically first, so they put the good stuff there.) The online "intrinsics finder" is good, but sometimes has a mistake or leaves out some important detail. And it never has diagrams. – Peter Cordes Aug 26 '17 at 22:58