
I have some code using the AVX2 intrinsic _mm256_permutevar8x32_epi32 aka vpermd to select integers from an input vector by an index vector. Now I need the same thing but for 4x32 instead of 8x32. _mm_permutevar_ps does it for floating point, but I'm using integers.

One idea is _mm_shuffle_epi32, but I'd first need to pack my four 32-bit index values into its single 8-bit immediate, that is:

imm[1:0] := idx[31:0]
imm[3:2] := idx[63:32]
imm[5:4] := idx[95:64]
imm[7:6] := idx[127:96]

I'm not sure what's the best way to do that, and moreover I'm not sure it's the best way to proceed. I'm looking for the most efficient method on Broadwell/Haswell to emulate the "missing" _mm_permutevar_epi32(__m128i a, __m128i idx). I'd rather use 128-bit instructions than 256-bit ones if possible (i.e. I don't want to widen the 128-bit inputs then narrow the result).

John Zwinck
  • It's useless to generate an immediate at run-time, unless you're JITing new code. An immediate is a byte that's literally part of the machine-code instruction encoding. That's great if you have a compile-time-constant shuffle (after inlining + template expansion), otherwise forget about those shuffles. – Peter Cordes May 08 '19 at 04:18
  • Your best bet might be to just use `vpermilps` (`_mm_permutevar_ps`) on integer data via casts, if you already have a shuffle-control vector in that format. Otherwise you could translate it to a control vector for `pshufb` (`_mm_shuffle_epi8`), but that's worse than any possible extra bypass delay latency. – Peter Cordes May 08 '19 at 04:20

2 Answers


It's useless to generate an immediate at run-time, unless you're JITing new code. An immediate is a byte that's literally part of the machine-code instruction encoding. That's great if you have a compile-time-constant shuffle (after inlining + template expansion), otherwise forget about those shuffles that take the control operand as an integer (see footnote 1).


Before AVX, the only variable-control shuffle was SSSE3 pshufb (_mm_shuffle_epi8). That's still the only variable-control 128-bit (or in-lane) integer shuffle in AVX2, and I think in AVX-512 as well.

AVX1 added some in-lane 32-bit variable shuffles, like vpermilps (_mm_permutevar_ps). AVX2 added lane-crossing integer and FP shuffles, but somewhat strangely no 128-bit version of vpermd. Perhaps that's because Intel microarchitectures have no penalty for using FP shuffles on integer data. (That's true on Sandybridge-family; I just don't know whether it was part of the reasoning for the ISA design.) But then you'd think they would have added __m128i intrinsics for vpermilps if that's what you were "supposed" to do. Or maybe the compiler / intrinsics design people didn't agree with the asm instruction-set people?


If you have a runtime-variable vector of 32-bit indices and want to do a shuffle with 32-bit granularity, by far your best bet is to just use AVX _mm_permutevar_ps.

_mm_castps_si128(_mm_permutevar_ps(_mm_castsi128_ps(a), idx))

On Intel at least, it won't even introduce any extra bypass latency when used between integer instructions like paddd; i.e. FP shuffles specifically (not blends) have no penalty for use on integer data in Sandybridge-family CPUs.

If there's any penalty on AMD Bulldozer or Ryzen, it's minor and definitely cheaper than the cost of calculating a shuffle-control vector for (v)pshufb.

Using vpermd ymm and ignoring the upper 128 bits of input and output (i.e. via cast intrinsics) would be much slower on AMD (its 128-bit SIMD implementation has to split lane-crossing 256-bit shuffles into multiple uops), and also worse on Intel, where it becomes 3-cycle latency instead of 1 cycle.
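
For reference, this is a minimal sketch of that widen-with-casts approach (my illustration, not code from the question; vperm_via_vpermd is a made-up name). It assumes AVX2 and index values in the 0..3 range, so the undefined upper lane left by the cast is never selected:

__m128i vperm_via_vpermd(__m128i a, __m128i idx) {
    // Not recommended: the low 128 bits of the result are correct only because
    // indices 0..3 never select from the (undefined) upper lane after the cast.
    __m256i a256   = _mm256_castsi128_si256(a);
    __m256i idx256 = _mm256_castsi128_si256(idx);
    return _mm256_castsi256_si128(_mm256_permutevar8x32_epi32(a256, idx256));
}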


@Iwill's answer shows a way to calculate a shuffle-control vector of byte indices for pshufb from a vector of 4x32-bit dword indices. But it uses SSE4.1 pmulld, which is 2 uops on most CPUs and could easily be a worse bottleneck than the shuffles themselves. (See the discussion in comments under that answer.) That matters especially on older CPUs without AVX, some of which can run 2 pshufb per clock, unlike modern Intel (Haswell and later have only 1 shuffle port and easily bottleneck on shuffles; IceLake will add another shuffle port, according to Intel's Sunny Cove presentation).

If you do have to write an SSSE3 or SSE4.1 version of this, it's probably best to stick to plain SSSE3: use pshufb plus a left shift to duplicate a byte within each dword before ORing in the 0,1,2,3 byte offsets in the low bits, rather than using pmulld. SSE4.1 pmulld is multiple uops, and on some CPUs with slow pshufb it's even worse than pshufb. (You might not benefit from vectorizing at all on CPUs with only SSSE3 and not SSE4.1, i.e. first-gen Core 2, because its pshufb is slow-ish.)

On 2nd-gen Core 2 and on Goldmont, pshufb is a single-uop instruction with 1-cycle latency. On Silvermont and first-gen Core 2 it's not so good. But overall I'd recommend pshufb + pslld + por to calculate a control vector for another pshufb if AVX isn't available.
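
A minimal sketch of that SSSE3-only idea (my illustration; vperm_ssse3 is a hypothetical name, and it assumes the index values are already in the 0..3 range; AND with _mm_set1_epi32(3) first if they might not be):

__m128i vperm_ssse3(__m128i a, __m128i idx){
    // Broadcast the low byte of each dword (holding 0..3) to all 4 bytes of that dword.
    __m128i ctl = _mm_shuffle_epi8(idx,
                      _mm_set_epi8(12,12,12,12, 8,8,8,8, 4,4,4,4, 0,0,0,0));
    // Multiply each index by 4.  Values <= 3 only use bits 1:0, so the 32-bit
    // shift can't carry bits across byte boundaries.
    ctl = _mm_slli_epi32(ctl, 2);
    // OR in the byte offsets 0..3 within each selected dword.
    ctl = _mm_or_si128(ctl, _mm_set1_epi32(0x03020100));
    return _mm_shuffle_epi8(a, ctl);
}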

An extra shuffle to prepare for a shuffle is far worse than just using vpermilps on any CPU that supports AVX.


Footnote 1:

You'd have to use a switch or something to select a code path with the right compile-time-constant integer, and that's horrible; only consider it if you don't even have SSSE3 available. It may well be worse than scalar unless the jump-table branch predicts perfectly.
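
Purely to illustrate the shape of that (not something I'd recommend), a hypothetical SSE2-only sketch; vperm_sse2_switch is a made-up name, only two of the 256 required cases are shown, and everything else falls back to scalar code:

__m128i vperm_sse2_switch(__m128i a, __m128i idx) {
    // Pack the low 2 bits of each dword index into an 8-bit immediate value.
    unsigned imm = (unsigned)(_mm_cvtsi128_si32(idx) & 3)
                 | (unsigned)((_mm_extract_epi16(idx, 2) & 3) << 2)
                 | (unsigned)((_mm_extract_epi16(idx, 4) & 3) << 4)
                 | (unsigned)((_mm_extract_epi16(idx, 6) & 3) << 6);
    switch (imm) {
    case 0xE4: return a;                            // indices 0,1,2,3: identity
    case 0x1B: return _mm_shuffle_epi32(a, 0x1B);   // indices 3,2,1,0: reverse
    /* ... a real dispatch needs all 256 cases ... */
    default: {
        int av[4], iv[4], rv[4];                    // scalar fallback for this sketch
        _mm_storeu_si128((__m128i*)av, a);
        _mm_storeu_si128((__m128i*)iv, idx);
        for (int i = 0; i < 4; i++) rv[i] = av[iv[i] & 3];
        return _mm_loadu_si128((const __m128i*)rv);
    }
    }
}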

Peter Cordes

Although Peter Cordes is right that the AVX instruction vpermilps and its intrinsic _mm_permutevar_ps() will probably do the job, if you're working on machines older than Sandy Bridge (i.e. without AVX), an SSE4.1 variant using pshufb works quite well too.

AVX variant

Credits to @PeterCordes

#include <stdio.h>
#include <immintrin.h>


__m128i vperm(__m128i a, __m128i idx){
    // vpermilps on integer data via casts; no bypass-delay penalty on Intel SnB-family.
    return _mm_castps_si128(_mm_permutevar_ps(_mm_castsi128_ps(a), idx));
}


int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1,0,3,2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}

SSE4.1 variant

#include <stdio.h>
#include <immintrin.h>


__m128i vperm(__m128i a, __m128i idx){
    idx = _mm_and_si128  (idx, _mm_set1_epi32(0x00000003)); // keep only the 2-bit dword index
    idx = _mm_mullo_epi32(idx, _mm_set1_epi32(0x04040404)); // replicate 4*idx into every byte of each dword
    idx = _mm_or_si128   (idx, _mm_set1_epi32(0x03020100)); // add byte offsets 0..3 within the selected dword
    return _mm_shuffle_epi8(a, idx);                        // pshufb with the computed byte-shuffle control
}


int main(int argc, char* argv[]){
    __m128i a   = _mm_set_epi32(0xDEAD, 0xBEEF, 0xCAFE, 0x0000);
    __m128i idx = _mm_set_epi32(1,0,3,2);
    __m128i shu = vperm(a, idx);
    printf("%04x %04x %04x %04x\n", ((unsigned*)(&shu))[3],
                                    ((unsigned*)(&shu))[2],
                                    ((unsigned*)(&shu))[1],
                                    ((unsigned*)(&shu))[0]);
    return 0;
}

This compiles down to the crisp

0000000000400550 <vperm>:
  400550:       c5 f1 db 0d b8 00 00 00         vpand  0xb8(%rip),%xmm1,%xmm1        # 400610 <_IO_stdin_used+0x20>
  400558:       c4 e2 71 40 0d bf 00 00 00      vpmulld 0xbf(%rip),%xmm1,%xmm1        # 400620 <_IO_stdin_used+0x30>
  400561:       c5 f1 eb 0d c7 00 00 00         vpor   0xc7(%rip),%xmm1,%xmm1        # 400630 <_IO_stdin_used+0x40>
  400569:       c4 e2 79 00 c1                  vpshufb %xmm1,%xmm0,%xmm0
  40056e:       c3                              retq

The AND-masking is optional if you can guarantee that the control indices will always be the 32-bit integers 0, 1, 2 or 3.

Iwillnotexist Idonotexist
  • Thanks for this. Haswell is the oldest architecture I need to support. Do you think `vpermilps` is better there? – John Zwinck May 08 '19 at 04:43
  • `_mm_mullo_epi32` is SSE4.1 `pmulld`. It costs 2 uops on Haswell and later. (Not a big deal if you're doing CPU detection and using an AVX version on CPUs that support AVX; they won't use this version because `vpermilps` on integer data is better.) @JohnZwinck: **do *not* use this if AVX is actually available, only as a fallback for SSE4.1!** (So you should probably look at code-gen without AVX, so you see any necessary `movdqa` instructions). – Peter Cordes May 08 '19 at 04:43
  • @PeterCordes Okay, your AVX idea is there and credited to yourself. – Iwillnotexist Idonotexist May 08 '19 at 04:48
  • Hmm, depending on what the surrounding code bottlenecks on, also worth considering using `pshufb` to broadcast bytes within dwords, but then you need a shift as well as an OR. Nice use of OR though to create the consecutive byte masks, I was thinking it would be more costly. – Peter Cordes May 08 '19 at 04:49
  • @PeterCordes The shuffle unit is going to become a throughput bottleneck if you also use it to splat the bytes, since often you can only run one shuffle / cc. – Iwillnotexist Idonotexist May 08 '19 at 04:52
  • On Haswell and Broadwell, `pmulld` is 2p0 so it's the bottleneck that limits this to 2c throughput. (I said depending on the surrounding code because even on SKL, maybe the surrounding code is bottlenecked on port 0 / 1, and is using this as part of a larger loop.) More importantly, Intel from Nehalem to IvyBridge had 2-per-clock `pshufb`, and we only want to use this on old CPUs. (Because integer shuffles were only up to 128 bits wide; only FP shuffles were limited to 1c). And `pshufb` is lower latency at 1c than `pmulld` at 10c (HSW), 6c (NHM), or 5c (SnB single uop). – Peter Cordes May 08 '19 at 04:58
  • On SKL, `pmulld` is 10c latency / 1c throughput, running on 2p01. 16-bit multiplies are 1 uop (because I guess they can get that done with separate parts of the FP mantissa multipliers for that element?) but IDK if we can get the job done with a 16-bit low or high half multiply. Or with `pmaddwd`. Looks like no because `3 * 0xFFFF = 0x2FFFD` still doesn't set any bits in the high byte of the 32-bit result. – Peter Cordes May 08 '19 at 05:06
  • @PeterCordes SnB+ has AVX, so it would use `vpermilps` and eat two latency-padding but throughput-agnostic bypass penalties. I was thinking more of my old SSE4.1 C2D as a target for this code: It has 1 `pshufb` and `pmulld`/cc and the multiply has 3cc latency. – Iwillnotexist Idonotexist May 08 '19 at 05:06
  • According to [Agner Fog's results](https://agner.org/optimize/), Penryn/Wolfdale has 2c `pmulld` throughput, and 5c latency. (It's 4 uops, vs 2 on NHM). You're thinking of `pmuludq`, the SSE2 widening multiply. (`_mm_mul_epu32`, not `mullo`). And BTW, `vpermilps` won't have any latency penalties on SnB-family. Shuffles don't have any vec-int / fp bypass-delay latency, this is why it's so good! (And maybe why Intel neglected to add a 128-bit integer version in AVX2 or AVX512?) Some instructions still do, like FP blends. – Peter Cordes May 08 '19 at 05:11
  • Another use-case for your answer (for some readers) might be code that just uses SSE4.1 as a baseline without CPU detection, so runs this everywhere. That's why I brought up performance on Haswell and Skylake. But anyway, on all the non-AVX CPUs, `pmulld` is multiple uops and 2c latency or worse, so I think that swings the balance in favour of just `pshufb` and a left-shift + OR. That has the advantage of only requiring SSSE3, not SSE4.1. (Although first-gen Core2 (conroe/merom) without SSE4.1 has slow `pshufb` and other shuffles on its narrow 64-bit shuffle unit...) – Peter Cordes May 08 '19 at 05:16
  • @PeterCordes The Atoms as well have a “slow” `pshufb` and FFmpeg special-cases them for that reason, IIRC – Iwillnotexist Idonotexist May 08 '19 at 05:18
  • These days the more likely non-AVX CPUs are probably Silvermont/Goldmont. But yeah some Core2/NHM systems are still around. AMD CPUs without AVX also lack SSSE3. They went straight from SSE3 (plus SSE4a) to AVX. [Most recent processor without support of SSSE3 instructions?](//stackoverflow.com/q/52858556) – Peter Cordes May 08 '19 at 05:21
  • Yes, in-order Atom, and Silvermont (including KNL), have only 64-bit `pshufb` units, so like Conroe MMX `pshufb` is single-uop but XMM `pshufb` is 4. Goldmont has single-uop fast `pshufb` (but interestingly 1c throughput vs. 0.5c for the MMX version). But anyway, Silvermont's `pmulld` is 7 uops, 11c throughput (and both run on FP0), so we're probably still much better off avoiding it. – Peter Cordes May 08 '19 at 05:25
  • @PeterCordes Could we substitute `pmulld` by some other creative product instruction? – Iwillnotexist Idonotexist May 08 '19 at 05:32
  • If tuning for Silvermont, remember that the front-end can easily be a bottleneck, although multi-uop instructions stall decode so that's an argument against `pshufb`. But it means using multiple other instructions is going to add up fast with only a 2-wide superscalar pipeline, and in-order execution of vector instructions. And no mov-elimination, so we'll often need `movdqa` instructions. Oh, what if we `packusdw` (1c) down to 16-bit elements, then `pmullw` (1 uop / 5c latency / 2c throughput) to duplicate, then `punpcklwd` (1c) back up to dword for an OR. Worse latency than 1x pshufb... – Peter Cordes May 08 '19 at 05:40
  • Yeah, without OoO exec for vector instructions, that's probably not going to be great unless part of something larger that doesn't bottleneck on it (e.g. uops for the other FP pipe). Agner says for Silvermont: *Instructions on floating point and vector registers cannot execute out of order with other instructions going to the same execution port and pipeline.* I don't think we can do any better than pack/unpack around `pmullw`, because like I said earlier, 16-bit multiplies even `pmaddwd` can't broadcast an integer to all 4 bytes. And 32=>64 widening only reads 2 of the 4 dword inputs. – Peter Cordes May 08 '19 at 05:45