
AVX2 has lots of good stuff. For example, it has plenty of instructions which are pretty much strictly more powerful than their precursors. Take VPERMD: it lets you arbitrarily broadcast/shuffle/permute one 256-bit vector of 32-bit values into another, with the permutation selectable at runtime[1]. Functionally, that obsoletes a whole slew of older unpack, broadcast, permute, shuffle and shift instructions[3].
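
As a quick illustration (my addition, not part of the original question): a minimal sketch of VPERMD through its C intrinsic, `_mm256_permutevar8x32_epi32`, assuming a compiler with AVX2 enabled (e.g. `-mavx2`).

```c
// Minimal sketch of VPERMD: an arbitrary cross-lane permutation of eight
// 32-bit elements, with the index vector coming from a register at runtime.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i src = _mm256_setr_epi32(10, 11, 12, 13, 14, 15, 16, 17);
    // Reverse the elements; any permutation (including duplicates, i.e. a
    // broadcast) would work, and the indices could be computed at runtime.
    __m256i idx = _mm256_setr_epi32(7, 6, 5, 4, 3, 2, 1, 0);
    __m256i dst = _mm256_permutevar8x32_epi32(src, idx);  // VPERMD

    int out[8];
    _mm256_storeu_si256((__m256i *)out, dst);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");   // prints: 17 16 15 14 13 12 11 10
    return 0;
}
```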

Cool beans.

So where is VPERMB? That is, the same instruction, but working on byte-sized elements. Or, for that matter, where is VPERMW, for 16-bit elements? Having dabbled in x86 assembly for some time, I'd say it's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time. It can do any possible permutation, broadcast or byte-wise shuffle. Furthermore, it can also be used to do 16 parallel 4-bit -> 8-bit table lookups[2].
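
To make the lookup use concrete (my own sketch, not part of the original question): `_mm_shuffle_epi8` used as 16 parallel nibble-to-byte lookups, here converting nibbles to ASCII hex digits. Assumes SSSE3.

```c
// Minimal sketch: PSHUFB as 16 parallel 4-bit -> 8-bit table lookups,
// here turning a vector of nibbles (0..15) into ASCII hex digits.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // The 16-byte table: entry i is the hex digit for nibble i.
    const __m128i lut = _mm_setr_epi8('0','1','2','3','4','5','6','7',
                                      '8','9','a','b','c','d','e','f');
    // 16 nibble values (each must be 0..15; an index with bit 7 set would
    // zero the output byte -- that's how PSHUFB treats such selectors).
    const __m128i nibbles = _mm_setr_epi8(15, 0, 1, 2, 10, 11, 12, 13,
                                           3, 4, 5, 6,  7,  8,  9, 14);
    __m128i digits = _mm_shuffle_epi8(lut, nibbles);  // PSHUFB

    char out[17] = {0};
    _mm_storeu_si128((__m128i *)out, digits);
    printf("%s\n", out);   // prints: f012abcd3456789e
    return 0;
}
```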

Unfortunately, PSHUFB wasn't extended to be cross-lane in AVX2, so it is restricted to within-lane behavior. The VPERM instructions are able to do cross-lane shuffles (in fact, "perm" and "shuf" seem to be synonyms in the instruction mnemonics?) - but the 8- and 16-bit versions were omitted.

There doesn't even seem to be a good way to emulate this instruction, whereas you can easily emulate the larger-width shuffles with smaller-width ones (often, it's even free: you just need a different mask).
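
For example (my sketch, not from the original post), a word-granularity shuffle can be expressed as PSHUFB with a byte-granularity control - only the mask changes, so the emulation is effectively free at runtime:

```c
// Minimal sketch: emulating a 16-bit-element shuffle with PSHUFB by using
// a byte-granularity control mask (each word index i becomes byte indices
// 2*i and 2*i+1). Assumes SSSE3.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m128i words = _mm_setr_epi16(100, 101, 102, 103, 104, 105, 106, 107);

    // Desired word permutation: {3, 3, 0, 1, 7, 6, 5, 4},
    // expressed as byte indices: word i -> bytes 2*i, 2*i+1.
    __m128i byte_mask = _mm_setr_epi8( 6, 7,  6, 7,  0, 1,  2, 3,
                                      14,15, 12,13, 10,11,  8, 9);
    __m128i shuffled = _mm_shuffle_epi8(words, byte_mask);

    short out[8];
    _mm_storeu_si128((__m128i *)out, shuffled);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);
    printf("\n");   // prints: 103 103 100 101 107 106 105 104
    return 0;
}
```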

I have no doubt that Intel is aware of the wide and heavy use of PSHUFB, so the question naturally arises as to why the byte variant was omitted in AVX2. Is the operation intrinsically harder to implement in hardware? Are there encoding restrictions forcing its omission?


[1] By "selectable at runtime" I mean that the mask that defines the shuffling behavior comes from a register. This makes the instruction an order of magnitude more flexible than the earlier variants that take an immediate shuffle mask, in the same way that add is more flexible than inc, or a variable shift is more flexible than an immediate shift.

[2] Or 32 such lookups in AVX2.

[3] The older instructions are occasionally useful if they have a shorter encoding, or avoid loading a mask from memory, but functionally they are superseded.
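
Regarding footnote [1], here is a small contrast sketch (my addition, assuming AVX2): the same data shifted with an immediate count versus with per-element counts from a register - the same immediate-to-register jump that VPERMD makes for shuffles.

```c
// An immediate shift applies one count to every element, while the AVX2
// variable shift (VPSLLVD) takes per-element counts from a register.
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    __m256i x = _mm256_set1_epi32(1);

    __m256i a = _mm256_slli_epi32(x, 3);                        // every element << 3
    __m256i counts = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7); // per-element counts
    __m256i b = _mm256_sllv_epi32(x, counts);                   // VPSLLVD

    int av[8], bv[8];
    _mm256_storeu_si256((__m256i *)av, a);
    _mm256_storeu_si256((__m256i *)bv, b);
    for (int i = 0; i < 8; i++) printf("%d/%d ", av[i], bv[i]);
    printf("\n");   // prints: 8/1 8/2 8/4 8/8 8/16 8/32 8/64 8/128
    return 0;
}
```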

BeeOnRope
  • s/babble/dabble/? Also, a good term for "selectable at runtime" is "variable shuffle". The variable-shift instructions (like [`vpsrlvd`](http://www.felixcloutier.com/x86/VPSRLVD:VPSRLVQ.html)) already use this terminology. – Peter Cordes Jun 23 '16 at 02:07
  • Yes, *dabble*, although babble makes sense too from time to time. I'm not sure about "variable". I see the shift thing as being almost orthogonal to the "immediate or not" issue. The issue was you couldn't issue different shift counts for different vector elements. A bit like if vector `add` only allowed adding a single value to all elements. That's distinct from whether the argument can only be specified as an immediate. Granted the shift is a bit special because most don't even have that issue, but it's my impression of what Intel means by "variable" there. – BeeOnRope Jun 23 '16 at 02:32
  • Oh, good point, you already could have the shift-count for all elements in the low64 of an xmm reg. Still I think "variable shuffle" is immediately obvious without explanation, given a bit of context. – Peter Cordes Jun 23 '16 at 02:34
  • Btw, you can emulate `vpermi2b` with about 11 instructions (13 uops). This drops to 5 instructions (7 uops) if you can preprocess the permute vector to be used many times. – Mysticial Dec 20 '17 at 05:49

1 Answer


I'm 99% sure the main factor is the transistor cost of implementation. It would clearly be very useful, so the only reason it doesn't exist must be that the implementation cost was judged to outweigh the significant benefit.

Coding space issues are unlikely; the VEX coding space provides a LOT of room. Like, really a lot, since the field that represents combinations of prefixes isn't a bit-field, it's an integer with most of the values unused.

They decided to implement it for AVX512VBMI, though, with larger element sizes available in AVX512BW and AVX512F. Maybe they realized how much it sucked to not have this, and decided to do it anyway. AVX512F takes a lot of die area / transistors to implement, so much that Intel decided not to even implement it in retail desktop CPUs for a couple generations.

(Part of that is, I think, that these days a lot of code that can take advantage of brand-new instruction sets is written to run on known servers, rather than using runtime dispatch so it can also run on client machines.)

According to Wikipedia, AVX512VBMI isn't coming until Cannonlake, but then we will have vpermi2b, which does 64 parallel table lookups from a 128B table (2 zmm vectors). Skylake Xeon will only bring vpermi2w and larger element sizes (AVX512F + AVX512BW).
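
To show what that buys you (my sketch, assuming AVX512-VBMI hardware and compiler support): the `_mm512_permutex2var_epi8` intrinsic maps to vpermi2b/vpermt2b and does 64 byte-granularity lookups into a 128-byte table held in two vectors.

```c
// Minimal sketch of what vpermi2b provides: 64 parallel byte lookups into a
// 128-byte table held in two zmm registers. Requires AVX512-VBMI hardware
// (e.g. compile with -mavx512vbmi and run on a CPU that has it).
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    unsigned char table[128], indices[64], out[64];
    for (int i = 0; i < 128; i++) table[i] = (unsigned char)(i + 100);
    for (int i = 0; i < 64; i++)  indices[i] = (unsigned char)(127 - 2 * i);

    __m512i tlo = _mm512_loadu_si512(table);        // table bytes 0..63
    __m512i thi = _mm512_loadu_si512(table + 64);   // table bytes 64..127
    __m512i idx = _mm512_loadu_si512(indices);      // 7-bit indices, 0..127

    // result byte i = (idx[i] < 64) ? tlo[idx[i]] : thi[idx[i] - 64]
    __m512i res = _mm512_permutex2var_epi8(tlo, idx, thi);  // vpermi2b / vpermt2b

    _mm512_storeu_si512(out, res);
    for (int i = 0; i < 8; i++) printf("%d ", out[i]);  // first few results
    printf("...\n");
    return 0;
}
```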


I'm pretty sure that thirty-two 32:1 muxers are a lot more expensive than eight 8:1 muxers, even if the 8:1 muxers are 4x wider. They could implement it with multiple stages of shuffling (rather than a single 32:1 stage), since lane-crossing shuffles get a 3-cycle time budget to get their work done. But it would still be a lot of transistors.

I'd love to see a less hand-wavy answer from someone with hardware design experience. I built a digital timer from TTL counter chips on a breadboard once (and IIRC, read out the counter from BASIC on a TI-99/4A, which was very obsolete even ~20 years ago), but that's about it.


> It's pretty clear that the SSE PSHUFB instruction is pretty much among the most useful instructions of all time.

Yup. It was the first variable-shuffle, with a control mask from a register instead of an immediate. Looking up a shuffle mask from a LUT of shuffle masks based on a pcmpeqb / pmovmskb result can do some crazy powerful things. @stgatilov's IPv4 dotted-quad -> int converter is one of my favourite examples of awesome SIMD tricks.
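
That trick, in a minimal sketch (my addition; this is the general pattern, not @stgatilov's actual IPv4 code): compare, movemask, then use the mask as an index into a precomputed table of pshufb controls - here to left-pack the non-space bytes of an 8-byte chunk. Assumes SSSE3.

```c
// Minimal sketch of the "LUT of shuffle masks" pattern: pcmpeqb + pmovmskb
// produce a bitmask, which indexes a precomputed table of pshufb controls.
// Here: left-pack the bytes of an 8-byte chunk that are NOT spaces.
#include <immintrin.h>
#include <stdio.h>
#include <string.h>
#include <stdint.h>

static uint8_t pack_lut[256][16];  // pack_lut[m] packs the set-bit positions of m to the front

static void build_lut(void) {
    for (int m = 0; m < 256; m++) {
        int k = 0;
        memset(pack_lut[m], 0x80, 16);            // 0x80 => pshufb writes zero
        for (int bit = 0; bit < 8; bit++)
            if (m & (1 << bit)) pack_lut[m][k++] = (uint8_t)bit;
    }
}

int main(void) {
    build_lut();
    char buf[16] = "a b cd e";                     // 8 meaningful bytes, rest zero

    __m128i chunk  = _mm_loadu_si128((const __m128i *)buf);
    __m128i spaces = _mm_cmpeq_epi8(chunk, _mm_set1_epi8(' '));   // pcmpeqb
    int keep = ~_mm_movemask_epi8(spaces) & 0xFF;                 // pmovmskb, low 8 bytes

    __m128i ctrl   = _mm_loadu_si128((const __m128i *)pack_lut[keep]);
    __m128i packed = _mm_shuffle_epi8(chunk, ctrl);               // pshufb

    char out[17] = {0};
    _mm_storeu_si128((__m128i *)out, packed);
    printf("'%s'\n", out);   // prints: 'abcde'
    return 0;
}
```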

Peter Cordes
  • Half way through your answer, I was gonna say, "OK, sure - but what would a hardware guy say?" - but then that's what you said :). I imagine the muxes take a lot of transistors, but since PSHUFB is already there you in fact already have 32x 16:1 muxers. So it seems like doing the additional one bit lookup and blend should be fairly easy in the 3 cycle budget. Most of the cross lane machinery is likely also there because of the existing cross lane shuffles. My real world hardware design experience is about on par with yours though. – BeeOnRope Jun 23 '16 at 02:18
  • About vpermi2b, I at one time thought this awesome instruction was coming soon in AVX512F, but now it seems like we don't get it until AVX512VBMI. No idea when that extension is coming. Some of the weaker forms do show up in F and BW extensions though. – BeeOnRope Jun 23 '16 at 02:42
  • @BeeOnRope: dammit, I missed that fact, too. Wikipedia says planned for Cannonlake. I thought it was in AVX512BW, coming in Skylake Purley. But you're right, and even `vpermb` is AVX512VBMI. I'm still disappointed that Xeon-branded SKL cores don't have AVX512, which is what I was hoping for. It's only the much more expensive Xeons that will support it, which sucks for an affordable home desktop :( – Peter Cordes Jun 23 '16 at 03:04
  • Indeed. I held out on a laptop until I could get a Skylake one. Only to find out it was only "Xeons" that would get AVX512, which later turned out to mean "server parts" as you point out (making the Xeon branding confusing). The intrinsics guide is confusing on this count too, showing vpermi2b associated with a ton of intrinsics, many coming before VBMI. The good one is in VBMI though :( – BeeOnRope Jun 23 '16 at 03:07
  • @BeeOnRope: The intrinsics guide looks correct to me. There are three intrinsics for each vector length: one for no masking, one for merge-masking, and one for zero-masking. This is the normal pattern, with the non-512 vector lengths requiring both AVX512VBMI and AVX512VL like all 128/256 EVEX instructions. (There's a 4th intrinsic for `vpermt2b`, since the two forms of the instruction clobber a different operand; either the index (vpermi2b) or one of the table inputs (vpermt2b), so with merge-masking the choice affects more than reg allocation. Hence mask_, mask2_, maskz_ and non-masked) – Peter Cordes Jun 23 '16 at 10:34
  • @BeeOnRope A year ago I sat down and tried to design a permute unit myself. And no matter what I did, I wasn't able to get a full N-to-N permute to use fewer than `O(N^2*W)` transistors without multi-cycling (W is the # of bits per word). Unless there's a Karatsuba-like algorithm for stacking muxes, permutes are going to be expensive either for area or for latency. `PSHUFB` is 16-to-16 x 8 on SSSE3. On AVX2, it's 16-to-16 x 8 x 2. `vpermb` in AVX512-VBMI is 128-to-64 x 8. You do the math. – Mysticial Jun 23 '16 at 21:39
  • Also, the complexities that I mentioned are for transistor counts only. I have completely disregarded the routing traffic, which I imagine would be insane. While I'm not a hardware designer by profession, I've done some of it in school. The hardest part of building a 32-bit Wallace Tree multiplier was the routing traffic rather than the `O(N^2)` transistor count. – Mysticial Jun 23 '16 at 21:42
  • Right, but you have triple the latency budget for, say, `VPERMB`. If you needed to do it in 1 cycle, the `N^2` would make it twice as complex vs the two 128b shuffles that `PSHUFB` gives you on AVX... _but_ `PSHUFB` takes 1 cycle, while `VPERMB` could have a latency of 3 like `VPERMD` (but throughput of 1 per cycle is key). So it doesn't seem like a big leap to get the full width with _triple_ the cycles (3 cycles). I don't know how it's pipelined, however. – BeeOnRope Jun 23 '16 at 22:06
  • @BeeOnRope Latency doesn't matter. It's throughput. Sure you're not getting everything through N^2 transistors in 1 cycle. You're gonna split it into multiple stages so it's 3 or 4 cycles. But once data clears one stage, the stage is ready for the next set of data. IOW when you're running at max capacity, you're gonna have N^2 transistors burning every cycle running different parts of different instances of the instruction. Also, chopping off a small constant factor like 3 from N^2 doesn't solve the fundamental problem that it's N^2. (Assuming Intel doesn't have a better algorithm.) – Mysticial Jun 23 '16 at 22:10
  • If you're willing to sacrifice on throughput as well, then you can easily get away with a small number of transistors by multi-cycling them. It's fairly trivial to build an N-to-N permute using `O(N*W)` transistors if you're willing to limit the throughput to 1 instruction every `O(N)` cycles. (IOW, don't vectorize, but do each element sequentially through the same hardware). But surely this isn't what Intel is after. – Mysticial Jun 23 '16 at 22:15
  • I'm not following you. Let's take at face value that it is `N^2`. Intel is already doing 1-cycle 32 x 16-to-16 byte shuffles with `PSHUFB` in AVX2. So to get to 32 x 32-to-32 shuffles, you are talking a factor of 2^2 = 4. So it's only four times worse. So carving off a factor of 3 or 4, because you have 3x the latency budget, approximately cancels it out. I'm not saying Intel should do N=32 when they were only doing N=2 before, then `N^2` would hurt. I'm asking for a single doubling. FWIW they do get everything through in one cycle for `PSHUFB`. – BeeOnRope Jun 23 '16 at 23:12
  • @PeterCordes - you are right about the intrinsics guide. I had thought that some instructions were tagged with `vpermi2b` while being shown as belonging to the earlier (than `VBMI`) subsets, but that doesn't seem to be the case. – BeeOnRope Jun 23 '16 at 23:30
  • FWIW, AMD's XOP instruction `VPPERM` does a selection across 32 input bytes, from two registers, for 16 output bytes, which is pretty much there, and it has done that with a 1-cycle throughput since it was introduced 5 years ago, despite being on chips that were generally behind Intel in process and performance. So I'm not really buying that the cost is prohibitive. – BeeOnRope Jun 23 '16 at 23:39
  • @BeeOnRope Oh, Intel hardly put much effort at all into AVX2/Haswell. They copy-pasted their execution units and slapped a very light-weight 3-cycle cross-lane permute on it. I doubt they redesigned the entire permute unit. If I had to guess, they spent so much of their energy on making 3-input uops work (for the FMAs) that they couldn't do anything else. – Mysticial Jun 23 '16 at 23:42
  • Also, you missed my previous statement that increasing the latency *doesn't* let you reduce the area if you keep the throughput the same. So increasing the latency of the permute from 1 to 3 cycles doesn't cut the area by 3. – Mysticial Jun 23 '16 at 23:44
  • @BeeOnRope: According to Agner Fog's tables, XOP `vpperm` has 2c latency on all three BD-family uarches. But that's normal: even `pand` is 2c latency. Only a few vector insns have 1c latency, like SSE4A `insertq x,x,i,i`. edit: yeah, this implies that doing the same thing for another 16 selectors in parallel should only be twice as expensive. `vpperm` throughput is still one per 1c. – Peter Cordes Jun 24 '16 at 00:02
  • I read it, and you're right, my latency comments don't make sense in that context. I think it is in fact only a factor of 2 though, not 4, for the full shuffle. I.e., a 32x 32-way shuffle is 2x the complexity of a 32x 16-way shuffle, if I understand your formula (a 16x 16-way shuffle would be 1/4 the complexity). – BeeOnRope Jun 24 '16 at 00:03
  • @Mysticial: so you figure AVX2 could have included a `VPERMB` for a reasonable cost? That's disappointing. I guess they did dword and qword for the benefit of float/double, and vectorizing stuff with int arrays. It makes sense that `vpermi2b` is very expensive. – Peter Cordes Jun 24 '16 at 00:03
  • @PeterCordes - right, but it has 1-cycle throughput. I should say 1-cycle inverse throughput, really. That's the more interesting metric here I think, especially since Mysticial explained that throughput is the main knob you can turn to reduce the design cost (latency just falls out separately from the delay of the design). As you point out, 2-cycle latency is normal on AMD, so you have to pay a bit more attention to the dependency chains. – BeeOnRope Jun 24 '16 at 00:06
  • @PeterCordes I estimate a factor of 2 for the permute unit in transistors. The data routing could be more than 2x. I haven't done the complexity analysis on routing area. Given that permute is probably most of port 5, I can see how they would pass on it. – Mysticial Jun 24 '16 at 00:52
  • One interesting note is that they used to have 2 shuffle units in Sandy Bridge (?), so you could do two PSHUFBs a cycle. Then they dropped down to one and it never came back. So that's some evidence for these units being heavy. – BeeOnRope Jun 24 '16 at 04:47
  • @BeeOnRope: yeah, that's my favourite part of SnB. I assumed they dropped it for Haswell because two copies of the full 256b shuffle unit would be too expensive, and having different throughputs for 128b versions of shuffles that are available in 256b form would be an actual problem to implement. (e.g. SnB/IvB only run `shufps` on p5 (since it has an AVX version), but `pshufb` on p15 (since SnB doesn't do AVX2).) IDK if they could have implemented some cheaper shuffles on a more-limited shuffle unit on P1 (e.g. just xmm/ymm `unpack`/`punpck`), but probably design time/complexity is a problem – Peter Cordes Jun 24 '16 at 04:57
  • By moving all the permutes to port 5, they eliminated the need for any cross-lane routing from the other ports. As far as I can tell, on Haswell, every single vector instruction that crosses a 64-bit boundary is on port 5. From that, we can deduce that port 5 is the *only* port (aside from the load/store units) that has routing across 64-bit lanes. From a processor design standpoint that's a pretty big deal. I emphasize routing a lot because it really takes up that much space. And while routing is stackable (3D) in a chip, they still only have a handful of layers. – Mysticial Jun 24 '16 at 23:01
  • @BeeOnRope An update now that AVX512 hardware is out. `vpermw` is only half the throughput of the 32 and 64-bit granular permutes. The latency is also 7 cycles as opposed to 3. Which means it's being double-cycled through a 32-bit permute unit (3 cycles each) then passed through a MUX to select the right outputs (1 cycle). So don't be surprised if `vpermb` on Cannonlake is either double-cycled through a native 16-bit permute, or quad-cycled through the same 32-bit permute that Skylake X has. Though given the die-shrink, I wouldn't rule out a single-cycle throughput `vpermb`. – Mysticial Dec 15 '17 at 23:13
  • Another update: The permute unit for Cannon Lake looks like a native all-to-all 64-way byte permute. `vpermw` and `vpermb` are single-cycle throughput. But the 2-vector `vperm2b` is 2-cycle throughput. So not the *best* possible scenario, but pretty close to it. /cc @BeeOnRope – Mysticial Jul 18 '18 at 17:34
  • @Mysticial - time to start looking for a new laptop :). Much better than I had expected given other earlier assessments... – BeeOnRope Jul 18 '18 at 18:13
  • @BeeOnRope Don't hold your breath though. These are low-volume, low-end chips only sold in China atm. Don't expect to see them in volume until probably H2 2019 or 2020. Intel still can't get their 10nm yields up and there's no end in sight. – Mysticial Jul 18 '18 at 18:58
  • @Mysticial - yeah, I can wait. – BeeOnRope Jul 18 '18 at 19:17
  • @Mysticial - is there a latency/throughput dump released somewhere? I couldn't find it in the usual places. – BeeOnRope Jul 20 '18 at 15:10