
SSE does not provide a way to shift packed integers each by a different amount (I can use any instructions AVX and older); you can only do uniform shifts, where every element uses the same count. The result I'm trying to achieve for each integer in the vector is this:

i[0] = i[0] & 0b111111;
i[1] = (i[1]>>6) & 0b111111;
i[2] = (i[2]>>12) & 0b111111;
i[3] = (i[3]>>18) & 0b111111;

Essentially, I'm trying to isolate a different group of 6 bits in each integer.
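
Written out as a scalar reference loop (just to make the pattern explicit; the helper name is only for illustration), lane k isolates the 6-bit group starting at bit 6*k:

// Scalar reference for what the vector code should compute.
void isolate_6bit_groups_scalar(const unsigned in[4], unsigned out[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = (in[k] >> (6 * k)) & 0b111111;
}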

So what is the optimal solution?

Things I thought about: you can simulate a variable right shift with a variable left shift and a uniform right shift. I thought about multiplying the packed integers each by a different power of two (thereby simulating a left shift); then, with that result, you can do a uniform right shift to get the answer. The problem is that the specific op I would use for the multiplication, _mm_mullo_epi32, has disappointing latency (10 cycles on Haswell), and in my program the following instructions would have to wait for this result, because it is a dependency for them. Overall I think that method would only be a little faster than the brute-force method of unpacking, shifting with scalar instructions, and repacking the vector, which I think would take about 20 cycles.

Thomas

1 Answer


If AVX2 is available, the variable shift only takes a single efficient instruction: __m128i _mm_srlv_epi32 (__m128i a, __m128i count) (vpsrlvd), plus the 256-bit versions. Per-element variable shifts of 32-bit and 64-bit elements (each element shifted by the corresponding element of a count vector) are available in left, arithmetic-right, and logical-right forms. (Arithmetic right shift isn't available with 64-bit element size.)
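
As a minimal sketch of that for this problem (function name mine; it keeps the same 6-bit mask as the versions below):

#include <immintrin.h>

// AVX2: per-element logical right shift by {0, 6, 12, 18}, then mask to 6 bits.
__m128i isolate_successive_6bits_avx2 (__m128i input)
{
    __m128i counts  = _mm_set_epi32(18, 12, 6, 0);       // element 3 ... element 0
    __m128i shifted = _mm_srlv_epi32(input, counts);     // vpsrlvd
    return _mm_and_si128(shifted, _mm_set1_epi32(0b111111));
}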

AVX512BW adds 16-bit variable-shifts.
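
A tiny sketch of that, assuming AVX512BW + AVX512VL for the 128-bit form (same header as above):

// Per-element logical right shift of 16-bit elements (vpsrlvw).
__m128i srlv_words_example (__m128i v)
{
    return _mm_srlv_epi16(v, _mm_set_epi16(7, 6, 5, 4, 3, 2, 1, 0));
}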

AVX512VBMI has vpmultishiftqb bitfield-extract within each qword. There's an example of using it for unpacking 8 nibbles into 8 bytes for int->hex. For this you'd follow it with an AND mask because it grabs data in 8-bit chunks (but from source positions that don't have to be aligned to byte boundaries).
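
A hedged sketch of that approach for this problem, assuming AVX512VBMI + AVX512VL for the 128-bit form. Per the Intel intrinsics guide, the first operand of _mm_multishift_epi64_epi8 is the per-byte bit-offset control and the second is the data; the control values here are my own derivation, so double-check them:

// Destination byte j of each qword grabs 8 bits from the corresponding source
// qword, starting at the bit offset in control byte j.  Only byte 0 of each
// dword matters here, because the AND clears the other bytes anyway.
__m128i isolate_successive_6bits_vbmi (__m128i input)
{
    __m128i ctrl = _mm_set_epi8(0, 0, 0, 18 + 32,   // dword 3: bit 18 of the high half of qword 1
                                0, 0, 0, 12,        // dword 2: bit 12 of qword 1
                                0, 0, 0,  6 + 32,   // dword 1: bit  6 of the high half of qword 0
                                0, 0, 0,  0);       // dword 0: bit  0 of qword 0
    __m128i grabbed = _mm_multishift_epi64_epi8(ctrl, input);
    return _mm_and_si128(grabbed, _mm_set1_epi32(0b111111));
}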


Emulating it without AVX2:

What kind of dependency chain is this part of? Can you unroll and interleave so two vectors are in flight at once? Two long dep chains in parallel are much better than one long dep chain, if it's so long that the out-of-order window can't see the next dep chain in the next loop iteration.
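
For example, a sketch of a 2x unroll (hypothetical caller; isolate_successive_6bits is whichever single-vector version from further down you pick), so the two chains can overlap in the out-of-order core:

#include <stddef.h>
#include <immintrin.h>

__m128i isolate_successive_6bits (__m128i input);   // defined below

// Two independent vectors per iteration -> two independent dep chains in flight.
void isolate_many (const __m128i *src, __m128i *dst, size_t n_pairs)
{
    for (size_t i = 0; i < n_pairs; i++) {
        __m128i a = isolate_successive_6bits(_mm_loadu_si128(&src[2*i]));
        __m128i b = isolate_successive_6bits(_mm_loadu_si128(&src[2*i + 1]));
        _mm_storeu_si128(&dst[2*i],     a);
        _mm_storeu_si128(&dst[2*i + 1], b);
    }
}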


It could be worth making a separate AVX2 version of your function for use on Haswell and later CPUs (where you can use a variable-shift). If you do that, your function will only use pmulld (mullo_epi32) on CPUs where it's most efficient. (i.e. you avoid SSE4.1 mullo_epi32 on AVX2 CPUs, because it turns out those CPUs made that instruction slower.)

pmulld looks like the best we can do for throughput and fused-domain uop count even on Haswell.

On SnB/IvB, where pmulld is a single uop for the vector-integer multiply unit, the whole function is only 2 uops, with 6 cycle latency and one per 1c throughput. (That latency is worse than I managed with shift/blend, so you'd only want to use pmulld if throughput/code-size matters at all and you aren't bottlenecked purely on latency, e.g. after unrolling.)

If your shift counts are constants and you have spare bits at the top of each element, you can multiply by powers of 2 and then use a fixed right shift. (Related: Shift right every DW in a __m128i by a different amount.) Knocking off the high bits isn't a problem for your bit-field extract, because you're ANDing to keep only the low few bits anyway.

// See the godbolt link below for a version of this with more comments
// SnB/IvB: 6c latency, 2 fused-domain uops.
__m128i isolate_successive_6bits_mul (__m128i input)
{
  // We can avoid the AND if we shift the elements all the way to the left to knock off the high garbage bits.
  // 32 - 6 - 18 = 8 extra bits to left shift
    __m128i mul_constant = _mm_set_epi32(1<<(0+8), 1<<(6+8), 1<<(12+8), 1<<(18+8));
    __m128i left_vshift = _mm_mullo_epi32(input, mul_constant);
    __m128i rightshifted = _mm_srli_epi32(left_vshift, (18+8));
    return rightshifted;
}

A smart way with blends:

(Unfortunately we don't have AVX2 vpblendd for efficient dword blends that can run on any port. pblendw is limited to port 5 on Intel CPUs. blendps could be good for throughput (runs on any port) but can introduce bypass delays between integer instructions.)

Shift and blend so each element ends up with the right total shift count.

AND-mask the low 6 bits after combining everything into one vector.

Latency is as good as the brute-force way (see below) on Skylake and better on earlier Intel SnB-family CPUs, with better throughput everywhere (because of fewer uops). Only two immediate-blends tying up port 5 is nice. (AVX2 vpblendd can run on any port, but then we'd just use vpsrlvd.)

// seems to be optimal for Intel CPUs.
__m128i isolate_successive_6bits (__m128i input)
{ // input =   [ D      C      B     A ]
  // output =  [ D>>18  C>>12  B>>6  A ] & set1(0b111111)
    __m128i right12 = _mm_srli_epi32(input, 12);
    __m128i merged = _mm_blend_epi16(input, right12, 0xF0);  // copy upper half, like `movhps` (but don't use that because of extra bypass delay)
    // merged = [ D>>12  C>>12  B>>0  A>>0 ]
    __m128i right6 = _mm_srli_epi32(merged, 6);
    merged = _mm_blend_epi16(merged, right6, 0b11001100);    // blend in the odd elements
    // merged = [ D>>(12+6)  C>>12  B>>(0+6)  A>>0 ]        
    return _mm_and_si128(merged, _mm_set1_epi32(0b111111));  // keep only the low 6 bits
}
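
A quick sanity check with made-up test values (using the isolate_successive_6bits function above):

#include <stdio.h>
#include <immintrin.h>

int main (void)
{
    // Lane k holds a value whose bits [6k, 6k+5] are the field of interest.
    __m128i in = _mm_set_epi32(0x3F << 18, 0x15 << 12, 0x2A << 6, 0x07);
    unsigned r[4];
    _mm_storeu_si128((__m128i *)r, isolate_successive_6bits(in));
    printf("%u %u %u %u\n", r[0], r[1], r[2], r[3]);   // expect: 7 42 21 63
    return 0;
}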

I put both versions on the Godbolt compiler explorer.

This version is only 5 uops, compiled with gcc 5.3 -O3 -march=ivybridge:

    # input in xmm0, result in xmm0
isolate_successive_6bits:
    vpsrld          xmm1, xmm0, 12                # starts on cycle 0, result ready for the start of cycle 1
    vpblendw        xmm0, xmm0, xmm1, 240         # cycle 1
    vpsrld          xmm1, xmm0, 6                 # cycle 2
    vpblendw        xmm0, xmm0, xmm1, 204         # cycle 3
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip] # cycle 4, result ready on cycle 5
    ret

Every instruction is dependent on the previous, so it has 5c latency. SnB/IvB/HSW/BDW CPUs only have one shift port, so they can't take advantage of the parallelism available in the more brute-force version (which does three shifts with different shift counts). Skylake can, but then the two cycles of blending afterwards eat up the improvement.


The "brute force" way:

Do three shifts with the three different shift counts, and use three immediate blends (pblendw) to combine the four vectors into one that has each desired element. (An intrinsics sketch follows the asm below.)

// same latency as the previous version on Skylake
// slower on previous Intel SnB-family CPUs.
isolate_successive_6bits_parallel:
    vpsrld          xmm1, xmm0, 6            # cycle 0.   SKL: c0
    vpsrld          xmm2, xmm0, 12           # cycle 1 (resource conflict on pre-Skylake).  SKL: c0
    vpblendw        xmm1, xmm0, xmm1, 12     # cycle 2 (input dep).  SKL: c1
    vpsrld          xmm3, xmm0, 18           # cycle 2.  SKL: c1
    vpblendw        xmm0, xmm2, xmm3, 192    # cycle 3 (input dep). SKL: c2
    vpblendw        xmm0, xmm1, xmm0, 240    # cycle 4 (input dep). SKL: c3
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip]  # cycle 5 (input dep). SKL: c4.
    ret
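
In intrinsics, a sketch that should compile to roughly that asm (tree-shaped merging):

__m128i isolate_successive_6bits_parallel (__m128i input)
{   // input = [ D      C      B     A ]
    __m128i r6  = _mm_srli_epi32(input, 6);
    __m128i r12 = _mm_srli_epi32(input, 12);
    __m128i r18 = _mm_srli_epi32(input, 18);
    __m128i lo  = _mm_blend_epi16(input, r6,  0b00001100);   // [ D      C      B>>6  A ]
    __m128i hi  = _mm_blend_epi16(r12,   r18, 0b11000000);   // [ D>>18  C>>12  x     x ]
    __m128i merged = _mm_blend_epi16(lo, hi, 0xF0);          // upper half from hi
    return _mm_and_si128(merged, _mm_set1_epi32(0b111111));
}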

Doing the merging with a linear dependency chain instead of a tree means merging can finish sooner after the last shift result is ready:

isolate_successive_6bits_parallel2:
    vpsrld          xmm1, xmm0, 6          # c0.  SKL:c0
    vpsrld          xmm2, xmm0, 12         # c1.  SKL:c0
    vpblendw        xmm1, xmm0, xmm1, 12   # c2.  SKL:c1
    vpblendw        xmm1, xmm1, xmm2, 48   # c3.  SKL:c2
    vpsrld          xmm0, xmm0, 18         # c2.  SKL:c1
    vpblendw        xmm0, xmm1, xmm0, 192  # c4.  SKL:c3 (dep on xmm1)
    vpand           xmm0, xmm0, XMMWORD PTR .LC0[rip] # c5.  SKL:c4
    ret

Hmm, nope, doesn't help. No gain in latency for SnB to BDW, or for SKL. The first merge can happen after only one shift because the un-shifted input is what we need for one element. If element 0 needed a non-zero shift count, this way would have an advantage for pre-SKL, and maybe a disadvantage for SKL.

Peter Cordes
  • the blend method is actually probably the fastest cause although it's 12 ops it may be less than 12 cycles with throughput, more simple than what i thought of too. I thought briefly about this and just kind of forgot about it cause I like to see if the more intricate things are faster i guess – Thomas Jul 14 '16 at 00:46
  • @Volatile: 12 uops? The brute-force way I described was only 7 uops (AND after merging), with 6c latency on pre-Skylake. I came up with a better way that's only 5 uops. It has better latency on pre-Skylake (5c), and still 5c latency on SKL. – Peter Cordes Jul 14 '16 at 01:54
  • @Volatile: I looked at the mul version: it's only 2 uops on SnB/IvB, and avoids the AND if you take advantage of the shifting to knock off the bits we want to get rid of. Worth considering, esp. if you can make a separate AVX2 version so you don't need to run it on Haswell and later. – Peter Cordes Jul 14 '16 at 02:27
  • You're totally right, it was 7 uops; I realized that too when I went to implement it. Your new 5-uop way is very clever, I like it. Although the mul version is 2 uops, I don't care as much about throughput/code-size as I do about latency right now. Very well thought-out answer. I might consider AVX2 in the future, but my target audience is basically Steam users (http://store.steampowered.com/hwsurvey). Most of them have AVX, but since AVX512 came out recently I'm guessing most don't have it (the Steam survey does not have a flag for AVX512 support, unfortunately). – Thomas Jul 14 '16 at 03:39
  • @Volatile: If you have to do runtime CPU dispatching anyway (for users without AVX), there's no extra overhead for also having an AVX2 version. (Of course it's more code that you have to debug). re: AVX512: Nobody has AVX512 on their desktop CPU. The only hardware with it ATM is 2nd-gen Xeon Phi cards (Knight's Landing). The first "regular" CPUs with it will be Skylake-E (many-core Xeon, not the current quad-core Xeons that use the same silicon as regular Skylake). – Peter Cordes Jul 14 '16 at 03:43
  • @PeterCordes The Knights Landing chips may be "out", but interestingly I can't find any place to buy one aside from Intel's Ninja developer platform. IOW, probably not yet available to consumers. – Mysticial Jul 15 '16 at 19:12
  • @Volatile I'm not sure how relevant this is since the population of hardware enthusiasts is very low. But a good number of people (namely overclockers) explicitly disable AVX for thermal reasons. This could become more prevalent with AVX512 if the power-gap between no-AVX and AVX512 widens. – Mysticial Jul 15 '16 at 19:27
  • @Mysticial: Does this actually improve FPS in games? Or is this just a GHz bragging-contest thing? I guess most games will still have well-optimized SSE4 versions of things, but this may change in the future and CPUs without AVX might only get a not-very-good fallback. It really annoys me that even Skylake Celeron/Pentium CPUs don't have AVX or BMI enabled (probably they disable VEX decoding). Intel is doing a terrible job at making new extensions ubiquitous in the future. (Also, thanks for the info on KNL. I wasn't planning to buy one, so hadn't kept track of its status.) – Peter Cordes Jul 15 '16 at 19:36
  • @PeterCordes I don't know actually since I'm not one of them. But based on what I'm hearing, AVX is still a little-used extension in games since not everybody has it. So there could be a bit of a cat-and-mouse thing going on where game developers (who don't want to multi-version) will skip AVX because not everybody has it, and people don't have AVX because they either disable to get 10-20% higher overclock or like you said, Intel disables them in the lower end chips. – Mysticial Jul 15 '16 at 19:41
  • On the topic of Intel disabling AVX, I don't see that coming to an end. The AVX units probably take up a sizable portion of the chip's area. So there's going to be a lot of chips that have defects in the AVX units but not elsewhere. And economics dictates that Intel tries to profit from those as well. – Mysticial Jul 15 '16 at 19:43
  • @Mysticial: Yeah, silicon defects in the AVX units is what I suspected too, as the motivation. It doesn't make it any less annoying, especially for BMI instructions like `andn` that aren't worth runtime dispatching for, and are mostly [only useful if they're part of the baseline](http://www.realworldtech.com/forum/?threadid=152533&curpostid=152696). (Linus Torvalds agreed with me :) – Peter Cordes Jul 15 '16 at 20:01
  • Can Intel (or anyone) really detect, disable, and bin chips with errors in logic portions (e.g. the portions of the chip implementing AVX)? Really I'm asking about the "detect" part. I know such detection is common for cache slices, but those can be exhaustively tested because their logical structure is simple. Exhaustively testing execution units, however, runs into a combinatorial explosion of possibilities, even without SIMD: try testing even a 32x32 to 64-bit multiply, for example. Perhaps you can rely on implementation regularities to avoid the explosion, but I dunno (IANAEE). – BeeOnRope Jul 18 '16 at 21:13
  • @BeeOnRope: They design tests that cover every transistor, I assume. Not every combination of signal on distant transistors. This is probably made easier by testing via internal debug pins (so they can test all duplicated functionality in execution units on different ports.) – Peter Cordes Jul 18 '16 at 22:14
  • @PeterCordes You made the statement, "Can you unroll and interleave so two vectors are in flight at once?" What is magical about Two here? Is it the reciprocal throughput of so many arithmetic instructions is 0.5? Extrapolating, were the pipeline heavily laden with 1/3 recip.thruput instructions, would "two" become "three" for that specific case? – Veldaeven Mar 16 '21 at 17:28
  • @Veldaeven: Nothing magical, just twice as good, and sometimes a power-of-2 unroll is more convenient to calculate sizes for. It also depends how much ILP the CPU can see within one chain, but if there's none or not much, you want more dep chains than CPU parallelism so scheduling imperfections don't lose cycles. See [Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)](//stackoverflow.com/q/45113527) for a case with VFMA... dep chains: more than 8 is better, even though that's the latency-bandwidth product. – Peter Cordes Mar 16 '21 at 22:53