48

If you have an input array and an output array, but you only want to write those elements which pass a certain condition, what would be the most efficient way to do this in AVX2?

I've seen it done in SSE like this (from: https://deplinenoise.files.wordpress.com/2015/03/gdc2015_afredriksson_simd.pdf):

__m128i LeftPack_SSSE3(__m128 mask, __m128 val)
{
 // Move 4 sign bits of mask to 4-bit integer value.
 // (Renamed from the slide's `int mask`, which shadowed the parameter and wouldn't compile.)
 int mask_bits = _mm_movemask_ps(mask);
 // Select shuffle control data (shufmasks is a 16-entry LUT of shuffle controls, defined elsewhere).
 __m128i shuf_ctrl = _mm_load_si128(&shufmasks[mask_bits]);
 // Permute to move valid values to front of SIMD register
 __m128i packed = _mm_shuffle_epi8(_mm_castps_si128(val), shuf_ctrl);
 return packed;
}

This seems fine for SSE, which is 4 wide and thus only needs a 16-entry LUT, but for AVX, which is 8 wide, the LUT becomes quite large (256 entries of 32 bytes each, i.e. 8 KiB).
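
The slides don't show how shufmasks is built; for reference, here is one plausible way to generate that 16-entry table (a sketch based on my assumption of its layout, not taken from the presentation):

#include <stdint.h>
#include <emmintrin.h>

// Hypothetical construction of the 16-entry shuffle-control LUT referenced above.
// For each 4-bit mask, build the byte-shuffle control that moves the selected
// 32-bit elements to the front; 0x80 in a pshufb control writes a zero byte.
static __m128i shufmasks[16];

static void build_shufmasks(void)
{
    for (int m = 0; m < 16; ++m) {
        uint8_t ctrl[16];
        int out = 0;
        for (int i = 0; i < 4; ++i) {
            if (m & (1 << i)) {                               // element i passes the mask
                for (int b = 0; b < 4; ++b)
                    ctrl[out*4 + b] = (uint8_t)(i*4 + b);     // copy its 4 bytes
                ++out;
            }
        }
        for (int j = out*4; j < 16; ++j)
            ctrl[j] = 0x80;                                   // zero the unused tail
        shufmasks[m] = _mm_loadu_si128((const __m128i *)ctrl);
    }
}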

I'm surprised that AVX doesn't appear to have an instruction for simplifying this process, such as a masked store with packing.

I think with some bit shuffling to count the number of sign bits set to the left of each position you could generate the necessary permutation table, and then call _mm256_permutevar8x32_ps. But that is also quite a few instructions, I think.

Does anyone know of any tricks to do this with AVX2? Or what is the most efficient method?

Here is an illustration of the Left Packing Problem from the above document:

[image: left-packing problem diagram]

Thanks

Froglegs
  • 1
    You could use [VGATHERDPS](http://www.felixcloutier.com/x86/VGATHERDPS:VGATHERQPS.html) under the assumption that the src is in memory. Before that you have to create the appropriate indices from the mask. – zx485 Apr 29 '16 at 08:04
  • 1
    It's worse than you think. The AVX2 256-bit `VPSHUFB` instruction can't move data between the 128-bit vector lanes. You'd need `vpermd` to do that, which would need a second lookup-table. – EOF Apr 29 '16 at 11:12
  • 1
    @EOF: Thanks for this important addition. That [`VPSHUFB`, (scroll down to 'VEX.256 encoded version')](http://www.felixcloutier.com/x86/PSHUFB.html) does not operate on a 256-bit vector but instead operates on two separate 128-bit vectors in a `YMM` is noteworthy. Another **major** inconsistency in the Intel ISA. – zx485 Apr 29 '16 at 13:03
  • 2
    @zx485: I'll have to disagree with you on the "inconsistency". The separate AVX-lanes are actually fairly consistent, with the few instructions that can cross them being explicitly documented. Also, what other ISA even offers 256-bit vectors at all? Yes, there's a price to pay for compatibility, but AVX2 is a really nice vector instruction set. – EOF Apr 29 '16 at 13:06
  • 2
    @EOF: I'll have to disagree with your preceding elaborations, too, but from my/another point of view. Due to _legacy_ over _legacy_, the Intel ISA is highly fragmented. IMHO a thorough cleanup would be beneficial. Intel tried that with IA-64, but in a strange way. Some days ago I read a posting of [_Agner Fog_](https://software.intel.com/en-us/forums/intel-isa-extensions/topic/477541), in which he explains the inconsistencies of the x86/64 architecture proliferated, titled '...a big step forward - but repeating past mistakes!'. – zx485 Apr 29 '16 at 13:23
  • @zx485: AMD cleaned the architecture up a fair bit when they introduced x86-64. They didn't go far enough in my opinion, but there really aren't many cleaner architectures around. MIPS? Horrible multiply/divide until release 6 (which breaks compatibility...). ARM? Disgusting snarl of encodings and instruction-set rubble (ARM, thumb, thumbEE, jazelle), poorly-designed for out-of-order execution, required much more breaking change to go to 64-bit. Maybe POWER is nicer, but x86 is not a lot worse than what ordinary mortals can buy. – EOF Apr 29 '16 at 13:32
  • You can use AVX2 up until just before you store. But when you store use SSE because the result will be packed in each 128-bit lane but not across lanes. – Z boson Apr 29 '16 at 14:48
  • You could use `_mm256_permutevar8x32_ps` to shuffle across-lanes. – Z boson Apr 29 '16 at 15:08
  • ok i moved it to an answer – Froglegs Apr 30 '16 at 00:57
  • @EOF: x86-64 in general is not bad. For user-space programming, I find it pretty easy (except for all the subtleties of saving instruction bytes, since the most useful/common instructions are *not* always the shortest ones). I have several pet-peeves, though (e.g. setcc r/m8 instead of r/m32). I haven't used other SIMD instruction sets, but SSE2 (esp. without SSE4.1) has so many gaps that it's really annoying. Until AVX2, only one variable-mask shuffle. When you see AVX512, you really start to realize what we've been missing without it. – Peter Cordes Apr 30 '16 at 06:44
  • @Zboson: yeah, the trick is creating a mask for `vpermps`. (i.e. I think my AVX2/BMI2 answer is really cool and I want to ping everyone and tell them about it. :P Hopefully there aren't any serious design-flaw bugs in it.) – Peter Cordes Apr 30 '16 at 06:46
  • 1
    For your SSE version have you considered an instruction LUT instead of a data LUT? I mean a jump table to 16 possible `_mm_shuffle_ps` instructions (actually only 11 are necessary)? – Z boson Apr 30 '16 at 12:01
  • @PeterCordes, your AVX2 + BMI2 answer looks interesting. Did you test it to make sure it works? – Z boson Apr 30 '16 at 12:15
  • @Zboson: no, I didn't test it. I don't have a Haswell, and Intel's `sde` emulator doesn't support debugging, IIRC. – Peter Cordes Apr 30 '16 at 13:19
  • I'll test it, I have haswell – Froglegs Apr 30 '16 at 14:32
  • Peter's AVX2 + BMI2 works, tested it on haswell – Froglegs Apr 30 '16 at 15:04
  • What a fantastic diagram! – Dave Dopson May 04 '20 at 19:13

6 Answers

54

AVX2 + BMI2. See my other answer for AVX512. (Update: saved a pdep in 64bit builds.)

We can use AVX2 vpermps (_mm256_permutevar8x32_ps) (or the integer equivalent, vpermd) to do a lane-crossing variable-shuffle.

We can generate masks on the fly, since BMI2 pext (Parallel Bits Extract) provides us with a bitwise version of the operation we need.

Beware that pdep/pext are very slow on AMD CPUs before Zen 3, like 6 uops / 18 cycle latency and throughput on Ryzen Zen 1 and Zen 2. This implementation will perform horribly on those AMD CPUs. For AMD, you might be best with 128-bit vectors using a pshufb or vpermilps LUT, or some of the AVX2 variable-shift suggestions discussed in comments. Especially if your mask input is a vector mask (not an already packed bitmask from memory).

AMD before Zen2 only has 128-bit vector execution units anyway, and 256-bit lane-crossing shuffles are slow. So 128-bit vectors are very attractive for this on Zen 1. But Zen 2 has 256-bit load/store and execution units. (And still slow microcoded pext/pdep.)


For integer vectors with 32-bit or wider elements: either 1) use _mm256_movemask_ps(_mm256_castsi256_ps(compare_mask)).
Or 2) use _mm256_movemask_epi8 and then change the first PDEP constant from 0x0101010101010101 to 0x0F0F0F0F0F0F0F0F to scatter blocks of 4 contiguous bits, and change the multiply by 0xFFU into expanded_mask |= expanded_mask<<4; or expanded_mask *= 0x11; (not tested; see the sketch below). Either way, use the shuffle mask with VPERMD instead of VPERMPS.
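
A minimal sketch of option 2 (untested, like the description above; compress256_epi32 is just an illustrative name):

#include <stdint.h>
#include <immintrin.h>

// Untested sketch of option 2: the mask comes from _mm256_movemask_epi8,
// so each 32-bit element contributes 4 identical mask bits.
__m256i compress256_epi32(__m256i src, uint32_t mask /* from _mm256_movemask_epi8 */)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0F0F0F0F0F0F0F0F); // 4 bits -> low nibble of each byte
  expanded_mask |= expanded_mask << 4;                          // fill each byte: 0x00 or 0xFF

  const uint64_t identity_indices = 0x0706050403020100;         // one dword index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m256i shufmask = _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(wanted_indices));
  return _mm256_permutevar8x32_epi32(src, shufmask);            // vpermd
}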

For 64-bit integer or double elements, everything still Just Works; The compare-mask just happens to always have pairs of 32-bit elements that are the same, so the resulting shuffle puts both halves of each 64-bit element in the right place. (So you still use VPERMPS or VPERMD, because VPERMPD and VPERMQ are only available with immediate control operands.)

For 16-bit elements, you might be able to adapt this with 128-bit vectors.

For 8-bit elements, see Efficient sse shuffle mask generation for left-packing byte elements for a different trick, storing the result in multiple possibly-overlapping chunks.


The algorithm:

Start with a constant of packed 3 bit indices, with each position holding its own index. i.e. [ 7 6 5 4 3 2 1 0 ] where each element is 3 bits wide. 0b111'110'101'...'010'001'000.

Use pext to extract the indices we want into a contiguous sequence at the bottom of an integer register. e.g. if we want indices 0 and 2, our control-mask for pext should be 0b000'...'111'000'111. pext will grab the 010 and 000 index groups that line up with the 1 bits in the selector. The selected groups are packed into the low bits of the output, so the output will be 0b000'...'010'000. (i.e. [ ... 2 0 ])

See the commented code for how to generate the 0b111000111 input for pext from the input vector mask.
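
To make the pext step concrete, here is a tiny standalone demo with made-up values, using the one-index-per-byte constants from the 64-bit code below (requires BMI2, e.g. compile with -mbmi2):

#include <stdio.h>
#include <stdint.h>
#include <immintrin.h>

int main(void) {
    // Suppose movmskps gave us mask = 0b101 (keep elements 0 and 2).
    uint64_t expanded_mask = _pdep_u64(0x5, 0x0101010101010101) * 0xFF;
    // expanded_mask == 0x0000000000FF00FF: each kept element's byte is all-ones.
    uint64_t wanted = _pext_u64(0x0706050403020100, expanded_mask);
    printf("%#llx\n", (unsigned long long)wanted);   // prints 0x200, i.e. indices [ 2, 0 ] packed low
    return 0;
}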

Now we're in the same boat as the compressed-LUT: unpack up to 8 packed indices.

By the time you put all the pieces together, there are three total pext/pdeps (the 64-bit version below saves one of them by keeping one index per byte instead of unpacking 3-bit groups). I worked backwards from what I wanted, so it's probably easiest to understand it in that direction, too. (i.e. start with the shuffle line, and work backward from there.)

We can simplify the unpacking if we work with indices one per byte instead of in packed 3-bit groups. Since we have 8 indices, this is only possible with 64bit code.

See this and a 32bit-only version on the Godbolt Compiler Explorer. I used #ifdefs so it compiles optimally with -m64 or -m32. gcc wastes some instructions, but clang makes really nice code.

#include <stdint.h>
#include <immintrin.h>

// Uses 64bit pdep / pext to save a step in unpacking.
__m256 compress256(__m256 src, unsigned int mask /* from movmskps */)
{
  uint64_t expanded_mask = _pdep_u64(mask, 0x0101010101010101);  // unpack each bit to a byte
  expanded_mask *= 0xFF;    // mask |= mask<<1 | mask<<2 | ... | mask<<7;
  // ABC... -> AAAAAAAABBBBBBBBCCCCCCCC...: replicate each bit to fill its byte

  const uint64_t identity_indices = 0x0706050403020100;    // the identity shuffle for vpermps, packed to one index per byte
  uint64_t wanted_indices = _pext_u64(identity_indices, expanded_mask);

  __m128i bytevec = _mm_cvtsi64_si128(wanted_indices);
  __m256i shufmask = _mm256_cvtepu8_epi32(bytevec);

  return _mm256_permutevar8x32_ps(src, shufmask);
}

This compiles to code with no loads from memory, only immediate constants. (See the godbolt link for this and the 32bit version).

    # clang 3.7.1 -std=gnu++14 -O3 -march=haswell
    mov     eax, edi                   # just to zero extend: goes away when inlining
    movabs  rcx, 72340172838076673     # The constants are hoisted after inlining into a loop
    pdep    rax, rax, rcx              # ABC       -> 0000000A0000000B....
    imul    rax, rax, 255              # 0000000A0000000B.. -> AAAAAAAABBBBBBBB..
    movabs  rcx, 506097522914230528
    pext    rax, rcx, rax
    vmovq   xmm1, rax
    vpmovzxbd       ymm1, xmm1         # 3c latency since this is lane-crossing
    vpermps ymm0, ymm1, ymm0
    ret

(Later clang compiles like GCC, with mov/shl/sub instead of imul, see below.)

So, according to Agner Fog's numbers and https://uops.info/, this is 6 uops (not counting the constants, or the zero-extending mov that disappears when inlined). On Intel Haswell, it's 16c latency (1 for vmovq, 3 for each pdep/imul/pext / vpmovzx / vpermps). There's no instruction-level parallelism. In a loop where this isn't part of a loop-carried dependency, though, (like the one I included in the Godbolt link), the bottleneck is hopefully just throughput, keeping multiple iterations of this in flight at once.

This can maybe manage a throughput of one per 4 cycles, bottlenecked on port1 for pdep/pext/imul plus popcnt in the loop. Of course, with loads/stores and other loop overhead (including the compare and movmsk), total uop throughput can easily be an issue, too.

e.g. the filter loop in my godbolt link is 14 uops with clang, with -fno-unroll-loops to make it easier to read. It might sustain one iteration per 4c, keeping up with the front-end, if we're lucky.

clang 6 and earlier created a loop-carried dependency with popcnt's false dependency on its output, so it will bottleneck on 3/5ths of the latency of the compress256 function. clang 7.0 and later use xor-zeroing to break the false dependency (instead of just using popcnt edx,edx or something like GCC does :/).

gcc (and later clang) does the multiply by 0xFF with multiple instructions, using a left shift by 8 and a sub, instead of imul by 255. This takes 3 total uops vs. 1 for the front-end, but the latency is only 2 cycles, down from 3. (Haswell handles mov at register-rename stage with zero latency.) Most significantly for this, imul can only run on port 1, competing with pdep/pext/popcnt, so it's probably good to avoid that bottleneck.


Since all hardware that supports AVX2 also supports BMI2, there's probably no point providing a version for AVX2 without BMI2.

If you need to do this in a very long loop, the LUT is probably worth it if the initial cache-misses are amortized over enough iterations with the lower overhead of just unpacking the LUT entry. You still need to movmskps, so you can popcnt the mask and use it as a LUT index, but you save a pdep/imul/pext.

You can unpack LUT entries with the same integer sequence I used, but @Froglegs's set1() / vpsrlvd / vpand is probably better when the LUT entry starts in memory and doesn't need to go into integer registers in the first place. (A 32bit broadcast-load doesn't need an ALU uop on Intel CPUs). However, a variable-shift is 3 uops on Haswell (but only 1 on Skylake).
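
Relatedly, here is a hedged sketch of the 32-bit-friendly variant mentioned earlier (my reconstruction of the idea, not the exact code from the Godbolt link): generate packed 3-bit indices with 32-bit pdep/pext, then unpack them with the same broadcast + variable-shift trick:

#include <stdint.h>
#include <immintrin.h>

// Sketch only: 32-bit pdep/pext with 3-bit index groups, unpacked by a variable shift.
// vpermps only reads the low 3 bits of each element, so no final AND is needed.
__m256 compress256_32bit(__m256 src, unsigned int mask /* from movmskps */)
{
  uint32_t expanded_mask = _pdep_u32(mask, 0x249249);  // one bit per 3-bit group
  expanded_mask *= 7;                                  // turn each selected bit into 0b111

  const uint32_t identity_indices = 0xFAC688;          // 7,6,5,4,3,2,1,0 packed as 3-bit groups
  uint32_t wanted_indices = _pext_u32(identity_indices, expanded_mask);

  __m256i packed   = _mm256_set1_epi32(wanted_indices);
  __m256i shufmask = _mm256_srlv_epi32(packed,
      _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));  // each lane's 3-bit index at the bottom
  return _mm256_permutevar8x32_ps(src, shufmask);
}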

Peter Cordes
  • 1
    I tested it on haswell and it works, nice job! The only issue is that for some reason on MSVC _pdep_u64 and _mm_cvtsi64_si128 are only available if compiling for x64. They get defined out in 32bit builds. – Froglegs Apr 30 '16 at 15:03
  • @Froglegs: woot! I love it when code works the first time after getting it to compile. How did the performance compare with your LUT version? In theory the throughput should be great, but in practice CPUs are complicated and can trip on unexpected bottlenecks that uop-counting doesn't find. And like I said for popcnt, of course this doesn't work in 32bit. If you need a 32bit version, unpack with `vmovd` / `vpbroadcastd` / `vpsrlvd` / `vpand` (i.e. your LUT-unpack code, but starting with an integer register instead of a memory value). – Peter Cordes Apr 30 '16 at 15:30
  • Hey Peter, I put together a little performance test, the performance of your method and my LUT are very similar, with the LUT slightly faster. Your method 64x(_pdep_u64) vs x86(using vector shuffle), the x64 approach is about just a hair faster – Froglegs Apr 30 '16 at 16:31
  • I like your method since it doesn't require wasting L1 memory on a LUT ;) My little performance test wasn't really realistic since all it was doing was calculating shuffle masks, real code I'd probably prefer your method. – Froglegs Apr 30 '16 at 16:32
  • Use this code when targeting x86: __m256i indices = _mm256_set1_epi32(wanted_indices); __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8)); __m256i shufmask = _mm256_srli_epi32(m, 29); – Froglegs Apr 30 '16 at 16:45
  • @Froglegs: good idea using 2 shifts instead of shift+and to knock off the garbage bits. On Haswell, `srli` runs on port0 only, so it doesn't compete with `pext/pdep` or the shuffle. And yeah, `set1` is the right approach to broadcasting into a vector. – Peter Cordes Apr 30 '16 at 16:49
  • @Froglegs: talking about 64bit pdep got me thinking: maybe we can use 64b constants and skip the unpack from 3b to 8b. It worked; I saved an insn. Now the 64bit version should have a more significant advantage over 32bit. I also updated the code in the godbolt link significantly, so it works for `-m32`, using fallback code for that. (If you actually use it that way, now you need to unit-test separately in 32bit and 64bit.) – Peter Cordes Apr 30 '16 at 18:34
  • I tested it out, runs about the same speed, but 1 less instruction is good:) MSVC produces the same output as clang, except it uses mov instead of movabs – Froglegs May 01 '16 at 01:48
  • @Froglegs: It's the same instruction. [The GNU toolchain uses `movabs` for `mov r64, imm64` and `mov [abs64], rax` even in Intel-syntax mode](http://www.x86-64.org/documentation/assembly.html), to distinguish them from `mov r/m64, imm32`, but other assemblers/disassemblers don't. If you're bottlenecked on memory, you won't measure a difference, but it does reduce the latency of the sequence from 19 to 16 cycles on Haswell. (or 18 to 15, depending on how the compiler implements the multiply by 0xFF). And of course fewer insns / uops helps throughput. – Peter Cordes May 01 '16 at 01:53
  • 1
    Congrats on getting this right without having the hardware. I am surprised you have not received more than two (from the OP and me) votes. I added an answer using an instruction LUT. What do you think of this solution? Maybe it's a bad idea. – Z boson May 01 '16 at 09:58
  • @Zboson: re: votes. I know, right? The question got a surprising number of votes, so I expected more people would be interested in these answers. – Peter Cordes May 01 '16 at 16:10
  • @PeterCordes, well your answer is I think a bit hard to follow but I think that has more to do with your x86 knowledge being well beyond mine and probably most other people's now than anything else. This makes it harder to parse to determine if it's correct (which is why I asked you if you tested it). But the OP seems happy and that's very important. You can probably pick up an haswell intel NUC for pretty cheap now on ebay or something (broadwell and skylake are exceedingly boring so I will hold off longer). – Z boson May 01 '16 at 18:25
  • If you do understand the data movement in my answer now, can you think of a better way to introduce it? Maybe step by step with ASCII diagrams? How readable do you find the comments in the code? – Peter Cordes May 01 '16 at 19:20
  • @Zboson: I'm shopping for a Skylake system. The built-in GPU is supposed to be significantly better than Haswell. NUCs don't have room for my RAID array :/ Also, 32GB of DDR4 is not very expensive :) I wish they'd make a desktop Skylake with eDRAM, though. Also, my desktop setup needs 2 video outputs (at least one being DVI), and S/PDIF audio output. Also I'd like to be able to throw in a PCIe video card for games if I want. I have a perfectly good Radeon 6870 that idles at about 80 Watts with the open source drivers ... – Peter Cordes May 01 '16 at 19:21
  • The part I do not understand is how _pext_u32/64(identity_indices, expanded_mask) manages to produce the desired indices-- – Froglegs May 02 '16 at 02:29
  • I don't know why more people don't like it, seems pretty useful, but maybe i'm biased-- – Froglegs May 02 '16 at 02:45
  • @Froglegs: Thanks for the feedback on which step wasn't obvious. I had only linked the wikipedia article which had a diagram of pext in operation. Does this update help? I worked through an example of how pext grabs bits and packs them to the low bits of the output. – Peter Cordes May 02 '16 at 02:45
  • Ah I was not looking closely enough, I didn't notice the expanded_mask was the selector, had it reversed in my mind. Makes more sense now--! – Froglegs May 02 '16 at 02:51
  • @Froglegs: yeah, things get tricky when you use one shuffle to create a control-mask for another shuffle. When I first wrote the answer, I just gave up and let the verbose variable-names in the code speak for themselves, because it was getting pretty hairy to explain in words. (Other shuffles where both inputs are variable also get used both ways: `pshufb` can be used as a 4bit LUT with variable shuffle mask and constant data being shuffled. Or it can be used as a shuffle with constant control inputs) – Peter Cordes May 02 '16 at 03:01
  • I find the comments in your assembly code very useful. – Z boson May 02 '16 at 10:47
  • In `filter_non_negative_avx2` and `filter_non_negative_avx512` you `src += 16;`. I think the avx2 version should only add 8. In this special case we could also drop the compare if we don't care about negative zero (movemask extracts the sign bit). Please correct me if I'm wrong. – Christoph Diegelmann Jan 30 '17 at 10:33
  • 1
    @Christoph: Good catch on the loop increment, thanks. I probably copied it from the avx512 version. Also a good point about sign bits. If you're willing to consider `-0.0` as negative, you can even do stuff like XOR floats together and then then use that as the control for a variable blend to do something based on whether their signs are equal or not (e.g. while checking if a point is inside a bounding box), instead of multiplying them or doing two compares. But note that NaNs can have the sign-bit set or not, so this isn't safe wrt to NaNs. – Peter Cordes Jan 30 '17 at 16:01
  • 1
    @PeterCordes , Very nice answer! I think for the Skylake platform a tiny improvement is possible if the `_mm256_movemask_ps` is replaced by `_mm256_movemask_epi8`. Then we don't need the (slow) `pdep` and `imul` instructions, because in that case we already have a 8x4-bit mask which is suitable for the 32 bits `pdep` instruction with `identity_indices = 0x76543210` Instead, we need a `vpsrlvd` and a `vpand` at the end, but these have better latency and throughput on Skylake than `pdep` and `imul`. – wim Feb 02 '17 at 11:00
  • As far as I can see there is no bypass delay for both `vpmovmskb` and `vmovmskps`, at least on Skylake. Another minor advantage is that this code should run in both 32 bit and 64 bit mode. – wim Feb 02 '17 at 11:02
  • Code: `__m256 compress256_v3(__m256 src, __m256 mask256){unsigned int mask = _mm256_movemask_epi8(_mm256_castps_si256(mask256)); const uint32_t identity_indices = 0x76543210; uint32_t wanted_indices = _pext_u32(identity_indices, mask); __m256i fourbitvec = _mm256_set1_epi32(wanted_indices); __m256i shufmask = _mm256_srlv_epi32(fourbitvec,_mm256_set_epi32(28,24,20,16,12,8,4,0)); shufmask = _mm256_and_si256(shufmask,_mm256_set1_epi32(0x0F)); return _mm256_permutevar8x32_ps(src, shufmask);} ` . Godbold [link](https://godbolt.org/g/iUMyyP) – wim Feb 02 '17 at 11:03
  • 1
    @wim Using `_mm256_movemask_epi8` we can get rid of the multiply in @PeterCordes orginal version. First we `_pext_u32` to get `wanted_indices` and then `_pdep_u64` with `0x0F0F0F0F0F0F0F0F` to scale them from nybbles to bytes. Godbold [link](https://godbolt.org/g/Q6rvio) – Christoph Diegelmann Feb 21 '17 at 15:37
  • @Christoph : Your code is nice and a bit shorter, but my code might be slightly faster. Aside from loading the constands, which are likely to be hoisted outside the loop after inling, your code uses `vmovq+vpmovzxbd` instead of `vmovd+vpbroadcastd` in my code, which is equally efficient. However, instead of `pdep` in your code, with latency 3 and throughput 1 on Skylake, my code uses `vpsrlvd+vpand`, which both have latency 1 and throughput 1/2. Moreover, your code only works in 64 bit mode. – wim Feb 22 '17 at 14:37
  • 2
    @Christoph : Correction: On Skylake `vpand` has latency 1 and throughput 1/3. Note that `vpsrlvd` is very slow on Haswell: latency 2 and throughput 2. Therefore, on Haswell your solution will be faster. – wim Feb 22 '17 at 15:34
  • 2
    @wim: AMD's new Zen I think still has 128b vector execution units (so 256b ops have half throughput). Doing more in scalar integer will be a win there, if `pdep` is fast on Zen. (It is supported, but I don't think there are latency numbers yet). I think overall throughput should be more important than latency here, since the loop-carried dependency is only on `popcnt` and its input. Thanks for the `vpmovmskb` idea; I'll update my answer with that sometime. (Or feel free to add a paragraph and a godbolt link to the answer yourself; I might not get back to this very soon). – Peter Cordes Feb 22 '17 at 21:27
  • 2
    @PeterCordes : [This](http://users.atw.hu/instlatx64/AuthenticAMD0800F11_K17_Zen_InstLatX64.txt) webpage lists latency and throughput numbers for the AMD Ryzen/Zen CPU. The numbers are quite interesting. For example: The latency and throughput of the `vpand` instruction with ymm (256 bit) operands is 1c and 0.5c, which is quite amazing for a processor without 256 bit execution units, I think. On the the other hand, the `pext` and `pdep` instructions both have L=18c and T=18c.... The `vpsrlvd` instruction: L=T=4c. – wim Apr 05 '17 at 11:25
  • 1
    Probably they focused mainly on optimizing the most common instructions, when they were developing the micro architecture. – wim Apr 05 '17 at 11:25
  • @wim: Thanks for the link. Yeah, they must be microcoding `pext`/`pdep` instead of including a big hardware execution unit. Looks like Zen has pretty impressive vector throughput for a lot of 256b integer AVX2 instructions. it apparently has two 128b shuffle units, so it can do two PSHUFB xmm per clock or one PSHUFB ymm. It does really well on x265 benchmarks, so apparently the micro-op cache and 6-wide issue is enough to handle splitting 256b ops into two. Unfortunately, it looks much weaker at cross-lane shuffles. `vpermps` is 5c lat and 4c tput, which is bad for this algo. – Peter Cordes Apr 06 '17 at 02:46
  • Zen still has slow `bsr`/`bsf`, but fast `lzcnt`/`tzcnt`. :( And slow `shld`/`shrd`. They did speed up integer multiply, though, to be a lot more like Intel SnB-family (3c latency, 1c throughput for the forms that don't produce a high-half result, even 64-bit. Big speedup over Bulldozer/Piledriver/Steamroller's 6c lat, 4c throughput for 64-bit imul). Overall very impressive (especially given the starting-point), and apparently the cache hierarchy doesn't suck. – Peter Cordes Apr 06 '17 at 02:51
  • Yes, they have made a big step forward! Some instruction even have better timings than on Skylake, – wim Apr 06 '17 at 09:38
  • 1
    Due to the very slow `pdep` and `pext` on ryzen (18 cycles each) @Froglegs' answer is probably the better option if we're unsure about the target CPU. Maybe we should add this as a note to either @Froglegs or your answer. – Christoph Diegelmann Jul 13 '17 at 09:12
  • 1
    @PeterCordes: The `vpand` in my earlier comment can be omitted, because only the 3 least significant bits of `shufmask` are relevant for `vpermps`. This leads to [more efficient code](https://gcc.godbolt.org/z/DbPBrs) on Intel Skylake. Or [with clang](https://gcc.godbolt.org/z/E9YOow). – wim Mar 12 '19 at 08:28
  • 1
    @wim: I haven't gotten back to this answer to really look at the different options people have suggested with variable shifts instead of `pdep`, and so on. It would probably be a good idea to post an answer of your own (or feel free to edit in an AMD-friendly section in this answer, and add a caveat about `pdep` on AMD to mine.) I may get around to that at some point, but I have a chromium window with probably 30 tabs of answers I've been meaning to finish writing or get back to update... – Peter Cordes Mar 12 '19 at 08:37
  • When this is used in a loop, one sorta nifty way to do the tail would be with `vmaskmovps`. Instead of making an 8x256b LUT for the mask, you could make it on the fly with `uint64_t popc = _mm_popcnt_u64(mask); uint32_t sm1 = _bzhi_u32(-1, popc); __m256i sm2 = _mm256_set1_epi32(sm1); sm2 = _mm256_sllv_epi32(sm2, );` and then `_mm256_maskstore_ps(dest, sm2, packed);`. Maybe more efficient to just loop backwards with full 256b stores for the last few vecs, (overlapping work) but sort of a hassle to calc those indexes. – ZachB May 07 '19 at 01:49
  • 1
    @ZachB: you don't need an 8x 32-byte LUT, you only need 60-byte array and take a sliding window. (Or compressed to 15 bytes with `pmovsx` loads). [Vectorizing with unaligned buffers: using VMASKMOVPS: generating a mask from a misalignment count? Or not using that insn at all](//stackoverflow.com/a/34306934). But yes, you can calculate it instead of loading it with variable shifts, because variable shifts saturate the count. – Peter Cordes May 07 '19 at 02:04
  • @ZachB: You only need this at all if your destination buffer doesn't have enough padding at the end, e.g. because it was allocated based on a popcnt of the mask array without leaving an extra 32 bytes at the end. Depending on allocator support, you can `realloc` it down after allocating an array as large as might possibly be needed, and count on the fly. But IDK if there is an `aligned_realloc`. (Hmm, or count on the fly as you calculate the mask bitmaps, although doing that efficiently could be tricky.) – Peter Cordes May 07 '19 at 02:07
  • 1
    Good idea on `pmovsx`. Even if dest is % 32 in len, `compress256` writes garbage to the dont-care bytes, so another use case would be keeping the end untouched. (Darwin's allocator is aggressive about moving `realloc`ed mem when shrinking btw.) – ZachB May 07 '19 at 02:40
  • @ZachB: the whole point of don't-care bytes as padding is that you *are* allowed to step on them with stores :P But sure, if you specifically want *zero* padding that you can count on for horizontal sums or whatever, you could maybe do one masked store to re-zero the padding. (And BTW, you potentially need `dest` length to be `popcount(mask) + 32`, not for the buffer to end at an alignment boundary. The final 32-byte store could be from an all-zeros bitmask mask, so all 32 bytes are past the true end regardless of how that's aligned.) – Peter Cordes May 07 '19 at 02:45
  • 1
    It turns out that `pdep`/`pext` are worse than just "very slow" on AMD (18-19 cycles as you say). They are that, _plus_ 8 uops and roughly 4.5 cycles for _every set bit in the mask_, so "comically slow". For fully set masks, they can clock in at over 500 uops and close to 300 cycles. That's kernel call territory! See e.g., [here](https://twitter.com/trav_downs/status/1202793097962364928). It turns out the reported figures were apparently always with a zero mask, only the one test on uops.info with memory operands captured the full behavior by fluke. – BeeOnRope Dec 07 '19 at 21:16
  • 1
    @PeterCordes I know, right? You'd almost almost be better off with some kind of emulation that you wrote yourself. [This](https://twitter.com/uops_info/status/1202950247900684290) is a better link than the one I included above. – BeeOnRope Dec 07 '19 at 21:21
  • 1
    Huge thanks for this! Was able to use this to do compress for bytes (though haven't measured anything yet): https://github.com/DenisYaroshevskiy/algorithm_dumpster/blob/503f4d7e072e11e483868eeb8a3bfa5b02e6d685/src/simd/pack_detail/compress_mask.h#L54 – Denis Yaroshevskiy Mar 09 '20 at 00:39
  • @PeterCordes - you might be curious - I did measurements for using this to remove chars, shorts and bytes. Added as an answer to this question. – Denis Yaroshevskiy Apr 25 '20 at 19:24
  • You mention "I think clang failed to account for popcnt's dependency on its output register" ... yes, that's a known bug that should be fixed in later versions of clang. I've hit this before. For the curious, the false-dependency exists because for instructions like TZCNT, if the input registers is all 0's then the output register is unchanged (and thus depends on the original value of the output register). POPCNT uses the same chip circuitry and inherits the limitation. Chalk it up to poor ISA design on Intel's part. – Dave Dopson Oct 04 '20 at 07:55
  • @DaveDopson: Yup, updated; more recent gcc and clang know about it since [Replacing a 32-bit loop counter with 64-bit introduces crazy performance deviations with \_mm\_popcnt\_u64 on Intel CPUs](//stackoverflow.com/q/25078285) discovered it. [Why does breaking the "output dependency" of LZCNT matter?](//stackoverflow.com/q/21390165) has more / better detail. It's not bad *ISA* design, just a microarchitectural performance "bug" which got fixed for lzcnt/tzcnt in Skylake, and for popcnt in Cannon/Ice Lake. In fact providing tzcnt/lzcnt to make it possible to avoid bsf/bsr false deps is good. – Peter Cordes Oct 04 '20 at 08:46
  • But thanks for the reminder, updated my answer to mention newer compilers and link the canonical Q&As about it. – Peter Cordes Oct 04 '20 at 08:56
10

See my other answer for AVX2+BMI2 with no LUT.

Since you mention a concern about scalability to AVX512: don't worry, there's an AVX512F instruction for exactly this:

VCOMPRESSPS — Store Sparse Packed Single-Precision Floating-Point Values into Dense Memory. (There are also versions for double (vcompresspd), and for 32- or 64-bit integer elements (vpcompressd / vpcompressq), but not for byte or word (16-bit) elements.) It's like BMI2 pdep / pext, but for vector elements instead of bits in an integer reg.

The destination can be a vector register or a memory operand, while the source is a vector and a mask register. With a register dest, it can merge or zero the upper bits. With a memory dest, "Only the contiguous vector is written to the destination memory location".

To figure out how far to advance your pointer for the next vector, popcnt the mask.

Let's say you want to filter out everything but values >= 0 from an array:

#include <stdint.h>
#include <immintrin.h>
size_t filter_non_negative(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
    const float *endp = src+len;
    float *dst_start = dst;
    do {
        __m512      sv  = _mm512_loadu_ps(src);
        __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);  // true for src >= 0.0, false for unordered and src < 0.0
        _mm512_mask_compressstoreu_ps(dst, keep, sv);   // clang is missing this intrinsic, which can't be emulated with a separate store

        src += 16;
        dst += _mm_popcnt_u64(keep);   // popcnt_u64 instead of u32 helps gcc avoid a wasted movsx, but is potentially slower on some CPUs
    } while (src < endp);
    return dst - dst_start;
}

This compiles (with gcc4.9 or later) to (Godbolt Compiler Explorer):

 # Output from gcc6.1, with -O3 -march=haswell -mavx512f.  Same with other gcc versions
    lea     rcx, [rsi+rdx*4]             # endp
    mov     rax, rdi
    vpxord  zmm1, zmm1, zmm1             # vpxor  xmm1, xmm1,xmm1 would save a byte, using VEX instead of EVEX
.L2:
    vmovups zmm0, ZMMWORD PTR [rsi]
    add     rsi, 64
    vcmpps  k1, zmm0, zmm1, 29           # AVX512 compares have mask regs as a destination
    kmovw   edx, k1                      # There are some insns to add/or/and mask regs, but not popcnt
    movzx   edx, dx                      # gcc is dumb and doesn't know that kmovw already zero-extends to fill the destination.
    vcompressps     ZMMWORD PTR [rax]{k1}, zmm0
    popcnt  rdx, rdx
    ## movsx   rdx, edx         # with _popcnt_u32, gcc is dumb.  No casting can get gcc to do anything but sign-extend.  You'd expect (unsigned) would mov to zero-extend, but no.
    lea     rax, [rax+rdx*4]             # dst += ...
    cmp     rcx, rsi
    ja      .L2

    sub     rax, rdi
    sar     rax, 2                       # address math -> element count
    ret

Performance: 256-bit vectors may be faster on Skylake-X / Cascade Lake

In theory, a loop that loads a bitmap and filters one array into another should run at 1 vector per 3 clocks on SKX / CSLX, regardless of vector width, bottlenecked on port 5. (kmovb/w/d/q k1, eax runs on p5, and vcompressps into memory is 2p5 + a store, according to IACA and to testing by http://uops.info/).

@ZachB reports in comments that in practice, that a loop using ZMM _mm512_mask_compressstoreu_ps is slightly slower than _mm256_mask_compressstoreu_ps on real CSLX hardware. (I'm not sure if that was a microbenchmark that would allow the 256-bit version to get out of "512-bit vector mode" and clock higher, or if there was surrounding 512-bit code.)

I suspect misaligned stores are hurting the 512-bit version. vcompressps probably effectively does a masked 256 or 512-bit vector store, and if that crosses a cache line boundary then it has to do extra work. Since the output pointer is usually not a multiple of 16 elements, a full-line 512-bit store will almost always be misaligned.

Misaligned 512-bit stores may be worse than cache-line-split 256-bit stores for some reason, as well as happening more often; we already know that 512-bit vectorization of other things seems to be more alignment sensitive. That may just be from running out of split-load buffers when they happen every time, or maybe the fallback mechanism for handling cache-line splits is less efficient for 512-bit vectors.

It would be interesting to benchmark vcompressps into a register, with separate full-vector overlapping stores. That's probably the same uops, but the store can micro-fuse when it's a separate instruction. And if there's some difference between masked stores vs. overlapping stores, this would reveal it.
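
A hypothetical version of that experiment might look like the following (untested sketch, not benchmarked; it assumes the destination buffer has a full vector of writable don't-care padding past the true end):

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// Untested sketch: vcompressps into a register, then a plain full-width (overlapping) store.
// Requires dst to have up to 16 floats of don't-care padding after the last kept element.
size_t filter_non_negative_overlap(float *__restrict__ dst, const float *__restrict__ src, size_t len) {
    const float *endp = src + len;
    float *dst_start = dst;
    do {
        __m512    sv   = _mm512_loadu_ps(src);
        __mmask16 keep = _mm512_cmp_ps_mask(sv, _mm512_setzero_ps(), _CMP_GE_OQ);
        __m512 packed  = _mm512_maskz_compress_ps(keep, sv);  // compress within a register
        _mm512_storeu_ps(dst, packed);     // full store; lanes above popcnt(keep) hold garbage
        src += 16;
        dst += _mm_popcnt_u64(keep);
    } while (src < endp);
    return dst - dst_start;
}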


Another idea discussed in comments below was using vpermt2ps to build up full vectors for aligned stores. This would be hard to do branchlessly, and branching when we fill a vector will probably mispredict unless the bitmask has a pretty regular pattern, or big runs of all-0 and all-1.

A branchless implementation with a loop-carried dependency chain of 4 or 6 cycles through the vector being constructed might be possible, with a vpermt2ps and a blend or something to replace it when it's "full". With an aligned vector store every iteration, but only moving the output pointer when the vector is full.

This is likely slower than vcompressps with unaligned stores on current Intel CPUs.

Peter Cordes
  • that makes it much easier, avx512 looks very nice! – Froglegs Apr 30 '16 at 01:08
  • You *can* get rid of the sign-extension, by using `_mm_popcnt_u64()`. – EOF Apr 30 '16 at 11:11
  • @EOF: Good point. Hopefully AMD's AVX2 CPUs will run 64bit popcnt at the same speed at 32bit popcnt (unlike Steamroller where it has one per 4c throughput instead of one per 2c). Agner's tables don't report popcnt_u64 being slower on other CPUs, just Bulldozer-family. – Peter Cordes Apr 30 '16 at 13:23
  • I dunno why, but msvc also doesn't provide popcnt_u64 unless compiling in x64(disabled in 32 bit builds) – Froglegs Apr 30 '16 at 15:12
  • @Froglegs: of course it doesn't. It's a 64bit integer instruction! Stop wasting your time with 32bit builds, IMO. – Peter Cordes Apr 30 '16 at 15:24
  • but I can use 64 bit integers in 32 bit builds, so I don't get why it is disabled, ya I should probably use 64 bit builds – Froglegs Apr 30 '16 at 15:37
  • @Froglegs: look at [the asm for adding two `int64_t`s in 32bit code (`gcc -m32`)](https://godbolt.org/g/fFTz7i): it's two 32bit instructions, `add` and `adc`. If you used `std::bitset<64>::count()` in 32bit code, you'd get a 32bit popcnt of each half, and an add. But `_popcnt_u64` is an intrinsic specifically for the `popcnt r64, r/m64` instruction, not for population-count by any method available. If you're not using gcc, you should use `_popcnt_u32`, but you should still make 64bit builds so your code runs faster! – Peter Cordes Apr 30 '16 at 15:56
  • 1
    Your AVX2 version benchmarks *slightly* (~3%) faster than this version on CSL with GCC8.2. Impressive work there. (The AVX2 version also runs ~4.52x faster than the SSE2 LUT version.) – ZachB May 05 '19 at 11:18
  • @ZachB: Cool. There are some suggestions in comments for alternatives with variable shifts that might be worth trying, I forget if anyone came up with anything that was likely better on Intel as well as AMD (where `pdep` is slow.) Or did you mean to say AVX512, since you're commenting under this answer and getting ~4x the SSE2 version? – Peter Cordes May 05 '19 at 11:38
  • @PeterCordes as far as I understood the comments on the AVX2 version, they all rely on the mask starting out expanded (__m256) instead of being a compressed bit mask (int). I'm starting from the latter and didn't see a way to adapt those suggestions. Not sure what you're asking in last sentence; I tried this AVX512 version on both SKL and CSL but don't have perf counters on either. I was hoping `VCOMPRESSPS` would magically be faster esp. since it's wider. Agner only has timings for that on KNL. – ZachB May 05 '19 at 19:02
  • @ZachB: Ok, I'm surprised at a 4x speedup only doubling the vector width. Agner has SKX numbers, in the spreadsheet tab next to SKL. (Skylake-X and Cascade-Lake X CLX are the server versions with AVX512. SKL often means specifically the client chips without AVX512). `vcompressps` with a vector destination is 2 uops for port 5 so there's a bit of a bottleneck. Presumably as a store it's 3 or 4 uops. – Peter Cordes May 05 '19 at 21:03
  • Probably worth trying AVX512 with 256-bit vectors; using 512-bit vectors disables the vector ALUs on port 1, and reduces max turbo significantly on server chips. Unless you already use 512-bit vectors before and after. Or maybe you run into a memory bottleneck. I would have expected `vcompressps` to be faster than imul + pdep + pext (3 uops for port 1 = bottleneck) as well as 2x shuffles. Or maybe the same speed since it's also 2 port-5 uops. – Peter Cordes May 05 '19 at 21:06
  • 1
    Sorry for the unclear comments. On SKL your AVX2 pdep/pext/shuf is ~4.5x faster than @ZBoson's SSE2 LUT version. On SKX and CLX this 512-bit `vcompressps` version was ~3% slower than pdep/pext/shuf run on the same chips. Since the pdep/pext/shuf version was slightly faster, I think that means it's not mem-bottlenecked. I don't have PMU access on SKX/CLX tho. On CLX, 256-bit `vcompressps` is ~10% faster than 512-bit `vcompressps`; ~6% faster than pdep/pex/shuf. – ZachB May 05 '19 at 21:43
  • I still can't find a `vcompressps` entry in Agner's tables for anything but KNL btw. Are you assuming `vPcompress` has same timings as `vcompress`? – ZachB May 05 '19 at 21:43
  • @ZachB: are you sure you have the latest version? I'm looking at the `.ods` spreadsheet from his 2018-sep .zip. The row is `VCOMPRESPS/PD v{k},v 2 2 p5 3 2` between the `VMASKMOVPS/D`and `VEXPANDPS/PD` rows, in the SKX tab (not SKL). Between SKL and Pentium 4. – Peter Cordes May 05 '19 at 21:52
  • @PeterCordes gah, Agner's table has `VCOMPRESPS/PD` with only one S. The Intel intrinsics guide and arch manual has `vcompressps` with two Ss. /facepalm. – ZachB May 05 '19 at 21:58
  • @ZachB: Oh crap, I didn't even notice until you pointed it out! I happened to be searching on `vcompr` to find it. I randomly decided that was a good probably-unique prefix to find the entry in LibreOffice search. – Peter Cordes May 05 '19 at 22:10
  • 1
    @ZachB: I sent Agner a message about that mistake via his blog (https://www.agner.org/optimize/blog/read.php?i=962), so it should be fixed in the next revision of the tables. http://www.uops.info/html-lat/SKX/VCOMPRESSPS_Z_ZMM_K_ZMM-Measurements.html has SKX latency from vector to result (3c) and from mask to result (6c), as well as actual measurements + IACA output in their table. Memory-destination `vcompressps` is 4 uops like I guessed, no micro-fusion of the store. – Peter Cordes May 05 '19 at 23:03
  • 1
    @ZachB: I think some of the AVX2 suggestions for using variable-shifts *do* work for mask bitmaps, not vector compare masks. You can go from bitmap to vector cheaply with a broadcast + variable shift, e.g. `_mm256_set1_epi32(mask[i])` and then variable-shift to put the appropriate bit as the high bit of each element. Or with AVX512, [`vpmovm2d`](https://www.felixcloutier.com/x86/vpmovm2b:vpmovm2w:vpmovm2d:vpmovm2q). But then you need each chunk of the mask in a `k` register, and loads into `k` registers are expensive. Cheaper to broadcast-load 32 bits of mask and then shift multiple ways. – Peter Cordes May 10 '19 at 01:13
  • @ZachB: I'm also surprised that `vcompressps` isn't doing better. Maybe the compiler is doing narrow loads into `k` registers, or not unrolling, or something. Agner measured `kmovw k1, [rdi]` at 3 uops on SKX, and that's what IACA says (load + p0156 + p5), but http://www.uops.info/table.html says it's only 2 uops. (load + p5). But since we need to popcnt as well, I guess we'd want to `movzx eax, word [rdi]` load into an integer register and popcnt, and `kmovw k1, eax`. `vcompressps` should go at 1 vector per 3 clocks, bottlenecked on port 5, including 2p5 for it + p5 for kmov. – Peter Cordes May 10 '19 at 01:21
  • 1
    @PeterCordes oh, good idea -- I'm actually using that broadcast+variable shift technique to make the mask for `vmaskmovps` in the last iterations, didn't think about applying it to the earlier comments. -- On `vcompressps`, I'm using 256b ops b/c it's marginally faster than 512b; so `movzx eax, byte [rdi]` , `kmovb k1, eax`. https://godbolt.org/z/BUw7XL is the fastest I've got to for AVX2 and AVX512. Unrolling 2x or 4x hasn't helped with AVX2, remains bottlenecked on p1 and p5. Don't have PMU access on CLX/SKX but no measurable time difference there either. – ZachB May 10 '19 at 05:30
  • @ZachB: I wonder if misaligned stores are hurting the 512-bit version. It probably generates a mask and uses it for a masked 256 or 512-bit store, and if that crosses a cache line boundary then it has to get processed separately. And perhaps misaligned 512-bit stores are worse than cache-line-split 256-bit stores for some reason; we already know that 512-bit vectorization of other things seems to be more alignment sensitive than we might expect just from every access being a cache-line split (e.g. when you'd expect a memory bottleneck to hide it) so maybe there's something going on there. – Peter Cordes May 10 '19 at 06:20
  • @ZachB: If you're finding that AVX2-only `pdep` is faster than `vcompressps` but you're still targeting AVX512VL, you *could* use `vcompressps` for the cleanup to keep it simple. And BTW, see [is there an inverse instruction to the movemask instruction in intel avx2?](//stackoverflow.com/a/36491672) for more ideas about bitmap -> vector options. – Peter Cordes May 10 '19 at 06:25
  • 1
    @PeterCordes if it's slowed down heavily by unaligned stores, I wonder if using `vpermt2ps` to build up full, 64B-aligned vectors before storing to mem would help. – ZachB May 10 '19 at 06:51
  • @ZachB. Neat idea, yeah that may be possible. It will create an arbitrarily long dependency chain of shuffles until you have a full vector. Branching once you have a full vector would break the dep chain but could mispredict. Probably anything branchless (re-storing the same vector and using `cmov` to update the dst pointer) would create a serial dependency through a chain of vector shuffles for the entire array. But only a 1 shuffle per input vector, so 3 cycle latency; preparing a control vector is independent per iteration – Peter Cordes May 10 '19 at 06:59
  • 1
    But the shuffle control you build requires accounting for the current length of the vector you're building, not just how many new elements are in the current source vector. And you have to deal with left-over elements that didn't fit into a full vector when you *do* complete one. I think that's going to be a problem. Maybe we can branch on it, but doing it branchlessly probably introduces another at least 1 cycle to blend into the "accumulator", or 3 cycles to shuffle into it with another `vpermt2ps` with a control vec that either keeps it or packs the leftovers. – Peter Cordes May 10 '19 at 07:01
  • 1
    @PeterCordes arghh, I was dumb and benching with arrays that hit RAM. New numbers on CLX where src, dst and mask together fit in L2, arbitrary ops/sec: `pdep/pext/shuf` 200, 256b `vcompressps` 271, 512b `vcompressps` 305; so yes, good on `vcompressps`, 1.35 or 1.52x faster. Also tried with aligned stores (advancing dst by 32 or 64B instead of popcnt) but whatever improvement that may have was negated by greater access. When hitting RAM or L3, 512b `vcompressps` is consistently ~3% slower than 256b version. (Edited to fix L3 comment.) – ZachB May 10 '19 at 23:35
  • 1
    In an unrelated but also very simple case (`vpshufb` in a loop on 64B-aligned data on CLX or SKX), the 512b version was ~28% faster than the 256b version when the data fit in L1; but 512b was 11% slower than 256b for L2, and 15% slower for L3. If it was just a matter of down-clocking from using the full 512b execution units, I would have expected L1 to suffer as well. Not sure why this happens. – ZachB May 11 '19 at 01:59
  • 1
    @ZachB: We know that aligned 512-bit access to L1d is efficient, and 512-bit downclocking isn't a factor of 2. Or at least that 512-bit loads can execute at 2 per clock. IDK about store throughput for commit from store buffer to L1d. I'd have expected a speedup closer to the factor of 2 from the vector width, then downclocking to bring it down to like 70 or 80% faster at worst. Were you doing a masked store or anything unusual or that could cause a loop-carried dependency? If 512-bit vectors made the CPU run more than twice as slow, they'd be mostly pointless – Peter Cordes May 11 '19 at 02:21
  • @ZachB: I wouldn't be surprised if a `_mm512_maskz_compress_ps` followed by a full 512 bit overlapping store performs better than the masked store in `_mm512_mask_compressstoreu_ps`. `_mm512_mask_compressstoreu_ps` has high latency and two subsequent masked stores to the same cache line may hurt each other probably. Note that `_mm512_maskz_compress_ps` has to load data from memory, modify it with the compressed data, and store it back. So, at least overlapping stores are less micro-ops (no load needed). – wim May 11 '19 at 12:06
  • ....(continued): I don't have AVX-512 hardware, but with AVX2 I found once in some particular case that overlapping stores were more efficient than masked stores. – wim May 11 '19 at 12:07
  • @wim: `_mm512_maskz_compress_ps` (or `vcompressps` in asm) decodes to uops that don't include a load. It merges into cache *not* by loading, but by leaving some bytes unmodified when committing from the store buffer. (At least that's my understanding. Possibly there's a RMW cycle *inside* the cache write path, but if so it's invisible and effectively atomic. Unlikely though.) Skylake and later have hardware support for masked stores, because they're first-class citizens in AVX512. I wouldn't be surprised if overlapping stores are even more efficient, but you can't always do that. – Peter Cordes May 11 '19 at 12:16
  • @PeterCordes: You are right, I misinterpreted the http://uops.info. But it would be nice to see if there is actually a performance difference between the two options. – wim May 11 '19 at 12:34
  • 1
    @PeterCordes nothing unusual, no loop-carried deps: https://godbolt.org/z/9SQYCy. (Same source on both sides, just with Vec256 or Vec512 as the template param.) Worth opening a separate question about why this could be? – ZachB May 11 '19 at 21:40
  • 1
    @wim interesting idea, but that benchmarks the same on CLX. – ZachB May 11 '19 at 21:46
  • @ZachB: yeah, definitely would make an interesting new question. \@wim's idea of using `vcompressps zmm{k1}, zmm` and then a separate regular 512-bit is interesting, (but needs a separate cleanup loop once you get < 64 bytes from the end using vcompressps with a memory destination). – Peter Cordes May 12 '19 at 01:51
  • 1
    @wim: yes, definitely worth testing in case there's something funny, like maybe the store buffer doing better at coalescing unmasked stores (which it hopefully can if there are no other stores between. IDK if we need to avoid loads between stores; if coalescing happens post retirement then loads won't matter. `vcompressps zmm{k1}, zmm` and then a separate store should be fewer fused-domain uops for the front end, because a separate store can micro-fuse the store-address+store-data. So you have 2+1 instead of 4. – Peter Cordes May 12 '19 at 01:56
8

If you are targeting AMD Zen this method may be preferred, due to the very slow pdep and pext on Ryzen (18 cycles each).

I came up with this method, which uses a compressed LUT of 768 (+1 padding) bytes instead of 8k. It requires broadcasting a single scalar value, which is then shifted by a different amount in each lane and masked to the lower 3 bits, yielding a 0-7 index for each lane.

Here is the intrinsics version, along with code to build LUT.

#include <cstdint>
#include <immintrin.h>

using u8  = std::uint8_t;   // aliases for the post's custom wrapper types
using u32 = std::uint32_t;

extern u8 g_pack_left_table_u8x3[];   // 769-byte LUT, defined and filled below

//Generate Move mask via: _mm256_movemask_ps(_mm256_castsi256_ps(mask)); etc
__m256i MoveMaskToIndices(u32 moveMask) {
    u8 *adr = g_pack_left_table_u8x3 + moveMask * 3;
    __m256i indices = _mm256_set1_epi32(*reinterpret_cast<u32*>(adr)); // lower 24 bits hold the LUT entry

   // __m256i m = _mm256_sllv_epi32(indices, _mm256_setr_epi32(29, 26, 23, 20, 17, 14, 11, 8));

    //now shift it right to get 3 bits at bottom
    //__m256i shufmask = _mm256_srli_epi32(m, 29);

    //Simplified version suggested by wim
    //shift each lane so desired 3 bits are a bottom
    //There is leftover data in the lane, but _mm256_permutevar8x32_ps  only examines the first 3 bits so this is ok
    __m256i shufmask = _mm256_srlv_epi32 (indices, _mm256_setr_epi32(0, 3, 6, 9, 12, 15, 18, 21));
    return shufmask;
}

u32 get_nth_bits(int a) {
    u32 out = 0;
    int c = 0;
    for (int i = 0; i < 8; ++i) {
        auto set = (a >> i) & 1;
        if (set) {
            out |= (i << (c * 3));
            c++;
        }
    }
    return out;
}
u8 g_pack_left_table_u8x3[256 * 3 + 1];

void BuildPackMask() {
    for (int i = 0; i < 256; ++i) {
        *reinterpret_cast<u32*>(&g_pack_left_table_u8x3[i * 3]) = get_nth_bits(i);
    }
}

Here is the assembly generated by MSVC:

  lea ecx, DWORD PTR [rcx+rcx*2]
  lea rax, OFFSET FLAT:unsigned char * g_pack_left_table_u8x3 ; g_pack_left_table_u8x3
  vpbroadcastd ymm0, DWORD PTR [rcx+rax]
  vpsrlvd ymm0, ymm0, YMMWORD PTR __ymm@00000015000000120000000f0000000c00000009000000060000000300000000
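
For reference, a hypothetical usage sketch (my addition, not from the original post) showing how the shuffle mask might drive a left-packing store, assuming BuildPackMask() has been called once:

// Hypothetical usage: left-pack one __m256 of floats according to a compare mask,
// store it, and return the advanced output pointer.
float* PackLeftStore(float* dst, __m256 val, __m256 cmp_mask) {
    u32 moveMask = (u32)_mm256_movemask_ps(cmp_mask);
    __m256i shufmask = MoveMaskToIndices(moveMask);
    __m256 packed = _mm256_permutevar8x32_ps(val, shufmask);
    _mm256_storeu_ps(dst, packed);             // lanes above the kept count hold don't-care data
    return dst + _mm_popcnt_u32(moveMask);     // advance by the number of elements kept
}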
  
Froglegs
  • It looks like you're inventing your own function names for a lot of stuff. e.g. is `shift_right_zero_extend_v` supposed to be `_mm_srlv_epi32`? I don't really like Intel's intrinsic names (too much typing, their asm mnemonics are much nicer), but it's clearer. Using `auto` doesn't even make it clear that the result is a vector, esp. since `m & 7` is taking advantage of GNU C vector syntax where a scalar can be implicitly broadcast when used with a vector. – Peter Cordes Apr 30 '16 at 01:21
  • Anyway, nice job with using a smaller LUT, I think, but the code might as well be pseudo-code. (Or maybe it's intended that way?) Also, you don't show the popcnt of the mask, and this functions doesn't return the `movmskps` result. So I guess the caller will also have to `movemask`, and then it's up to the compiler to CSE that away? Poor design IMO: have this function take an `unsigned int` mask. – Peter Cordes Apr 30 '16 at 01:25
  • It is real C++ code that compiles actually-- I don't use the intrinsics manually, I have my own wrappers. This was just a test function, for usage I'd have to take the movemask out of the body of the function, you are correct. The m & 7 bit broadcasts 7 to a register and calls _mm256_and_si256. – Froglegs Apr 30 '16 at 02:32
  • 1
    My point was that writing it the boring / annoying way with Intel's really long function names will make it a better answer, since it makes it clearer exactly what steps are taken. I think your LUT has shuffle masks packed into 3 bytes. And you decompress with `pmovzx` or something, then `vpsrlv`, then mask away high garbage in each element? Or are you broadcasting one 32b element and then using a variable shift to extract eight 3b elements? I think the latter. Feel free to copy/paste my text description of what you do. – Peter Cordes Apr 30 '16 at 02:38
  • 1
    Ya, perhaps I should post it with raw intrinsics then, I'll convert it over and post it again. I can post the table gen code also – Froglegs Apr 30 '16 at 02:39
  • 1
    I posted the raw intrinsics code and LUT gen code. Yeah I broadcast 1 32bit integer, but only use the lower 24 bits of it. Each 3 bits contains the index to load from(0-7). – Froglegs Apr 30 '16 at 02:58
  • Your packed LUT entries gave me an idea for generating them on the fly, and thus avoiding the LUT. See my 2nd answer that should work well on existing CPUs with AVX2. It has fairly poor latency, but good throughput. – Peter Cordes Apr 30 '16 at 06:37
  • 1
    @Froglegs: I think you can use a single `_mm256_srlv_epi32` instead of `_mm256_sllv_epi32`, and `_mm256_srli_epi32`, since you only need the 3 bits (per element) at the right position, because `_mm256_permutevar8x32_ps` doesn't care about garbage in the upper 29 bits. – wim Mar 12 '19 at 09:09
  • 1
    hi wim, thanks for the tip. You are correct that only the lower 3 bits matter, I've updated the post so it shows your suggestion. – Froglegs Mar 13 '19 at 16:38
  • I was going to comment the same thing as wim, but your intrinsics version already has the optimization. Looks like you forgot to update the asm, which is what I was looking at. (https://godbolt.org/ has MSVC installed). With a different choice of types, you can probably also avoid the `movsxd` sign-extension. (It may go away after inlining, though.) – Peter Cordes May 08 '19 at 04:36
  • I updated the asm-- also as Peter suggested, by making the movemask unsigned the movsxd went away. – Froglegs May 09 '19 at 15:10
7

This adds more information to a great answer from @PeterCordes: https://stackoverflow.com/a/36951611/5021064.

I implemented std::remove from the C++ standard library for integer types with it. The algorithm, once you can do the compress, is relatively simple: load a register, compress, store. First I'm going to show the variations and then the benchmarks.

I ended up with two meaningful variations on the proposed solution:

  1. __m128i registers, any element type, using _mm_shuffle_epi8 instruction
  2. __m256i registers, element type of at least 4 bytes, using _mm256_permutevar8x32_epi32

When the types are smaller than 4 bytes for the 256 bit register, I split it into two 128 bit registers and compress/store each one separately.

Link to Compiler Explorer where you can see the complete assembly (there is a using for the type and a width (in elements per pack) at the bottom, which you can change to get the different variations): https://gcc.godbolt.org/z/yQFR2t

NOTE: my code is in C++17 and uses custom SIMD wrappers, so I do not know how readable it is. If you want to read my code, most of it is behind the link in the top include on godbolt. Alternatively, all of the code is on github.

Implementations of @PeterCordes' answer for both cases

Note: together with the mask, I also compute the number of elements remaining using popcount. Maybe there is a case where it's not needed, but I have not seen it yet.

Mask for _mm_shuffle_epi8

  1. Write an index for each byte into a half byte: 0xfedcba9876543210
  2. Get pairs of indexes into 8 shorts packed into __m128i
  3. Spread them out using (x << 4 | x) & 0x0f0f

Example of spreading the indexes: let's say the 7th and 6th elements are picked. It means that the corresponding short would be 0x00fe. After << 4 and | we'd get 0x0ffe, and then the & 0x0f0f clears out the middle f, leaving 0x0f0e.

Complete mask code:

#include <immintrin.h>  // SSE intrinsics, _pdep_u64/_pext_u64, _mm_popcnt_u32
#include <cstdint>
#include <utility>

// helper namespace
namespace _compress_mask {

// mmask - result of `_mm_movemask_epi8`, 
// `uint16_t` - there are at most 16 bits with values for __m128i. 
inline std::pair<__m128i, std::uint8_t> mask128(std::uint16_t mmask) {
    const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x1111111111111111) * 0xf;

    const std::uint8_t offset = 
        static_cast<std::uint8_t>(_mm_popcnt_u32(mmask));  // To compute how many elements were selected

    const std::uint64_t compressed_idxes = 
        _pext_u64(0xfedcba9876543210, mmask_expanded); // Do the @PeterCordes answer

    const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes); // 0...0|compressed_indexes
    const __m128i as_16bit = _mm_cvtepu8_epi16(as_lower_8byte);         // From bytes to shorts over the whole register
    const __m128i shift_by_4 = _mm_slli_epi16(as_16bit, 4);             // x << 4
    const __m128i combined = _mm_or_si128(shift_by_4, as_16bit);        // | x
    const __m128i filter = _mm_set1_epi16(0x0f0f);                      // 0x0f0f
    const __m128i res = _mm_and_si128(combined, filter);                // & 0x0f0f

    return {res, offset};
}

}  // namespace _compress_mask

template <typename T>
std::pair<__m128i, std::uint8_t> compress_mask_for_shuffle_epi8(std::uint32_t mmask) {
     auto res = _compress_mask::mask128(mmask);
     res.second /= sizeof(T);  // bit count to element count
     return res;
}
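To show how this plugs into the load / compress / store step described above, here is a minimal usage sketch for the char case (my own illustration, not part of the benchmarked code). It builds on compress_mask_for_shuffle_epi8 above, assumes SSE4.2 (popcnt) plus BMI2 (pdep/pext), and compress_store_128 is a hypothetical name.

// Compress-store one 128-bit register of chars. keep_bytes has 0xFF in byte
// lanes to keep and 0x00 in lanes to drop (e.g. a _mm_cmpgt_epi8 result).
inline char* compress_store_128(char* dst, __m128i vals, __m128i keep_bytes) {
    const std::uint16_t mmask =
        static_cast<std::uint16_t>(_mm_movemask_epi8(keep_bytes));
    const auto [shuffle, count] = compress_mask_for_shuffle_epi8<char>(mmask);
    // Kept bytes land at the front of the store; bytes past `count` are garbage
    // and are expected to be overwritten by later stores.
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst),
                     _mm_shuffle_epi8(vals, shuffle));
    return dst + count;  // advance by the number of kept elements
}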

Mask for _mm256_permutevar8x32_epi32

This is almost one-for-one @PeterCordes' solution - the only difference is the _pdep_u64 part (he suggests it as a note).

The mask that I chose is 0x5555'5555'5555'5555. The idea is: I have 32 bits of mmask, 4 bits for each of 8 integers. I want to end up with 64 bits => I need to convert each of the 32 bits into 2 => therefore 0101b = 5. The multiplier also changes from 0xf to 3, because for each selected integer I get 0x55 in its byte, not 1. For example, if elements 0 and 2 are selected, mmask is 0x00000f0f, mmask_expanded becomes 0x00ff00ff, and the _pext_u64 gives 0x0200 - indexes 0 and 2 packed to the front.

Complete mask code:

// helper namespace
namespace _compress_mask {

// mmask - result of _mm256_movemask_epi8
inline std::pair<__m256i, std::uint8_t> mask256_epi32(std::uint32_t mmask) {
    const std::uint64_t mmask_expanded = _pdep_u64(mmask, 0x5555'5555'5555'5555) * 3;

    const std::uint8_t offset = static_cast<std::uint8_t>(_mm_popcnt_u32(mmask));  // To compute how many elements were selected

    const std::uint64_t compressed_idxes = _pext_u64(0x0706050403020100, mmask_expanded);  // Do the @PeterCordes answer

    // Every index was one byte => we need to make them into 4 bytes
    const __m128i as_lower_8byte = _mm_cvtsi64_si128(compressed_idxes);  // 0000|compressed indexes
    const __m256i expanded = _mm256_cvtepu8_epi32(as_lower_8byte);  // spread them out
    return {expanded, offset};
}

}  // namespace _compress_mask

template <typename T>
std::pair<__m256i, std::uint8_t> compress_mask_for_permutevar8x32(std::uint32_t mmask) {
    static_assert(sizeof(T) >= 4);  // You cannot permute shorts/chars with this.
    auto res = _compress_mask::mask256_epi32(mmask);
    res.second /= sizeof(T);  // bit count to element count
    return res;
}
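For context, here is a minimal sketch of the whole load / compress / store loop for 32-bit ints, built on the helper above. This is my own illustration, not the author's benchmark code; remove_value_epi32 and value are hypothetical names, and it assumes AVX2 plus BMI2.

#include <cstddef>

// Removes every occurrence of `value` from data[0..n), returns the new size.
std::size_t remove_value_epi32(std::int32_t* data, std::size_t n, std::int32_t value) {
    std::int32_t* out = data;
    const __m256i vvalue = _mm256_set1_epi32(value);
    std::size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        const __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(data + i));
        // Byte mask of the elements to keep (those not equal to `value`).
        const std::uint32_t mmask = ~static_cast<std::uint32_t>(
            _mm256_movemask_epi8(_mm256_cmpeq_epi32(v, vvalue)));
        const auto [perm, count] = compress_mask_for_permutevar8x32<std::int32_t>(mmask);
        // Kept elements land at the front of the store; the garbage tail is either
        // overwritten by the next store or lies past the returned size.
        _mm256_storeu_si256(reinterpret_cast<__m256i*>(out),
                            _mm256_permutevar8x32_epi32(v, perm));
        out += count;
    }
    for (; i < n; ++i)  // scalar tail for the last < 8 elements
        if (data[i] != value) *out++ = data[i];
    return static_cast<std::size_t>(out - data);
}

The full 256 bit store never goes out of bounds because out never gets ahead of data + i, but it always writes 8 ints, so anything past the returned size is left in an unspecified state, just like after std::remove.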

Benchmarks

Processor: Intel Core i7 9700K (a modern consumer level CPU, no AVX-512 support)
Compiler: clang, built from trunk near the version 10 release
Compiler options: --std=c++17 --stdlib=libc++ -g -Werror -Wall -Wextra -Wpedantic -O3 -march=native -mllvm -align-all-functions=7
Micro-benchmarking library: google benchmark

Controlling for code alignment:
If you are not familiar with the concept, read this or watch this.
All functions in the benchmark's binary are aligned to a 128 byte boundary. Each benchmarking function is duplicated 64 times, with a different noop slide at the beginning of the function (before entering the loop). The main number I show is the minimum for each measurement. I think this works since the algorithm is inlined. I'm also reassured by the fact that I get very different results for different alignments. At the very bottom of the answer I show the impact of code alignment.
Note: benchmarking code. BENCH_DECL_ATTRIBUTES is just noinline.

The benchmark removes some percentage of 0s from an array. I test arrays with {0, 5, 20, 50, 80, 95, 100} percent of zeroes.
I test 3 sizes: 40 bytes (to see if this is usable for really small arrays), 1000 bytes and 10'000 bytes. I group by size because SIMD performance depends on the size of the data and not the number of elements. The element count can be derived from the element size (1000 bytes is 1000 chars but 500 shorts and 250 ints). Since the time it takes for the non-simd code depends mostly on the element count, the wins should be bigger for chars.

Plots: x - percentage of zeroes, y - time in nanoseconds. padding: min indicates that this is the minimum among all alignments.

40 bytes worth of data, 40 chars

40 bytes, chars

For 40 bytes this does not make sense even for chars - my implementation gets about 8-10 times slower when using 128 bit registers over the non-simd code. So, for example, a compiler should be careful about doing this transformation.

1000 bytes worth of data, 1000 chars

1000 chars

Apparently the non-simd version is dominated by branch prediction: when we get a small amount of zeroes we get a smaller speed up: for no 0s - about 3 times, for 5% zeroes - about a 5-6 times speed up. When the branch predictor can't help the non-simd version, there is about a 27 times speed up. It's an interesting property of simd code that its performance tends to be much less dependent on the data. Using 128 vs 256 bit registers shows practically no difference, since most of the work is still split into two 128 bit registers.

1000 bytes worth of data, 500 shorts

1000 bytes, shorts

Similar results for shorts, except with a much smaller gain - up to 2 times. I don't know why shorts do that much better than chars for the non-simd code: I'd expect shorts to be only two times faster, since there are only 500 shorts, but the difference is actually up to 10 times.

1000 bytes worth of data, 250 ints

1000 bytes, ints

For 1000 bytes of ints, only the 256 bit version makes sense - a 20-30% win, excluding the case with no 0s to remove whatsoever (perfect branch prediction, no removing for the non-simd code).

10'000 bytes worth of data, 10'000 chars

10'000 bytes, chars

The same order of magnitude wins as for 1000 chars: from 2-6 times faster when the branch predictor is helpful to 27 times when it's not.

Same plots, only simd versions:

10'000 chars, no non-simd

Here we can see about a 10% win from using 256 bit registers and splitting them into two 128 bit ones, over the plain 128 bit version. In size the code grows from 88 to 129 instructions, which is not a lot, so it might make sense depending on your use-case. For a baseline, the non-simd version is 79 instructions (as far as I know these are smaller than the SIMD ones, though).

10'000 bytes worth of data, 5'000 shorts

10'000 bytes, shorts

From a 20% to a 9 times win, depending on the data distribution. Not showing the comparison between 256 and 128 bit registers - it's almost the same assembly as for chars, with the same ~10% win for the 256 bit one.

10'000 bytes worth of data, 2'500 ints

10'000 bytes, ints

It seems to make a lot of sense to use 256 bit registers here: this version is about 2 times faster than the 128 bit one. Compared with the non-simd code - from a 20% win with perfect branch prediction to a 3.5-4 times win as soon as it's not.

Conclusion: when you have a sufficient amount of data (at least 1000 bytes) this can be a very worthwhile optimisation for a modern processor without AVX-512.

PS:

On percentage of elements to remove

On one hand it's uncommon to filter out half of your elements. On the other hand a similar algorithm can be used in the partition step during sorting => that is actually expected to have ~50% branch selection.

Code alignment impact

The question is: how much is it worth if the code happens to be poorly aligned (generally speaking, there is very little one can do about it)?
I'm only showing this for 10'000 bytes.
The plots have two lines, for the min and for the max for each percentage point (meaning it's not one best/worst code alignment - it's the best code alignment for a given percentage).

Code alignment impact - non-simd

Chars: code alignment, chars

From 15-20% worse for poor branch prediction to 2-3 times worse when branch prediction helped a lot (the branch predictor is known to be affected by code alignment).

Shorts: code alignment, shorts

For some reason the 0 percent case is not affected at all. It can be explained by std::remove first doing a linear search to find the first element to remove; apparently the linear search for shorts is not affected. Other than that - from 10% to 1.6-1.8 times worse.

Ints: code alignment, ints

Same as for shorts - no 0s is not affected. As soon as we go into the remove part it goes from 1.3 times to 5 times worse than the best case alignment.

Code alignment impact - simd versions

Not showing shorts and ints 128, since it's almost the same assembly as for chars

Chars - 128 bit register (plot: code alignment, chars - 128): about 1.2 times slower

Chars - 256 bit register (plot: code alignment, chars - 256): about 1.1-1.24 times slower

Ints - 256 bit register (plot: code alignment, int - 256): 1.25-1.35 times slower

We can see that for the simd versions of the algorithm, code alignment has significantly less impact compared to the non-simd version. I suspect that this is due to practically not having branches.

Denis Yaroshevskiy
  • 1,218
  • 11
  • 24
  • I have a wild guess about the scalar `char` results being so much slower than `short`: clang is often reckless with false dependencies when using 8-bit integers, e.g. `mov al, [mem]` merging into RAX instead of `movzx eax, byte [mem]` to zero-extend with no dependency on the old contents. Intel since Haswell or so doesn't rename AL separately from RAX (instead merging) so this false dependency can create a loop-carried dependency chain. Maybe with `short` it's avoiding 16-bit operand-size by using `movzx` or `movsx` loads. I haven't checked the asm yet. – Peter Cordes Apr 25 '20 at 21:20
  • code: alignment: i7-9700k is Coffee Lake, which has a working loop buffer (LSD), unlike earlier Skylake-based microarchitectures where microcode updates disabled the LSD. So I guess the loop is too big to fit in the LSD. Except for special cases like when `std::remove` is just doing a linear search for any elements to keep; that tight loop presumably runs from the LSD even if clang unrolls it. – Peter Cordes Apr 25 '20 at 21:31
  • 1
    Hmm, a mixed scalar / SIMD strategy could be good for that sparse case, using branchless SIMD to scan the next 16 or 32 bytes for a non-matching element. (`vpcmpeqb` / `vpmovmskb` / `tzcnt`). But that creates a dependency chain that couples into the next load address so it's potentially horrible. Hmm, maybe looping over the set bits in the mask would be better, `blsr` to reset the lowest set bit, `tzcnt` to find that offset, and scalar copy into `*dst++` ... – Peter Cordes Apr 25 '20 at 21:34
  • ... With software pipelining of the outer loop, you could be loading and comparing to get the mask for the *next* loop before doing the current inner loop, so that work can be in flight when the loop branch in this loop-over-mask-bits mispredicts on loop exit. And you can combine masks into a 64-bit integer so you stay in that inner loop longer. So you might have one mispredict per 64 input elements, however many output elements that is. And consistent patterns might make that predictable. – Peter Cordes Apr 25 '20 at 21:37
  • @PeterCordes - not sure if I follow everything. 1) That's an interesting guess, I didn't realise that was a thing. Seems like clang generates `movzx [byte]`: https://gcc.godbolt.org/z/xzy_-v 2) From what I heard - LSD should kick in after 64 iterations and after that the alignment should not matter. However, I had not seen it happen (not for remove not for other algorithms). std::remove is quite small: https://github.com/llvm/llvm-project/blob/969e7edd88ceb4791eb1cac22828290f0ae30b3d/libcxx/include/algorithm#L2101 Do you maybe know why alignment is still this much of an issue? – Denis Yaroshevskiy Apr 25 '20 at 22:01
  • 3) Just clarifying what you mean: mostly zeroes, one check for multiple elements and then compute indexes of elements to store by doing bit operations to compute offsets? Nit. – Denis Yaroshevskiy Apr 25 '20 at 22:03
  • The LSD can only hold 64 uops; the IDQ that feeds the issue stage can lock down those uops when it notices a loop, turning the IDQ into the loop buffer. If you look at the `lsd.uops` perf counter, out of the total `uops_issued.any`, you can see how many uops came from the loop buffer. – Peter Cordes Apr 25 '20 at 22:11
  • 1
    3) yeah, for a case where most elements get removed, keeping only a few, I guess you'd invert the mask so the elements you wanted to keep were the `1` bits. And yeah, then you iterate `mask &= mask-1` (BLSR) to loop over just the set bits. With BMI1 that has single-cycle latency as a loop-carried dependency. In each iteration, you do `*dst++ = srcptr[tzcnt(mask)];`. Where `srcptr` is the start of the 64-element chunk that `mask` was derived from. So the scalar work is BLSR / jnz (loop carried), and not loop-carried: TZCNT, mov load with scaled-index addressing, mov store, dst++. – Peter Cordes Apr 25 '20 at 22:16
  • @PeterCordes on code alignment: I ran perf for best/worst cases for remove/int/10'000. Both show 0 for LSD.UOPS. They show though a difference in mispredicted branches: 0.13% for the best case vs 1.08% for the worst case. If you can have a look, here is the output from perf (using it for the second time ever - most likely I made a mistake): https://gist.github.com/DenisYaroshevskiy/4546197583a4a5abd485dea5397abaa4 – Denis Yaroshevskiy Apr 26 '20 at 00:04
  • Exactly zero `lsd.uops` for the whole program for all cases sounds like it hasn't actually been re-enabled in your Coffee Lake CPU, contrary to what https://en.wikichip.org/wiki/intel/microarchitectures/coffee_lake#Key_changes_from_Kaby_Lake says :( You're using `perf stat -e` correctly. BTW, you can set `/proc/sys/kernel/perf_event_paranoid` to `0` so you don't need `sudo` for perf to work. (Or some higher level to still stop kernel profiling but allow user-space.) – Peter Cordes Apr 26 '20 at 00:12
  • Even without the LSD, the uop cache can often keep up with the front-end. Although with up-to-date microcode to fix the recent JCC erratum, you might see more impact for code alignment than normal if GCC isn't aligning JCC instructions to avoid legacy decode for a given 32-byte block. Hmm, I wonder if maybe Intel disabled the LSD on Coffee Lake because of that, if they had it working previously. [32-byte aligned routine does not fit the uops cache](https://stackoverflow.com/a/61016915) / https://www.phoronix.com/scan.php?page=article&item=intel-jcc-microcode&num=1 – Peter Cordes Apr 26 '20 at 00:17
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/212510/discussion-between-denis-yaroshevskiy-and-peter-cordes). – Denis Yaroshevskiy Apr 26 '20 at 00:19
6

In case anyone is interested, here is a solution for SSE2 which uses an instruction LUT (a jump table) instead of a data LUT. With AVX this would need 256 cases though.

Each time you call LeftPack_SSE2 below it uses essentially three instructions: jmp, shufps, jmp. Five of the sixteen cases don't need to modify the vector.

static inline __m128 LeftPack_SSE2(__m128 val, int mask)  {
  switch(mask) {
  case  0:
  case  1: return val;
  case  2: return _mm_shuffle_ps(val,val,0x01);
  case  3: return val;
  case  4: return _mm_shuffle_ps(val,val,0x02);
  case  5: return _mm_shuffle_ps(val,val,0x08);
  case  6: return _mm_shuffle_ps(val,val,0x09);
  case  7: return val;
  case  8: return _mm_shuffle_ps(val,val,0x03);
  case  9: return _mm_shuffle_ps(val,val,0x0c);
  case 10: return _mm_shuffle_ps(val,val,0x0d);
  case 11: return _mm_shuffle_ps(val,val,0x34);
  case 12: return _mm_shuffle_ps(val,val,0x0e);
  case 13: return _mm_shuffle_ps(val,val,0x38);
  case 14: return _mm_shuffle_ps(val,val,0x39);
  case 15: return val;
  }
}

__m128 foo(__m128 val, __m128 maskv) {
  int mask = _mm_movemask_ps(maskv);
  return LeftPack_SSE2(val, mask);
}
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 2
    If you're going to branch on the mask, you might as well hard-code the popcnt in each case. Return it in an `int *` parameter or something. (`popcnt` came after `pshufb`, so if you have to fall back to an SSE2 version, you don't have hardware popcnt either.) If SSSE3 `pshufb` is available, a (data) LUT of shuffle masks may be better if the data is unpredictable. – Peter Cordes May 01 '16 at 15:56
  • Since the pshufb masks have a known relationship inside each group of 4B, they could be compressed from `[ D+3 D+2 D+1 D | C+3 ... ]` down to just 4B `[ D C B A ]`, and unpacked with `punpcklbw same,same` / `punpcklwd same,same` / `paddb x, [ 3 2 1 0 | 3 2 1 0 | ... ]`. That's 3 shuffles and an add instead of just one pshufb, though. Or unpack the mask with a `pshufb`, so it's 2 shuffles and a paddb. Anyway, that makes the LUT only 16 * 4B = 64B = one cache line, at the cost of needing two other 16B constants in registers, or as memory operands. – Peter Cordes May 01 '16 at 16:06
  • @PeterCordes, good point about `popcnt`. I thought of that but the OP did not request it even though it seems like something the OP would need. When I look at the jump table in GCC I can't make out yet how GCC created it. I would expect it jump in the order I listed the codes 0,1,2,3...15 (or maybe reverse order) but it seems random. It's 0xe, 0x34, 0x1, 0x38, 0x2, 0x8,0x9, 0x39, 0x3, 0xc, oxd. If you look at the codes above you can see that that's not at all in order. Do you know why this is? – Z boson May 01 '16 at 18:17
  • 1
    Maybe it started to order it for a decision-tree of branches before deciding on a jump-table strategy. It amuses me that [when making PIC code](https://godbolt.org/g/UxcpHb), it decided on a table of 4B displacements that it loads with `movsx`. If it's going to `movsx` anyway, might as well use 1B displacements for a smaller table. It also doesn't know that the input will always be 0..15, so it checks for outside that range and returns zero :/ – Peter Cordes May 01 '16 at 19:31
  • @PeterCordes, what is the `.L4[0+rax*8]] .L4: .quad .L1` stuff? Is there someway to get GCC to ouput the hex values rather than decimal values? I used objdump instead of GCC -S because objdump shows the hex values. – Z boson May 02 '16 at 06:48
  • @PeterCordes, maybe [this link](https://github.com/0xAX/linux-insides/blob/master/Theory/asm.md) is useful to you. – Z boson May 02 '16 at 07:13
  • `.L4:` is the table. `.quad .L1` is the first entry, as a 64bit pointer. `[.L4 + rax*8]` is an addressing mode that indexes the table with `rax`. (In AT&T syntax, `.L4(%rax)`, which is probably why gcc's Intel syntax likes to put labels outside the `[]`. Or maybe it's just copying MASM like it does for other aspects of the syntax.) – Peter Cordes May 02 '16 at 14:10
  • Thanks for the link, but it has some massive flaws. It ends up telling you that the correct way to increment a memory operand is to use it as an *input* operand with a `"memory"` clobber. (The actual correct way is as a `"+m"` read-write output operand, without the optimization-defeating `"memory"` clobber. And he should have been using a `"+rm"` constraint, not just one or the other.) Also, the asm doesn't always match the code: it looked like the asm just returned the value, instead of passing it to printf. That's minor, though. – Peter Cordes May 02 '16 at 14:21
  • 1
    re: hex: you mean like this [Godbolt feature-request](https://github.com/mattgodbolt/gcc-explorer/issues/107)? Having gcc do it internally would probably be ideal, maybe submitting a patch to gcc would be better than having godbolt post-process the output. Esp. because it would be useful outside of godbolt.org! – Peter Cordes May 02 '16 at 14:26
  • @PeterCordes, to be honest, I only read the link for about 2 minutes. It was in the top 30 links on https://news.ycombinator.com/ yesterday so I thought it might be good. [Here](https://news.ycombinator.com/item?id=11606518) is the discussion. – Z boson May 03 '16 at 08:13
  • 4
    @Zboson: Note that since gcc 8.1 it is a good idea to add a `default: __builtin_unreachable();` in the `switch`. This leads to [slightly more efficient code](https://gcc.godbolt.org/z/1oJwhk), with one `cmp/ja` less than without the `default` case. – wim Apr 11 '19 at 06:48
0

This is perhaps a bit late, but I recently ran into this exact problem and found an alternative solution which uses a strictly AVX implementation. If you don't care whether unpacked elements are swapped with the last elements of each vector, this could work as well. The following is an AVX version:

inline __m128 left_pack(__m128 val, __m128i mask) noexcept
{
    const __m128i shiftMask0 = _mm_shuffle_epi32(mask, 0xA4);
    const __m128i shiftMask1 = _mm_shuffle_epi32(mask, 0x54);
    const __m128i shiftMask2 = _mm_shuffle_epi32(mask, 0x00);

    __m128 v = val;
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask0));
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask1));
    v = _mm_blendv_ps(_mm_permute_ps(v, 0xF9), v, _mm_castsi128_ps(shiftMask2));
    return v;
}

Essentially, each element in val is shifted once to the left using the permute control 0xF9 and then blended with its unshifted variant. Next, both shifted and unshifted versions are blended against the input mask (which has the first non-zero element broadcast across the remaining elements 3 and 4). Repeat this process two more times, broadcasting the second and third elements of mask to its subsequent elements on each iteration, and this should provide an AVX version of the _pdep_u32() BMI2 instruction.

If you don't have AVX, you can easily swap out each _mm_permute_ps() with _mm_shuffle_ps() for an SSE4.1-compatible version.

And if you're using double-precision, here's an additional version for AVX2:

inline __m256d left_pack(__m256d val, __m256i mask) noexcept
{
    const __m256i shiftMask0 = _mm256_permute4x64_epi64(mask, 0xA4);
    const __m256i shiftMask1 = _mm256_permute4x64_epi64(mask, 0x54);
    const __m256i shiftMask2 = _mm256_permute4x64_epi64(mask, 0x00);

    __m256d v = val;
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask0));
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask1));
    v = _mm256_blendv_pd(_mm256_permute4x64_pd(v, 0xF9), v, _mm256_castsi256_pd(shiftMask2));

    return v;
}

Additionally, _mm_popcnt_u32(_mm_movemask_ps(_mm_castsi128_ps(mask))) can be used to determine the number of elements which remain after the left-packing.
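A possible way to combine the two is sketched below. This is my own hedged illustration, not from the answer: compress_store4 is a hypothetical name, and it additionally assumes the POPCNT instruction is available.

#include <immintrin.h>

// Compress-store one vector of floats using left_pack above, then advance the
// output pointer by the number of kept elements. `keep` is a comparison-style
// mask with the sign bit set in the lanes to keep.
inline float* compress_store4(float* dst, __m128 val, __m128i keep) {
    _mm_storeu_ps(dst, left_pack(val, keep));  // lanes past the kept ones hold leftovers
    const int kept = _mm_popcnt_u32(
        static_cast<unsigned>(_mm_movemask_ps(_mm_castsi128_ps(keep))));
    return dst + kept;
}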

icdae
  • 45
  • 1
  • 4
  • Is that faster than a lookup table of shuffle control vectors for `_mm_shuffle_epi8`? Like `__m128i shuffles[16] = ...` which you index with the `_mm_movemask_ps` result? If you're only doing 4 elements per vector, the lookup table is small enough to be usable and fast. I guess maybe if you only have to do this a couple times, not in a long-running loop, then spending 9 instructions per vector (with 3 of them being blendv which is multi-uop on Intel) might be ok to avoid the possibility of a cache miss on the LUT. – Peter Cordes Jul 24 '21 at 21:42
  • Can the `_mm256_permute4x64_pd(v, 0xF9)` shuffles be replaced with different shuffles of `val` to shorten the dependency chain a bit, making it easier for out-of-order exec to hide the latency? Or do they all need to shuffle the previous blend result? – Peter Cordes Jul 24 '21 at 21:44
  • 1
    I tested with a LUT, similar to Z boson's reply but with `_mm_shuffle_epi8`, and yes, it is significantly faster (at least in my current usage, always profile for your specific case). There will be no out-of-order execution with the final three permutations as the results rely on each previous instruction. I'm certain there should be a way to avoid, or at least reduce, the dependency chain. If I find one then I'll definitely post it. – icdae Jul 24 '21 at 23:33