
I would like to implement the following function using SSE. It blends elements from a with packed elements from b: where a mask bit is set, the result takes the element of a, and where it is clear, it takes the next packed element of b, so b only stores the elements that are actually used.

#include <stdint.h>

void packedBlend16(uint8_t mask, uint16_t* dst, uint16_t const* a, uint16_t const* b) {
  for (int i = 0; i < 8; ++i) {
    int const control = mask & (1 << i);
    dst[i] = control ? a[i] : *b++;   // bit set: keep a[i]; bit clear: consume next packed element of b
  }
}

The tricky part for me is spacing out the elements of b correctly in the vector.

So far, my approach is:

  1. 256-entry LUT[mask] to get a shuffle control that expands the elements of b into the right places with pshufb
  2. Construct a blend vector from the mask with vpbroadcastw + vpand + vpcmpeqw
  3. pblendvb a with the shuffled elements of b
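
Roughly, those three steps would look something like the following (untested sketch; shuffleLUT is just a placeholder name for the 256-entry table, which would be built offline, and the full 16-byte load of b can over-read past the elements that are actually used):

#include <immintrin.h>
#include <stdint.h>

// 256 x 16 bytes: lane i gets the byte indices of the next packed element of b
// if mask bit i is clear, or 0x80 (pshufb zeroes the lane) if it is set
extern const uint8_t shuffleLUT[256][16];

void packedBlend16_sse(uint8_t mask, uint16_t* dst,
                       uint16_t const* a, uint16_t const* b) {
  __m128i av   = _mm_loadu_si128((__m128i const*)a);
  __m128i bv   = _mm_loadu_si128((__m128i const*)b);          // may over-read unused b elements
  __m128i shuf = _mm_loadu_si128((__m128i const*)shuffleLUT[mask]);

  // 1. expand the packed elements of b into their destination lanes
  __m128i bx = _mm_shuffle_epi8(bv, shuf);

  // 2. per-word blend mask: 0xFFFF where the mask bit is set
  __m128i bits   = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
  __m128i mv     = _mm_set1_epi16(mask);                       // vpbroadcastw
  __m128i keep_a = _mm_cmpeq_epi16(_mm_and_si128(mv, bits), bits);

  // 3. lanes with the bit set take a, the rest take the expanded b
  _mm_storeu_si128((__m128i*)dst, _mm_blendv_epi8(bx, av, keep_a));
}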

I suspect that a 256-entry LUT is not the best approach. I could potentially use two 16-entry LUTs for the lower/upper 4 bits, but then I would have to add an offset to the upper LUT's result based on the popcnt of the lower 4 bits of the mask.
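
For reference, combining the two half-LUTs would look roughly like this (untested; lut4Lo and lut4Hi are hypothetical 16 x 8-byte tables, with the upper table's b indices written as if b started at element 0):

#include <immintrin.h>
#include <stdint.h>

extern const uint8_t lut4Lo[16][8];   // shuffle bytes for dst words 0..3 (0x80 = keep a)
extern const uint8_t lut4Hi[16][8];   // shuffle bytes for dst words 4..7, b indices from 0

static inline __m128i packedShuffleControl(uint8_t mask) {
  __m128i lo = _mm_loadl_epi64((__m128i const*)lut4Lo[mask & 0xF]);
  __m128i hi = _mm_loadl_epi64((__m128i const*)lut4Hi[mask >> 4]);
  // offset the upper half's b indices by the bytes the lower half consumed;
  // the 0x80 "keep a" lanes stay >= 0x80 after the small add
  int consumed = 2 * (4 - _mm_popcnt_u32(mask & 0xF));
  hi = _mm_add_epi8(hi, _mm_set1_epi8((char)consumed));
  return _mm_unpacklo_epi64(lo, hi);
}

The popcnt and extra add are the price for shrinking the tables from 4 KB (256 x 16) to 256 bytes (2 x 16 x 8).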

I'm doing 4 of these independently at a time, so I want to maximize throughput, but can accept latency.

Are there other approaches that I could take?

Nick
  • You tagged Intel, so can we assume you have efficient BMI2 `pdep` / `pext` (Haswell and later)? They're slow on AMD, but fast on Intel, and can be used for hardware left-packing: [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240). `pext` is a bit version of AVX512 `vpcompressd`. My answer there only works for 32-bit elements, but it could be worth considering unpacking to 32-bit for `vpermd`, or generating `pshufb` masks with i, i+1 byte indices. – Peter Cordes May 16 '20 at 19:59
  • If you had [AVX512VBMI2 `VPCOMPRESSW`](https://github.com/HJLebbink/asm-dude/wiki/VPCOMPRESS) you'd be all set, but that's only in Ice Lake. The reason Intel introduced new instructions for left-packing is that it's hard to do efficiently without HW support, and as you noticed looking up shuffle controls from a LUT doesn't scale well to wider vectors. :/ – Peter Cordes May 16 '20 at 20:03
  • Yeah, we can assume BMI2. VPCOMPRESSW could be very interesting in the future, but not usable for me yet. I'll check out your answer. – Nick May 16 '20 at 20:36
  • Would it also be worth trying to stick with scalar code and handle 8 bytes at a time (still with 2-byte width), or go AVX2 and handle 32 bytes at a time with 32-bit width? – Nick May 16 '20 at 20:42
  • Yeah with 16-bit elements it's worth considering just using `pext` for left-packing without ever going SIMD. You could consider a LUT of 8-byte `pext` control vectors from 4 bits of mask. To speed unpacking the mask, maybe one `pdep` to expand a 32-bit mask to 64-bit, so you can just `movzx ecx, al` / `movzx edx, ah` / `shr rax, 16` without needing a separate AND for each one? Or just AND with odd/even masks once before shifting. – Peter Cordes May 16 '20 at 21:20
  • My 256-entry LUT of shuffles has worked well enough to achieve my performance goals for my proof of concept. My guess is that I am able to hide the latency because I am doing 4 of these operations at a time. I still have to try the other approaches you've mentioned. – Nick May 18 '20 at 02:34
  • Ok, if you do this often enough that (most of) a 256 x 16-byte LUT stays hot enough in L2 or L3, or maybe even L1d, then that's not a *huge* footprint. Some of the cost is going to be paid in cache misses *elsewhere*, though. Tradeoffs like this can't be microbenchmarked easily; a microbenchmark can spend its whole cache footprint on the LUT. But yeah, doing 4 independent LUT lookups helps a lot, giving you memory-level parallelism even if some of them miss in some levels of cache. – Peter Cordes May 18 '20 at 02:41
  • Yeah, I'm going to have to think about this more later down the line. Thanks for the advice! – Nick May 18 '20 at 05:38
  • You could I guess profile cache miss rates in the real application after dropping in a simple scalar implementation, to get some data without having to write multiple different SIMD versions. (And also overall time, to see how important this is in the grand scheme of things or for sub-tasks that involve it.) – Peter Cordes May 18 '20 at 05:42
  • Just for clarification: You want a packed 8x16bit blend? And your loop condition is supposed to be `i<8`? N.B. If you do a LUT-approach for the shuffling, you can use the shuffle-vector as blend-mask as well. Note that if the upper bit of a shuffle index is set, `pshufb` will set the corresponding element to zero -- which is fine, as you will blend-in the `a` element at these positions anyway. Also, if you do two 16-element LUTs, you can make the offset part of the LUT itself (requiring just a `paddb` to combine both LUT results). – chtz May 18 '20 at 08:02
  • One side is packed, the other side isn't. I'm flexible on the actual vector sizes. I could do a 4x16bit blend (pdep/pext) or a 16x16bit blend. The loop condition is supposed to be b<8 (fixed). Using the shuffle mask as the blend is a good idea! That's a good point about the 2 LUTs, I will have to compare that against the 1 big LUT approach. If the speed is similar that is much preferable, since then I don't need to worry about cache effects as much. – Nick May 19 '20 at 21:21
  • FYI the single 256-entry LUT approach and using the mask for both the shuffle and blend has worked extremely well. The two 16-entry LUT approach is ~10-20% slower because I'm processing MBs of data at a time. I imagine that there is a threshold where the two 16-entry LUTs do better, like less than a few 10s of KB. – Nick May 21 '20 at 23:06
  • @Nick `b<8` does not make sense, since `b` is a pointer (`*b <8` would work, but I still think you meant `i<8`). Maybe summarize your results as an answer (I would not expect many more ideas here now). – chtz May 25 '20 at 13:57
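
A sketch of what the single 256-entry-LUT variant described in the last comments might look like, where the same control vector drives both the shuffle and the blend: the 0x80 bytes that make pshufb zero a lane also carry the high bit that pblendvb keys on. The names and LUT builder are my own, and the full 16-byte load of b assumes over-reading is safe:

#include <immintrin.h>
#include <stdint.h>

// byte pair 2i, 2i+1 is 0x80,0x80 when mask bit i is set (zero the lane, take a),
// otherwise the byte indices of the next packed word of b
static uint8_t shuffleBlendLUT[256][16];

static void buildLUT(void) {
  for (int m = 0; m < 256; ++m) {
    int k = 0;                                   // running index into packed b
    for (int i = 0; i < 8; ++i) {
      if (m & (1 << i)) {
        shuffleBlendLUT[m][2*i]     = 0x80;
        shuffleBlendLUT[m][2*i + 1] = 0x80;
      } else {
        shuffleBlendLUT[m][2*i]     = (uint8_t)(2*k);
        shuffleBlendLUT[m][2*i + 1] = (uint8_t)(2*k + 1);
        ++k;
      }
    }
  }
}

// returns how many words of b were consumed
static inline int packedBlend16_lut(uint8_t mask, uint16_t* dst,
                                    uint16_t const* a, uint16_t const* b) {
  __m128i av   = _mm_loadu_si128((__m128i const*)a);
  __m128i bv   = _mm_loadu_si128((__m128i const*)b);        // may over-read unused b words
  __m128i ctrl = _mm_loadu_si128((__m128i const*)shuffleBlendLUT[mask]);

  __m128i bx = _mm_shuffle_epi8(bv, ctrl);                   // 0x80 lanes become zero
  __m128i r  = _mm_blendv_epi8(bx, av, ctrl);                // high bit set -> take the a element
  _mm_storeu_si128((__m128i*)dst, r);

  return 8 - _mm_popcnt_u32(mask);
}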
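
And a rough scalar sketch of the BMI2 route from the comments, handling 4 words (64 bits) per step with pdep instead of going SIMD at all; the function name is mine, it needs -mbmi2, and the unconditional 8-byte load of b assumes over-reading is safe:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// dst[0..3]: keep a[i] where mask bit i is set, else take the next packed word of b.
// Assumes little-endian (x86), so word i of a sits in bits 16*i..16*i+15 of a64.
// Returns how many words of b were consumed.
static inline int packedBlend4_bmi2(uint8_t mask4, uint16_t* dst,
                                    uint16_t const* a, uint16_t const* b) {
  uint64_t a64, b64;
  memcpy(&a64, a, 8);
  memcpy(&b64, b, 8);                                        // may over-read unused b words

  // replicate each of the 4 mask bits across its 16-bit lane
  uint64_t wordmask = _pdep_u64(mask4, 0x0001000100010001ull) * 0xFFFFull;

  // keep a where the bit is set; scatter the packed b words into the clear lanes
  uint64_t r = (a64 & wordmask) | _pdep_u64(b64, ~wordmask);
  memcpy(dst, &r, 8);

  return 4 - _mm_popcnt_u32(mask4 & 0xF);
}

A full 8-element blend would call this twice, advancing b by the returned count in between.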

0 Answers