I would like to implement the following function using SSE. It blends elements from a with packed elements from b, where b contains only the elements that are actually used.
void packedBlend16(uint8_t mask, uint16_t* dst, uint16_t const* a, uint16_t const* b) {
    for (int i = 0; i < 8; ++i) {
        // A set bit selects a[i]; a clear bit consumes the next packed element of b.
        int const control = mask & (1 << i);
        dst[i] = control ? a[i] : *b++;
    }
}
The tricky part for me is spacing out the elements of b correctly in the vector.
So far, my approach is:
- 256 entry LUT[mask] to get a shuffle that expands the elements of b in the right place with pshufb
- Construct a blend vector from the mask with vpbroadcastw + vpand + vpcmpeqw
- pblendvb a with the shuffled elements of b
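For reference, the three steps above can be sketched roughly as follows. This is only an illustration under some assumptions: the names shuffle_lut, init_shuffle_lut and packedBlend16_sse are made up for this sketch, the broadcast is written as _mm_set1_epi16 rather than an explicit vpbroadcastw, the target attribute is GCC/Clang specific, and b is assumed to be readable for a full 16 bytes even when fewer elements are used.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical table: one 16-byte pshufb control per 8-bit mask (4 KiB total). */
static uint8_t shuffle_lut[256][16];

static void init_shuffle_lut(void) {
    for (int m = 0; m < 256; ++m) {
        int next = 0; /* index of the next unused packed element of b */
        for (int i = 0; i < 8; ++i) {
            if (m & (1 << i)) {
                /* Lane i comes from a: 0x80 makes pshufb write zero here. */
                shuffle_lut[m][2 * i]     = 0x80;
                shuffle_lut[m][2 * i + 1] = 0x80;
            } else {
                /* Lane i takes both bytes of packed element `next` of b. */
                shuffle_lut[m][2 * i]     = (uint8_t)(2 * next);
                shuffle_lut[m][2 * i + 1] = (uint8_t)(2 * next + 1);
                ++next;
            }
        }
    }
}

/* Assumes GCC or Clang; the attribute enables SSSE3/SSE4.1 for this function only. */
static __attribute__((target("sse4.1")))
void packedBlend16_sse(uint8_t mask, uint16_t* dst,
                       uint16_t const* a, uint16_t const* b) {
    __m128i const bits = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    __m128i va   = _mm_loadu_si128((__m128i const*)a);
    __m128i vb   = _mm_loadu_si128((__m128i const*)b); /* assumes 16 readable bytes */
    __m128i ctrl = _mm_loadu_si128((__m128i const*)shuffle_lut[mask]);

    /* Step 1: spread the packed elements of b into their destination lanes. */
    __m128i spread = _mm_shuffle_epi8(vb, ctrl);

    /* Step 2: all-ones in every 16-bit lane whose mask bit is set. */
    __m128i sel = _mm_cmpeq_epi16(_mm_and_si128(_mm_set1_epi16(mask), bits), bits);

    /* Step 3: take a where the bit is set, the spread b elements elsewhere. */
    _mm_storeu_si128((__m128i*)dst, _mm_blendv_epi8(spread, va, sel));
}
```

init_shuffle_lut() has to run once before the first call; in practice the table could just as well be a precomputed constant array.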
I suspect that a 256-entry LUT is not the best approach. I could potentially use two 16-entry LUTs, one for the upper and one for the lower 4 bits of the mask, but then I would have to add an offset to the upper LUT's result based on the popcnt of the lower 4 bits of the mask.
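A rough sketch of what the two-table variant might look like, to make the offset concrete. The names nibble_lut, init_nibble_lut and packedBlend16_nibbles are hypothetical, __builtin_popcount and the target attribute assume GCC or Clang, and b is again assumed to be readable for 16 bytes. Since a clear bit is what consumes an element of b, the offset works out to 2 * (4 - popcnt(low nibble)) bytes. Adding it to the 0x80 "zero this byte" entries is harmless because the offset is at most 8, so bit 7 stays set.

```c
#include <stdint.h>
#include <immintrin.h>

/* Hypothetical table: pshufb control bytes for 4 lanes per 4-bit mask (128 bytes). */
static uint8_t nibble_lut[16][8];

static void init_nibble_lut(void) {
    for (int m = 0; m < 16; ++m) {
        int next = 0; /* next unused packed element within this half */
        for (int i = 0; i < 4; ++i) {
            int from_a = m & (1 << i);
            nibble_lut[m][2 * i]     = from_a ? 0x80 : (uint8_t)(2 * next);
            nibble_lut[m][2 * i + 1] = from_a ? 0x80 : (uint8_t)(2 * next + 1);
            if (!from_a) ++next;
        }
    }
}

/* Assumes GCC or Clang; the attribute enables SSSE3/SSE4.1 for this function only. */
static __attribute__((target("sse4.1")))
void packedBlend16_nibbles(uint8_t mask, uint16_t* dst,
                           uint16_t const* a, uint16_t const* b) {
    int lo_bits = mask & 15;
    int hi_bits = mask >> 4;
    /* Number of elements of b already consumed by lanes 0..3. */
    int used = 4 - __builtin_popcount((unsigned)lo_bits);

    __m128i lo = _mm_loadl_epi64((__m128i const*)nibble_lut[lo_bits]);
    __m128i hi = _mm_loadl_epi64((__m128i const*)nibble_lut[hi_bits]);
    /* Shift the upper half's byte indices past the elements used by the lower half. */
    hi = _mm_add_epi8(hi, _mm_set1_epi8((char)(2 * used)));
    __m128i ctrl = _mm_unpacklo_epi64(lo, hi);

    __m128i const bits = _mm_setr_epi16(1, 2, 4, 8, 16, 32, 64, 128);
    __m128i va = _mm_loadu_si128((__m128i const*)a);
    __m128i vb = _mm_loadu_si128((__m128i const*)b); /* assumes 16 readable bytes */
    __m128i spread = _mm_shuffle_epi8(vb, ctrl);
    __m128i sel = _mm_cmpeq_epi16(_mm_and_si128(_mm_set1_epi16(mask), bits), bits);
    _mm_storeu_si128((__m128i*)dst, _mm_blendv_epi8(spread, va, sel));
}
```

This trades the 4 KiB table for 128 bytes at the cost of a popcount, two 8-byte loads, an add and an unpack per call, so whether it is a win would need measuring.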
I'm doing 4 of these independently at a time, so I want to maximize throughput, but can accept latency.
Are there other approaches that I could take?