So each element of your result vector is a masked sum of the input vector? And those masks come from columns of the matrix, so they're not contiguous bits.
A masked sum using a contiguous bitmap is trivial with AVX512 (just use a merge-masked add, or zero-masked loads). With SSE/AVX2 you'd use an inverse movemask (see the Q&A "is there an inverse instruction to the movemask instruction in intel avx2?") + `_mm256_and_ps`, or some variation on that which optimizes across mask vectors, e.g. a 32-bit broadcast load and then shifting it for the next step, instead of doing another unaligned dword broadcast for each byte.
But with your mask bits not contiguous you have a choice:
- Do each output vector element separately, with a horizontal sum at the end. Requires gathering bits and making a vector mask. Probably hard except for the `M = 32` case where the bit stride already lines them up with contiguous 32-bit floats.
- Accumulate a vector of 4 or 8 output elements, using contiguous groups of 4 or 8 mask bits. So you vectorize over the outer loop, doing broadcast loads in the inner loop over the input vector. USE THIS. You should actually unroll with multiple vector sums to hide FP add latency.
Broadcast-loads like `__m256 v = _mm256_set1_ps(invec[i])` are basically free (`vbroadcastss` is a pure load uop, no ALU shuffle uop). You don't need any other shuffling of floats, just pure vertical SIMD, even at the end of the loop: you just `_mm256_storeu_ps` into the output vector.
And you're using contiguous groups of mask bits so the usual inverse-movemask Q&As are useful.
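For example, a minimal sketch of what `inverse_movemask` (the helper the outline below calls) could look like, using the broadcast + AND + compare trick from those Q&As; it needs AVX2, and other variants (e.g. a broadcast + variable shift) work too:

```c++
#include <immintrin.h>
#include <stdint.h>

// One possible inverse_movemask: broadcast the byte, isolate one bit per lane,
// and compare so each set bit becomes an all-ones 32-bit lane (AVX2).
static inline __m256 inverse_movemask(uint8_t bits)
{
    __m256i vbits     = _mm256_set1_epi32(bits);
    __m256i selectbit = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);  // bit n -> lane n
    __m256i selected  = _mm256_and_si256(vbits, selectbit);
    __m256i maskvec   = _mm256_cmpeq_epi32(selected, selectbit);         // all-ones where the bit was set
    return _mm256_castsi256_ps(maskvec);
}
```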
```c++
// untested, rough outline of what it might look like.
// Assumes the NxM bit matrix is stored row-major: one row of M bits per input
// element, so the 8 mask bits for outputs j..j+7 at input i are the byte
// matrix[i*stride + j/8].  M is treated as a multiple of 8*unroll here.
uint8_t matrix[N*M/8];        // bit matrix, packed 8 bits per byte
float invec[N], outvec[M];    // A normal function will just take pointer inputs.
const int stride = M/8;       // bytes per bitmap row
constexpr int unroll = 4;

for (int outpos = 0 ; outpos < M - 8*unroll + 1 ; outpos += 8*unroll) {
    __m256 sum0, sum1, sum2, sum3;   // optionally use an array of accumulators, sums[unroll];
    sum0 = sum1 = sum2 = sum3 = _mm256_setzero_ps();
    // optionally peel the first inner iteration to just load+mask without adding to 0.0

    for (int inpos = 0 ; inpos < N ; inpos++) {
        __m256 inv   = _mm256_set1_ps(invec[inpos]);
        __m256 mask0 = inverse_movemask(matrix[inpos*stride + outpos/8 + 0]);  // 8 bits -> 8 vector elements
        __m256 mask1 = inverse_movemask(matrix[inpos*stride + outpos/8 + 1]);
        // ... mask2, mask3
        sum0 = _mm256_add_ps(sum0, _mm256_and_ps(inv, mask0));   // add invec[inpos] or 0.0 according to mask
        sum1 = _mm256_add_ps(sum1, _mm256_and_ps(inv, mask1));
        // ... sum2, sum3
    }
    _mm256_storeu_ps(&outvec[outpos + 0*8], sum0);
    _mm256_storeu_ps(&outvec[outpos + 1*8], sum1);
    _mm256_storeu_ps(&outvec[outpos + 2*8], sum2);
    // ... sum3
}
```
Then you'll want a not-unrolled `__m256` and/or `__m128` cleanup loop for `M % (8*unroll) != 0`, and cleanup for `M % 4 != 0` using `__m128` broadcast loads for the last 1..3 output elements; for those, maybe use a masked store (AVX `vmaskmovps`) or pad your output vector.
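As a rough illustration of that cleanup, here's what the `__m128` stage could look like; `inverse_movemask4` is a hypothetical 4-bit sibling of the helper above, and the indexing keeps the same row-major bit layout assumed in the outline:

```c++
// Hypothetical 4-bit inverse_movemask for the __m128 cleanup (SSE2).
static inline __m128 inverse_movemask4(unsigned bits)
{
    __m128i vbits     = _mm_set1_epi32(bits);
    __m128i selectbit = _mm_setr_epi32(1, 2, 4, 8);
    __m128i maskvec   = _mm_cmpeq_epi32(_mm_and_si128(vbits, selectbit), selectbit);
    return _mm_castsi128_ps(maskvec);
}

// Handle leftover outputs 4 at a time (a non-unrolled __m256 stage could run before this).
for (int outpos = M - M % (8*unroll) ; outpos + 4 <= M ; outpos += 4) {
    __m128 sum = _mm_setzero_ps();
    for (int inpos = 0 ; inpos < N ; inpos++) {
        // outpos is a multiple of 4, so the 4 bits sit at offset 0 or 4 within one byte
        unsigned bits = (matrix[inpos*stride + outpos/8] >> (outpos % 8)) & 0xF;
        __m128 inv = _mm_set1_ps(invec[inpos]);
        sum = _mm_add_ps(sum, _mm_and_ps(inv, inverse_movemask4(bits)));
    }
    _mm_storeu_ps(&outvec[outpos], sum);
}
```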
Each inner loop iteration masks one float `8 * unroll` different ways, and accumulates into the corresponding `8 * unroll` different running totals (across `unroll` vectors of 8 floats each).
This is also great for memory bandwidth: you only ever read each bitmap bit once in a vec*mat product, but the input vector is effectively used `M` times. Looping over contiguous bitmap rows gives good locality, not requiring any of those cache lines to be loaded more than once. Even with AVX512 and 2x `_mm512_mask_add_ps` per clock, 1 bit per FP element added is not much bandwidth for the bitmap loads.

However, you loop over your input vector `M/(8*unroll)` times. The masked adds for each sum vector use different mask bits but the same broadcasted input float. Since the matrix elements are 32x smaller than the vector elements, this is not bad. One float loaded per 4x or 8x `vaddps` instructions is very good computational intensity, especially without AVX512, where bitmap -> vector mask will cost cycles.
To help even more with cache / memory bandwidth, cache-blocking / loop-tiling for L2 cache size (256kiB) might help with reuse of input vector elements. But I'm not sure if you can usefully block for both input and output: unlike a mat*mat product, there's only O(n^2) work to do. Rereading the input and just writing one output stream is probably fine, but you could find a middle ground that adds partial results into partial chunks of the output vector. Then you aren't reading the bit matrix in one contiguous stream anymore, though; as long as you stop at cache-line boundaries it's probably fine.
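If you do experiment with that, the blocking could be as simple as an extra outer loop over chunks of the input, accumulating partial sums into `outvec` (sketch only; `IN_TILE` is a made-up tile size you'd tune):

```c++
// Hypothetical loop-tiling over the input: a chunk of invec (and the matching
// stripe of the bitmap) stays hot while you sweep all output blocks, and
// partial sums accumulate into outvec across input chunks, so zero it first.
const int IN_TILE = 4096;                       // made-up; tune for L2
memset(outvec, 0, M * sizeof(float));
for (int in_block = 0 ; in_block < N ; in_block += IN_TILE) {
    int in_end = (in_block + IN_TILE < N) ? in_block + IN_TILE : N;
    for (int outpos = 0 ; outpos < M ; outpos += 8*unroll) {
        // same unrolled inner loop as before, but:
        //  - inpos runs over in_block .. in_end
        //  - load the existing partial sums from outvec, add, and store back
    }
}
```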
If your `NxM` matrix happens to have `M = 32` then that matches the size of a `float` exactly, and `_mm256_loadu_si256` will get a vector that has the mask bits for `outvec[0]` in the low bit of every element, and the mask bits for `outvec[31]` in the high bit. You can use `_mm256_blendv_ps` to apply them to the input of a sum, and left-shift by 1 to move the next bit up to the top position. (An alternative to `vblendvps` is `psrad` by 31 + `andps`: an arithmetic right shift to broadcast the top bit to all positions.)

But this might not be any better than the other way, even for this special case. You can unroll over multiple output elements in different vectors so you can reuse the float vector a few times.
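For reference, a rough sketch of that `M = 32` variant using the `psrad`-by-31 + `andps` alternative, doing two output elements per pass so the loaded float and mask vectors get reused; it assumes `N` is a multiple of 8, a `uint32_t matrix32[N]` view of the bitmap (one 32-bit row per input element), and some 8-float horizontal-sum helper `hsum256_ps`:

```c++
// Hypothetical M == 32 version: shift the wanted bit to the sign position,
// then arithmetic-shift right by 31 to broadcast it into a full-lane AND mask.
for (int j = 31 ; j >= 1 ; j -= 2) {               // two outputs per pass
    __m256 sum_hi = _mm256_setzero_ps();           // accumulates outvec[j]
    __m256 sum_lo = _mm256_setzero_ps();           // accumulates outvec[j-1]
    for (int i = 0 ; i < N ; i += 8) {
        __m256  v    = _mm256_loadu_ps(&invec[i]);
        __m256i rows = _mm256_loadu_si256((const __m256i*)&matrix32[i]);   // 8 rows of 32 mask bits
        rows = _mm256_slli_epi32(rows, 31 - j);                            // bit j -> sign bit
        __m256i m_hi = _mm256_srai_epi32(rows, 31);                        // all-ones or all-zero per lane
        __m256i m_lo = _mm256_srai_epi32(_mm256_slli_epi32(rows, 1), 31);  // same for bit j-1
        sum_hi = _mm256_add_ps(sum_hi, _mm256_and_ps(v, _mm256_castsi256_ps(m_hi)));
        sum_lo = _mm256_add_ps(sum_lo, _mm256_and_ps(v, _mm256_castsi256_ps(m_lo)));
    }
    outvec[j]   = hsum256_ps(sum_hi);              // horizontal sum of 8 floats
    outvec[j-1] = hsum256_ps(sum_lo);
}
```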
With AVX512F you can just use 16-bit chunks of the matrix rows as `__mmask16` values for a masked add like `_mm512_mask_add_ps`:

```c++
// merge-masking: adds the broadcasted input where a mask bit is set, else keeps sum unchanged
sum = _mm512_mask_add_ps(sum, matrix16[inpos*stride16 + outpos/16], sum, _mm512_set1_ps(invec[inpos]));
```

if the matrix is viewed as an array of `uint16_t` (`matrix16` here, with `stride16 = M/16` elements per bitmap row). Note the input float does need broadcasting to a `__m512` for the add.
Or with AVX512BW, `kmovq` 64 bits of mask into a `k` register and `kshift` it down, to match up with an unroll over 4 vector accumulators. Unfortunately `kmov k, [mem]` is 2 uops on Skylake-X: load + port 5, not just a load uop that can write to mask regs. So one load and 3x unpack with `kshift` is a pure win vs. 4x `kmovw k1, [mem]` / `kmovw k2, [mem+2]` etc. There's no way to get each 16 bits of mask data to the bottom of a `k` register without a port-5 uop for each one, so it competes with 512-bit FMA/add/mul throughput on SKX cores that have 2 FMA units; otherwise it's just a front-end throughput cost.