So each element of your result vector is a masked sum of the input vector? And those masks come from columns of the matrix, so they're not contiguous bits.
A masked sum using a contiguous bitmap is trivial with AVX512 (just use a merge-masked add, or zero-masked loads). With SSE/AVX2 you'd use an inverse movemask (see the Q&A "is there an inverse instruction to the movemask instruction in intel avx2?") + `_mm256_and_ps`, or some variation on that which optimizes across mask vectors, e.g. a 32-bit broadcast load and then shifting it for the next step, instead of doing another unaligned dword broadcast for each byte.
But with your mask bits not contiguous you have a choice:
- Do each output vector element separately, with a horizontal sum at the end. Requires gathering bits and making a vector mask. Probably hard except for the `M = 32` case where the bit stride already lines them up with contiguous 32-bit floats.
- Accumulate a vector of 4 or 8 output elements, using contiguous groups of 4 or 8 mask bits. So you vectorize over the outer loop, doing broadcast loads in the inner loop over the input vector. USE THIS. You should actually unroll with multiple vector sums to hide FP add latency.
Broadcast-loads like `__m256 v = _mm256_set1_ps(invec[i])` are basically free (`vbroadcastss` is a pure load uop, no ALU shuffle uop). You don't need any other shuffling of floats, just pure vertical SIMD, even at the end of the loop: you just `_mm256_storeu_ps` into the output vector.
And you're using contiguous groups of mask bits so the usual inverse-movemask Q&As are useful.
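For example, a minimal sketch of what `inverse_movemask` (the helper the outline below calls) could look like, using the broadcast + AND + compare trick from those Q&As; it needs AVX2, and other variants (e.g. a broadcast + variable shift) work too:

```c++
#include <immintrin.h>
#include <stdint.h>

// One possible inverse_movemask: broadcast the byte, isolate one bit per lane,
// and compare so each set bit becomes an all-ones 32-bit lane (AVX2).
static inline __m256 inverse_movemask(uint8_t bits)
{
    __m256i vbits     = _mm256_set1_epi32(bits);
    __m256i selectbit = _mm256_setr_epi32(1, 2, 4, 8, 16, 32, 64, 128);  // bit n -> lane n
    __m256i selected  = _mm256_and_si256(vbits, selectbit);
    __m256i maskvec   = _mm256_cmpeq_epi32(selected, selectbit);         // all-ones where the bit was set
    return _mm256_castsi256_ps(maskvec);
}
```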
```c++
// untested, rough outline of what it might look like.
// Assumes the NxM bit matrix is stored row-major: one row of M bits per input
// element, so the 8 mask bits for outputs j..j+7 at input i are the byte
// matrix[i*stride + j/8].  M is treated as a multiple of 8*unroll here.
uint8_t matrix[N*M/8];        // bit matrix, packed 8 bits per byte
float invec[N], outvec[M];    // A normal function will just take pointer inputs.
const int stride = M/8;       // bytes per bitmap row
constexpr int unroll = 4;

for (int outpos = 0 ; outpos < M - 8*unroll + 1 ; outpos += 8*unroll) {
    __m256 sum0, sum1, sum2, sum3;   // optionally use an array of accumulators, sums[unroll];
    sum0 = sum1 = sum2 = sum3 = _mm256_setzero_ps();
    // optionally peel the first inner iteration to just load+mask without adding to 0.0

    for (int inpos = 0 ; inpos < N ; inpos++) {
        __m256 inv   = _mm256_set1_ps(invec[inpos]);
        __m256 mask0 = inverse_movemask(matrix[inpos*stride + outpos/8 + 0]);  // 8 bits -> 8 vector elements
        __m256 mask1 = inverse_movemask(matrix[inpos*stride + outpos/8 + 1]);
        // ... mask2, mask3
        sum0 = _mm256_add_ps(sum0, _mm256_and_ps(inv, mask0));   // add invec[inpos] or 0.0 according to mask
        sum1 = _mm256_add_ps(sum1, _mm256_and_ps(inv, mask1));
        // ... sum2, sum3
    }
    _mm256_storeu_ps(&outvec[outpos + 0*8], sum0);
    _mm256_storeu_ps(&outvec[outpos + 1*8], sum1);
    _mm256_storeu_ps(&outvec[outpos + 2*8], sum2);
    // ... sum3
}
```
Then you'll want a not-unrolled `__m256` and/or `__m128` cleanup loop for `M % (8*unroll) != 0`, and cleanup for `M % 4 != 0` using `__m128` broadcast loads for the last 1..3 output elements; for those, maybe use a masked store (AVX `vmaskmovps`) or pad your output vector.
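As a rough illustration of that cleanup, here's what the `__m128` stage could look like; `inverse_movemask4` is a hypothetical 4-bit sibling of the helper above, and the indexing keeps the same row-major bit layout assumed in the outline:

```c++
// Hypothetical 4-bit inverse_movemask for the __m128 cleanup (SSE2).
static inline __m128 inverse_movemask4(unsigned bits)
{
    __m128i vbits     = _mm_set1_epi32(bits);
    __m128i selectbit = _mm_setr_epi32(1, 2, 4, 8);
    __m128i maskvec   = _mm_cmpeq_epi32(_mm_and_si128(vbits, selectbit), selectbit);
    return _mm_castsi128_ps(maskvec);
}

// Handle leftover outputs 4 at a time (a non-unrolled __m256 stage could run before this).
for (int outpos = M - M % (8*unroll) ; outpos + 4 <= M ; outpos += 4) {
    __m128 sum = _mm_setzero_ps();
    for (int inpos = 0 ; inpos < N ; inpos++) {
        // outpos is a multiple of 4, so the 4 bits sit at offset 0 or 4 within one byte
        unsigned bits = (matrix[inpos*stride + outpos/8] >> (outpos % 8)) & 0xF;
        __m128 inv = _mm_set1_ps(invec[inpos]);
        sum = _mm_add_ps(sum, _mm_and_ps(inv, inverse_movemask4(bits)));
    }
    _mm_storeu_ps(&outvec[outpos], sum);
}
```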
Each inner loop iteration masks one float `8 * unroll` different ways, and accumulates into the corresponding `8 * unroll` different running totals (across `unroll` vectors of 8 floats each).
This is also great for memory bandwidth: you only ever read each bitmap bit once in a vec*mat product, but the input vector is effectively used `M` times. Looping over contiguous bitmap rows gives good locality, not requiring any of those cache lines to be loaded more than once. Even with AVX512 and 2x `_mm512_mask_add_ps` per clock, 1 bit per FP element added is not much bandwidth for the bitmap loads.

However, you loop over your input vector `M/(8*unroll)` times. The masked adds for each sum vector use different mask bits but the same broadcasted input float. Since the matrix elements are 32x smaller than the vector elements, this is not bad. One float loaded per 4x or 8x `vaddps` instructions is very good computational intensity, especially without AVX512, where bitmap -> vector mask will cost cycles.
To help even more with cache / memory bandwidth, cache-blocking / loop-tiling for L2 cache size (256kiB) might help with reuse of input vector elements. But I'm not sure if you can usefully block for both input and output: unlike a mat*mat product, there's only O(n^2) work to do. Rereading the input and just writing one output stream is probably fine, but you could find a middle ground that adds partial results into partial chunks of the output vector. Then you aren't reading the bit matrix in one contiguous stream anymore, though; as long as you stop at cache-line boundaries it's probably fine.
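If you do experiment with that, the blocking could be as simple as an extra outer loop over chunks of the input, accumulating partial sums into `outvec` (sketch only; `IN_TILE` is a made-up tile size you'd tune):

```c++
// Hypothetical loop-tiling over the input: a chunk of invec (and the matching
// stripe of the bitmap) stays hot while you sweep all output blocks, and
// partial sums accumulate into outvec across input chunks, so zero it first.
const int IN_TILE = 4096;                       // made-up; tune for L2
memset(outvec, 0, M * sizeof(float));
for (int in_block = 0 ; in_block < N ; in_block += IN_TILE) {
    int in_end = (in_block + IN_TILE < N) ? in_block + IN_TILE : N;
    for (int outpos = 0 ; outpos < M ; outpos += 8*unroll) {
        // same unrolled inner loop as before, but:
        //  - inpos runs over in_block .. in_end
        //  - load the existing partial sums from outvec, add, and store back
    }
}
```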
If your `NxM` matrix happens to have `M = 32` then that matches the size of a `float` exactly, and `_mm256_loadu_si256` will get a vector that has the mask bits for `outvec[0]` in the low bit of every element, and the mask bits for `outvec[31]` in the high bit. You can use `_mm256_blendv_ps` to apply them to the input of a sum, and left-shift by 1 to move the next bit up to the top position. (An alternative to `vblendvps` is `psrad` by 31 + `andps`: an arithmetic right shift to broadcast the top bit to all positions.)

But this might not be any better than the other way, even for this special case. You can unroll over multiple output elements in different vectors so you can reuse the float vector a few times.
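For reference, a rough sketch of that `M = 32` variant using the `psrad`-by-31 + `andps` alternative, doing two output elements per pass so the loaded float and mask vectors get reused; it assumes `N` is a multiple of 8, a `uint32_t matrix32[N]` view of the bitmap (one 32-bit row per input element), and some 8-float horizontal-sum helper `hsum256_ps`:

```c++
// Hypothetical M == 32 version: shift the wanted bit to the sign position,
// then arithmetic-shift right by 31 to broadcast it into a full-lane AND mask.
for (int j = 31 ; j >= 1 ; j -= 2) {               // two outputs per pass
    __m256 sum_hi = _mm256_setzero_ps();           // accumulates outvec[j]
    __m256 sum_lo = _mm256_setzero_ps();           // accumulates outvec[j-1]
    for (int i = 0 ; i < N ; i += 8) {
        __m256  v    = _mm256_loadu_ps(&invec[i]);
        __m256i rows = _mm256_loadu_si256((const __m256i*)&matrix32[i]);   // 8 rows of 32 mask bits
        rows = _mm256_slli_epi32(rows, 31 - j);                            // bit j -> sign bit
        __m256i m_hi = _mm256_srai_epi32(rows, 31);                        // all-ones or all-zero per lane
        __m256i m_lo = _mm256_srai_epi32(_mm256_slli_epi32(rows, 1), 31);  // same for bit j-1
        sum_hi = _mm256_add_ps(sum_hi, _mm256_and_ps(v, _mm256_castsi256_ps(m_hi)));
        sum_lo = _mm256_add_ps(sum_lo, _mm256_and_ps(v, _mm256_castsi256_ps(m_lo)));
    }
    outvec[j]   = hsum256_ps(sum_hi);              // horizontal sum of 8 floats
    outvec[j-1] = hsum256_ps(sum_lo);
}
```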
With AVX512F you can just use 16-bit chunks of the matrix rows as `__mmask16` values for a masked add like `_mm512_mask_add_ps`:

```c++
// merge-masking: adds the broadcasted input where a mask bit is set, else keeps sum unchanged
sum = _mm512_mask_add_ps(sum, matrix16[inpos*stride16 + outpos/16], sum, _mm512_set1_ps(invec[inpos]));
```

if the matrix is viewed as an array of `uint16_t` (`matrix16` here, with `stride16 = M/16` elements per bitmap row). Note the input float does need broadcasting to a `__m512` for the add.
Or with AVX512BW, `kmovq` 64 bits of mask into a `k` register and `kshift` it down, to match up with an unroll over 4 vector accumulators. Unfortunately `kmov k, [mem]` is 2 uops on Skylake-X: load + port 5, not just a load uop that can write to mask regs. So one load and 3x unpack with `kshift` is a pure win vs. 4x `kmovw k1, [mem]` / `kmovw k2, [mem+2]` etc. There's no way to get each 16 bits of mask data to the bottom of a `k` register without a port-5 uop for each one, so it competes with 512-bit FMA/add/mul throughput on SKX cores that have 2 FMA units; otherwise it's just a front-end throughput cost.