9

I want to find the fastest possible way of multiplying two small boolean matrices, where small means 8x8, 9x9, ..., 16x16. This routine will be used a lot, so it needs to be very efficient; please don't suggest that the straightforward solution should be fast enough.

For the special cases 8x8 and 16x16 I already have fairly efficient implementations, based on the solution found here, where we treat the entire matrix as a uint64_t or a uint64_t[4] respectively. On my machine this is roughly 70-80 times faster than the straightforward implementation.

However, in the case of 8 < k < 16, I don't really know how to leverage any reasonable representation to enable clever tricks like those above.

So basically, I'm open to any suggestions using any kind of representation (of the matrices) and function signature. You may assume that this targets either a 32-bit or 64-bit architecture (pick whichever best suits your suggestion).

hakoja
  • How come we can choose 32-bit or 64-bit? – John Dvorak Jan 29 '13 at 10:43
    @Dvorak It was just to not limit any answers. If you have a very clever way of doing this, but it requires 64-bit, please just use that :) – hakoja Jan 29 '13 at 10:46
    Can you clarify what you mean by multiplying boolean matrices? Are you talking about doing modulo-2 arithmetic? – Oliver Charlesworth Jan 29 '13 at 10:50
  • @Oli Yes - the matrices consist only of binary values, so for all operations you can just use bit operations. – hakoja Jan 29 '13 at 10:54
  • are the matrices sparse? – Josh Petitt Jan 29 '13 at 17:18
  • please show the type of your arrays. – Josh Petitt Jan 29 '13 at 17:20
  • can you give a sample of the multiplication and result output? After element-wise multiplication of the column vector by the row vector, are you performing a sum (i.e. dot product)? Or are you expecting the result to also be a boolean matrix (i.e. elements are either 1 or 0)? – Josh Petitt Jan 29 '13 at 17:27
  • @Josh No the matrices are not (necessarily) sparse, but you can assume they are always invertible (if that is of any help). The expected result is a new k x k boolean matrix yes. I.e. I want a completely normal matrix multiplication of two boolean matrices (i.e. the elements are in GF(2)). Part of the question is finding a good representation of these matrices, enabling efficient computation, so I haven't put any criteria on the types you chose. Neither do I care if you chose column-major or row-major ordering. Do what's easiest for you, I can always extract out the essential ideas anyway :) – hakoja Jan 29 '13 at 17:37
  • @hakoja, AFAIK, a "completely normal" matrix multiplication of a "boolean" matrix does not result in a boolean matrix. Consider the matrix multiplication of 2x2 matrix A = [[1, 1], [1, 1]] and B = [[1, 1], [1, 1]], the result of A*B = [[2, 2], [2, 2]], not a "boolean" matrix? I don't completely understand the GF(2) requirement. I googled it, but don't understand how it applies to your problem. Can you show a two sample matrices and the result you expect? – Josh Petitt Jan 29 '13 at 21:17
  • @hakoja, in my experience implementing "fast" matrix multiplication is not necessarily about understanding the math as much as understanding the machine, and the problem domain. That's why I'm asking the questions. – Josh Petitt Jan 29 '13 at 21:19
  • @hakoja, for example, in the right situation, if you have two unsigned int values that represent a row of A and a column of B, and the problem domain enforces that the values of each element of a_row and b_col are only 0 or 1, and dot(a_row, b_col) always equals either 0 or 1 (this would mean matrices with only certain allowable values), then to multiply a 32 element vector is as simple as (0 < (a_row & b_col)). Very fast, no multiplication operation. – Josh Petitt Jan 29 '13 at 21:28
  • @Josh Thank you for showing interest in my problem :) and sorry if I wasn't clear. GF(2) is just another way of saying "do every operation modulo 2", so e.g. addition = XOR and mult. = AND, thus in your example A * B would in fact be [[0,0],[0,0]]. [The wikipedia link](http://en.wikipedia.org/wiki/GF(2)) explains it well. And - yes: you are right that we won't use "ordinary" multiplication, but use the logical AND instead (since they are equivalent). However, my question goes somewhat beyond this. See the link in the OP for an example of how we can exploit that we work with boolean matrices. – hakoja Jan 29 '13 at 22:49
  • You may find some inspiration here https://stackoverflow.com/questions/18447321/binary-matrix-multiplication-bit-twiddling-hack – Lu4 Jul 14 '21 at 09:49

4 Answers

7

Given two 4x4 matrices a= 0010,0100,1111,0001, b=1100,0001,0100,0100, one could first calculate the transpose b' = 1000,1011,0000,0100.

Then each element of the resulting matrix M = a x b (mod 2) is M(i,j) = popcount(a[i] & b'[j]) & 1; // i.e. the parity of the bitwise AND

From that one can see that the complexity grows only as n^2, as long as the bit vector fits in a computer word.
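As a minimal sketch of this scheme in portable C (the helper names are mine, not from the answer), for n <= 8 with the rows of a and the rows of the transpose b' packed one per byte:

```c
#include <stdint.h>

/* Parity (popcount & 1) of an 8-bit value. */
static uint8_t parity8(uint8_t x)
{
    x ^= x >> 4; x ^= x >> 2; x ^= x >> 1;
    return x & 1;
}

/* GF(2) product of two n x n bit matrices, n <= 8.
 * a[i]  = row i of A, one bit per column;
 * bt[j] = row j of B' (i.e. column j of B);
 * m[i]  = row i of the product A*B. */
void mul_gf2_rows(int n, const uint8_t a[], const uint8_t bt[], uint8_t m[])
{
    for (int i = 0; i < n; ++i) {
        uint8_t row = 0;
        for (int j = 0; j < n; ++j)
            row |= (uint8_t)(parity8(a[i] & bt[j]) << j);
        m[i] = row;
    }
}
```

Multiplying by the identity (whose transpose is itself) returns the other matrix unchanged, which makes a convenient sanity check.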

This can be sped up for 8x8 matrices at least, provided that some special permutation and bit-selection operations are available. One iterates exactly N times with NxN bits in a vector (so 16x16 is pretty much the limit).

Each step consists of accumulating: Result(n+1) = Result(n) XOR (A(n) & B(n)), where Result(0) = 0, A(n) is A <<< n ('<<<' being a column-wise rotation of elements), '&' is element-wise AND, and B(n) replicates diagonal n of the matrix B across its rows:

    a b c          a e i          d h c          g b f
B=  d e f  B(0) =  a e i  B(1) =  d h c   B(2) = g b f
    g h i          a e i          d h c          g b f

And after thinking about it a bit further, a better option is to ^^^ (row-wise rotate) matrix B and build A(n) by replicating diagonals of A down the columns:

    a b c         a a a           b b b           c c c 
A=  d e f  A(0) = e e e , A(1) =  f f f,  A(2) =  d d d 
    g h i         i i i           g g g           h h h 

EDIT To benefit later readers, here is the full solution for W <= 16-bit matrix multiplications in portable C.

#include <stdint.h>

#define W 16   /* matrix dimension, at most 16 */

void matrix_mul_gf2(uint16_t *a, uint16_t *b, uint16_t *c)
{
    // these arrays can be read in two successive xmm registers or in a single ymm
    uint16_t D[16];      // Temporary
    uint16_t C[16]={0};  // result
    uint16_t B[16];  
    uint16_t A[16];
    int i,j;
    uint16_t top_row;
    // Preprocess B (while reading from input) 
    // -- "un-tilt" the diagonal to bit position 0x8000
    for (i=0;i<W;i++) B[i]=(b[i]<<i) | (b[i]>>(W-i));
    for (i=0;i<W;i++) A[i]=a[i];  // Just read in matrix 'a'
    // Loop W times
    // Can be parallelized 4x with MMX, 8x with XMM and 16x with YMM instructions
    for (j=0;j<W;j++) {
        for (i=0;i<W;i++) D[i]=((int16_t)B[i])>>15;  // copy sign bit to rows
        for (i=0;i<W;i++) B[i]<<=1;                  // Prepare B for next round
        for (i=0;i<W;i++) C[i]^= A[i]&D[i];          // Add the partial product

        top_row=A[0];
        for (i=0;i<W-1;i++) A[i]=A[i+1];
        A[W-1]=top_row;
    }
    for (i=0;i<W;i++) c[i]=C[i];      // return result
}
Aki Suihkonen
  • Thank you for your answer, however since Sjoerd's was first to suggest the use of transposed matrices, I'll accept his answer. – hakoja Jan 30 '13 at 16:00
    No problem with that -- although I'd consider at the moment the fully parallel version that would use ~W*16 xmm instructions outperforming the W^2*K approach suggested by me in the first revision. – Aki Suihkonen Jan 30 '13 at 17:01
    @Hakoja Feel free to move the accepted answer to this answer. It covers more ground than mine and has example code, so is much more complete. – Sjoerd Feb 03 '13 at 11:18
  • @AkiSuihkonen The last should be "c[i]=W[i]" -> "c[i]=C[i]"? Also, know of an Intel optimized routine for W == 4? – Chad Brewbaker Apr 15 '17 at 21:02
5

How about padding it out to the next "clever" (e.g. 8 or 16) size, with all '1' on the diagonal?
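A minimal sketch of the padding (the function name and one-row-per-uint16_t layout are my assumptions): putting 1s on the unused part of the diagonal makes the embedding multiplicative, since [A 0; 0 I] * [B 0; 0 I] = [A*B 0; 0 I].

```c
#include <stdint.h>

/* Embed a k x k bit matrix, k <= 16, into a 16x16 one.
 * Row i is a uint16_t with bit j = element (i, j).
 * The 1s on the padded diagonal keep products of padded
 * matrices equal to the padded products. */
void pad_to_16(int k, const uint16_t *src, uint16_t dst[16])
{
    for (int i = 0; i < k; ++i)
        dst[i] = src[i];                  /* original rows, top-left block */
    for (int i = k; i < 16; ++i)
        dst[i] = (uint16_t)(1u << i);     /* identity on the padded diagonal */
}
```

The caller then multiplies the two padded 16x16 matrices with the fast fixed-size routine and reads the k x k result out of the top-left block.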

sheu
  • Then you'll spend a lot of time multiplying zeroes if you pad from 9x9 to 16x16. – John Dvorak Jan 29 '13 at 10:48
    We're multiplying bit vectors, a machine word at a time. The extra multiplications would be far outweighed by the conditional branching you'd spend handling different sizes. – sheu Jan 29 '13 at 10:50
  • @sheu Do you think you could please expand a bit on your answer? How exactly should the 16x16 representation of a 9x9 (say) matrix look like? And what should the signature of this function be? Should the padding be the responsibility of the caller or the function itself? – hakoja Jan 29 '13 at 11:19
    @JanDvorak, not if you use the fact that boolean "multiplication" is just AND. Also, the processor is going to operate on the word size. So any "small" square matrix should use the native word size of the machine, and probably pad the MSBs to avoid having to shift to get your final answer. – Josh Petitt Jan 29 '13 at 17:23
4

Depending on your application, storing both the matrix and its transpose together might help. You will save the time that would otherwise be spent transposing during matrix multiplications, at the expense of some memory and a few extra operations.
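A minimal sketch of this bookkeeping (the type and names are mine): keep the row-packed and column-packed forms side by side and update both on every write, so a multiplication can AND a row of one operand directly against a column of the other without transposing first.

```c
#include <stdint.h>

/* An 8x8 bit matrix stored together with its transpose. */
typedef struct {
    uint8_t row[8];  /* row i, bit j = element (i, j)     */
    uint8_t col[8];  /* column j, bit i = element (i, j)  */
} BitMat8;

/* Set element (i, j) to v, keeping both representations in sync. */
void bm8_set(BitMat8 *m, int i, int j, int v)
{
    if (v) {
        m->row[i] |= (uint8_t)(1u << j);
        m->col[j] |= (uint8_t)(1u << i);
    } else {
        m->row[i] &= (uint8_t)~(1u << j);
        m->col[j] &= (uint8_t)~(1u << i);
    }
}
```

With this layout, element (i, j) of a product is just the parity of `a.row[i] & b.col[j]`, with no transpose step in the hot path.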

Sjoerd
2

There is a faster method for multiplying 8x8 matrices using 64-bit multiplication along with some simple bit trickery, which works for either GF(2) or boolean algebra. Assuming the matrices are each packed as 8 consecutive 8-bit rows inside a 64-bit int, we can use multiplication to scatter the bits and do the job in just one loop:

uint64_t mul8x8 (uint64_t A, uint64_t B) {

    const uint64_t ROW = 0x00000000000000FF;
    const uint64_t COL = 0x0101010101010101;

    uint64_t C = 0;

    for (int i=0; i<8; ++i) {
        uint64_t p = COL & (A>>i);
        uint64_t r = ROW & (B>>i*8);
        C |= (p*r); // use ^ for GF(2) instead
    }
    return C;
}
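As a quick sanity check of the trick (self-contained, so mul8x8 is repeated verbatim): with the 8x8 identity packed one row per byte, multiplying by it should return the other operand unchanged.

```c
#include <stdint.h>

/* mul8x8 from the answer above, repeated so this check compiles on its own. */
uint64_t mul8x8(uint64_t A, uint64_t B)
{
    const uint64_t ROW = 0x00000000000000FF;
    const uint64_t COL = 0x0101010101010101;
    uint64_t C = 0;
    for (int i = 0; i < 8; ++i) {
        uint64_t p = COL & (A >> i);      /* column i of A, one bit per byte */
        uint64_t r = ROW & (B >> i * 8);  /* row i of B, in the low byte */
        C |= p * r;                       /* scatter row i into the rows selected by p */
    }
    return C;
}

/* The 8x8 identity, packed one row per byte: byte i holds 1 << i. */
static const uint64_t ID8 = 0x8040201008040201ULL;
```

For any packed matrix B, both mul8x8(ID8, B) and mul8x8(B, ID8) yield B, confirming that the scatter-by-multiplication really computes the matrix product.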

The code for 16x16 is straightforward if you can afford blocking the rows for improved efficiency. This trick is also used extensively in high-performance linear algebra libraries, and consists of partitioning the matrix into N/M x N/M blocks of MxM submatrices, with M = 2^m chosen to maximize locality in cache. The usual way to deal with N % M != 0 is to pad the rows and columns with 0s, so that the same algorithm can be used for all block multiplications.

We can apply the same ideas to boolean matrices of variable dimension 8 <= N <= 16, as long as we can afford to have the matrices represented internally in a row-blocked format. We just assume the matrix is 16x16, with the last 16-N rows and columns filled with 0s:

void mul16x16 (uint64_t C[2][2], const uint64_t A[2][2], const uint64_t B[2][2]) {

    for (int i=0; i<2; ++i)
        for (int j=0; j<2; ++j)
            C[i][j] = mul8x8(A[i][0],B[0][j])
                    | mul8x8(A[i][1],B[1][j]); // once again, use ^ instead for GF(2)
}

Notice that we have done a 16x16 matrix multiplication in just 8x8 = 64 integer products and some bit operations.

Also, mul8x8 can be improved considerably with modern SSE/AVX vector instructions. In theory it is possible to perform all 8 products in parallel with one AVX-512 instruction (we still need to scatter the data into the ZMM register first) and then reduce horizontally using log2(8) = 3 instructions.