Binary matrix vector multiplication

Question

I want to multiply a 8x8 binary matrix represented as a unsigned 64 bit integer by a 8 bit vector represented by a unsigned char. However, due to some other issues the matrix must be ordered by columns, ergo there's no easy matching of bytes for easy multiplication.

Any idea how to speed up such a calculation? Every operation counts for I need billions of such calculations made.

The multiplications are made over a 2 element field (F-2).

Can you give an example of (slow) code? What is your current speed? Do you target the modern x86/x86_64? — osgx, Jun 30 '11 at 16:33
running on 64 bit architecture -- the slow code is a direct translation from F-7, so it multiplies bit by bit. — Kornel Kisielewicz, Jun 30 '11 at 16:47
Is the operation `result= vector*matrix` or `result= matrix*vector`? Posting the naive version would help, because I think you can reduce it to `output[bit]= vector[bit] && matrix_row[bit]!=0`. — MSN, Jun 30 '11 at 16:55
@Kornel, right, but if you are just doing that you can then use SWAR techniques to do bit operations in parallel. (SIMD Within A Register.) Which is what Peter G. Did. — MSN, Jul 01 '11 at 16:58
@MSN, no need anymore, I did the calculations already (it was checking a set of graphs for a math paper) -- still, thank you :) — Kornel Kisielewicz, Jul 01 '11 at 17:48

Peter G. · Accepted Answer · 2011-06-30T17:30:22.530

With this matrix and vector representation, it helps to do matrix multiplication this way:

(col₁ ... col₈) * (v₁ ... v₈)^T = col₁ * v₁ + ... + col₈ * v₈

where matrix A = (col₁ ... col₈)

and column vector v = (v₁ ... v₈)^T

Thinking this further, you can do all multiplications at once if you inflate the 8-bit vector to a 64-bit vector by repeating every bit 8 times and then calculating P = A & v_inflated. The only thing left then, is the addition (i.e. XOR) of the products.

A simple approach for XORing the products is.

uint64_t P = calculated products from text above;
uint64_t sum = 0;
for( int i = 8; i; --i )
{
   sum ^= P & 0xFF;
   P >> 8;  
}

You can optimize sum further with: `P^= P>>32; P^= P>>16; P^= P>>8; sum= P & 0xff;` — MSN, Jun 30 '11 at 20:45

score 6 · Answer 2 · answered Jun 30 '11 at 16:47

6

You ONLY HAVE 256 vectors! Use lookup tables to generate the right bitmasks, then your logic will be something like

output_bit_n = bool (matrix [n] & lookup [vector])

In other words, your lookup table can transpose an 8-bit value into the 64-bit world.

You can efficiently pack this into the result with rotate-with-carry instructions if the compiler isn't smart enough to optimise (value<<=1)|=result.

answered Jun 30 '11 at 16:47

spraff

32,570
22
121
229

+1: while this is helpful advice it's the one with the column based multiplication that made the most difference, still - thank you! – Kornel Kisielewicz Jul 01 '11 at 13:25

Binary matrix vector multiplication

2 Answers2

Linked