7

I want to multiply a 8x8 binary matrix represented as a unsigned 64 bit integer by a 8 bit vector represented by a unsigned char. However, due to some other issues the matrix must be ordered by columns, ergo there's no easy matching of bytes for easy multiplication.

Any idea how to speed up such a calculation? Every operation counts for I need billions of such calculations made.

The multiplications are made over a 2 element field (F-2).

osgx
  • 90,338
  • 53
  • 357
  • 513
Kornel Kisielewicz
  • 55,802
  • 15
  • 111
  • 149
  • Can you give an example of (slow) code? What is your current speed? Do you target the modern x86/x86_64? – osgx Jun 30 '11 at 16:33
  • running on 64 bit architecture -- the slow code is a direct translation from F-7, so it multiplies bit by bit. – Kornel Kisielewicz Jun 30 '11 at 16:47
  • 1
    Is the operation `result= vector*matrix` or `result= matrix*vector`? Posting the naive version would help, because I think you can reduce it to `output[bit]= vector[bit] && matrix_row[bit]!=0`. – MSN Jun 30 '11 at 16:55
  • @MSN, the whole point was to avoid `[bit]` indexing. – Kornel Kisielewicz Jul 01 '11 at 13:26
  • @Kornel, right, but if you are just doing that you can then use SWAR techniques to do bit operations in parallel. (SIMD Within A Register.) Which is what Peter G. Did. – MSN Jul 01 '11 at 16:58
  • @MSN, no need anymore, I did the calculations already (it was checking a set of graphs for a math paper) -- still, thank you :) – Kornel Kisielewicz Jul 01 '11 at 17:48

2 Answers2

8

With this matrix and vector representation, it helps to do matrix multiplication this way:

(col1 ... col8) * (v1 ... v8)T = col1 * v1 + ... + col8 * v8

where matrix A = (col1 ... col8)

and column vector v = (v1 ... v8)T

Thinking this further, you can do all multiplications at once if you inflate the 8-bit vector to a 64-bit vector by repeating every bit 8 times and then calculating P = A & v_inflated. The only thing left then, is the addition (i.e. XOR) of the products.

A simple approach for XORing the products is.

uint64_t P = calculated products from text above;
uint64_t sum = 0;
for( int i = 8; i; --i )
{
   sum ^= P & 0xFF;
   P >> 8;  
}
Peter G.
  • 14,786
  • 7
  • 57
  • 75
  • 3
    You can optimize sum further with: `P^= P>>32; P^= P>>16; P^= P>>8; sum= P & 0xff;` – MSN Jun 30 '11 at 20:45
6

You ONLY HAVE 256 vectors! Use lookup tables to generate the right bitmasks, then your logic will be something like

output_bit_n = bool (matrix [n] & lookup [vector])

In other words, your lookup table can transpose an 8-bit value into the 64-bit world.

You can efficiently pack this into the result with rotate-with-carry instructions if the compiler isn't smart enough to optimise (value<<=1)|=result.

spraff
  • 32,570
  • 22
  • 121
  • 229