I am trying to implement a matrix-vector multiplication over the binary field GF(2). The vector x has dimension 1xa, the matrix M has dimension axb, and the result y = x * M has dimension 1xb. Right now, x and M are of type uint8_t*, with one 0/1 value per byte; I concatenate the columns of M into a single array, since they are accessed sequentially anyway. The function looks like:
void mul(uint8_t *M, size_t a, size_t b, uint8_t *x, uint8_t *y) {
    uint8_t val;
    uint8_t *ptr;
    for (size_t i = 0; i < b; i++) {       /* one output bit per column of M */
        val = 0;
        ptr = M + i * a;                   /* start of column i */
        for (size_t j = 0; j < a; j++) {
            val ^= (x[j] & *ptr++);        /* AND = multiply, XOR = add over GF(2) */
        }
        y[i] = val;
    }
}
M, x, and y have been allocated by the caller as:
M = malloc(sizeof(uint8_t) * a * b);
x = malloc(sizeof(uint8_t) * a);
y = malloc(sizeof(uint8_t) * b);
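Just to make the column-major layout concrete, here is a tiny made-up example (continuing from the allocations above, with a = 3 and b = 2 and arbitrary bit values):

/* Illustration only: a = 3, b = 2; column i of M occupies M[i*a] .. M[i*a + a - 1]. */
uint8_t col0[3] = {1, 0, 1};   /* first column of M  */
uint8_t col1[3] = {0, 1, 1};   /* second column of M */
for (size_t j = 0; j < a; j++) {
    M[0 * a + j] = col0[j];
    M[1 * a + j] = col1[j];
}
x[0] = 1; x[1] = 1; x[2] = 0;  /* x = (1 1 0) */

mul(M, a, b, x, y);            /* y[0] = (1&1)^(1&0)^(0&1) = 1,
                                  y[1] = (1&0)^(1&1)^(0&1) = 1 */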
As this routine is called billions of times, I need to optimize the shit out of it ;) To do so, I was thinking of:
- instead of representing each 0/1 as a separate uint8_t (i.e., using 8 bits per bit), I could pack the bits of "x" and of each column of "M" into much smaller arrays of uint64_t, of ap and Mp words respectively, where
ap = (size_t) ceil ((double) a / 64); Mp = b * ap;
(each column is zero-padded to a whole number of words so that the columns stay word-aligned; see the packing sketch after this list)
- using vector intrinsics.
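Concretely, the packing step I have in mind is roughly the following (a sketch only, not my exact code; pack_bits is an illustrative helper name, and it uses the same word / WORD_BITS definitions as the code further below):

#include <stdint.h>
#include <string.h>
#include <limits.h>

typedef uint64_t word;
#define WORD_BITS (CHAR_BIT * sizeof (word))

/* Packs n 0/1 bytes into out_words words, left-aligned: input bit j lands in
   bit (WORD_BITS - 1 - j % WORD_BITS) of word j / WORD_BITS; unused trailing
   bits are left zero, so the padding cannot disturb the result. */
static void pack_bits(const uint8_t *in, size_t n, word *out, size_t out_words) {
    memset(out, 0, out_words * sizeof (word));
    for (size_t j = 0; j < n; j++) {
        if (in[j] & 1)
            out[j / WORD_BITS] |= (word)1 << (WORD_BITS - 1 - (j % WORD_BITS));
    }
}

x is packed once into ap words, and each of the b columns of M is packed separately into ap words (giving Mp = b * ap words in total), so the zero padding of x and of every column lines up.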
So far, I have implemented the (left-aligned) packing of M and x (with proper memory alignment) and the multiplication as:
#include <stdint.h>
#include <limits.h>
#include <string.h>

typedef uint64_t word;
#define WORD_BITS (CHAR_BIT * sizeof (word))

/* xlen = words per packed column (= ap), Mlen = b * xlen. */
void mul_fast(word *M, size_t Mlen, word *x, size_t xlen, size_t b, word *y) {
    memset(y, 0, b * sizeof (word));           /* y doubles as the accumulator */
    for (size_t i = 0; i < Mlen; i++) {
        y[i / xlen] ^= (M[i] & x[i % xlen]);   /* word-wise AND, XOR-accumulate per column */
    }
    for (size_t i = 0; i < b; i++) {
        y[i] = __builtin_popcountll(y[i]) & 1; /* parity of the accumulator = result bit */
    }
}
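For completeness, this is roughly how I size the packed buffers and call it (a fragment only, reusing the illustrative pack_bits helper from the sketch above; assumes the original a, b, M, x and <stdlib.h>):

/* Fragment: sizing the packed buffers and calling mul_fast.
   ap and Mp are the word counts from the list above; Mw, xw, yw are
   illustrative names for the packed buffers. */
size_t ap = (a + WORD_BITS - 1) / WORD_BITS;    /* = ceil(a / 64), words per packed column */
size_t Mp = b * ap;

word *Mw = malloc(Mp * sizeof (word));
word *xw = malloc(ap * sizeof (word));
word *yw = malloc(b  * sizeof (word));          /* one word per output bit, as mul_fast expects */

pack_bits(x, a, xw, ap);                        /* pack x once */
for (size_t i = 0; i < b; i++)
    pack_bits(M + i * a, a, Mw + i * ap, ap);   /* pack column i of M into ap words */

mul_fast(Mw, Mp, xw, ap, b, yw);                /* yw[i] should match y[i] from mul() */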
However, it turns out the above is much slower than the straightforward implementation of mul().
Do you have any ideas or references? I am not an assembler expert, so comparing the output of gcc -S does not tell me much :/
Thank you, best regards, Tom.