
I am trying to implement a matrix-vector multiplication over a binary field. The vector x is of dimension 1xa, the matrix M is of dimension axb, and the result y = x * M is of size 1xb. Right now, I have implemented it such that x and M are of type uint8_t*, i.e., I concatenate the columns of M, as they are also accessed successively. The function looks like:

void mul(uint8_t* M, size_t a, size_t b, uint8_t* x, uint8_t* y) {
    uint8_t val;
    uint8_t *ptr;
    for(size_t i = 0; i < b; i++) {
        val = 0;
        ptr = M + i * a;
        for(size_t j = 0; j < a; j++) {
            val ^= (x[j] & *ptr++);
        }
        y[i] = val;
    }
}

M, x, and y have been allocated by the caller as

M = malloc(sizeof(uint8_t) * a * b);
x = malloc(sizeof(uint8_t) * a);
y = malloc(sizeof(uint8_t) * b);

As this routine is called billions of times, I need to optimize the shit out of it ;) To do so, I was thinking of

  • instead of representing each 0/1 as a separate uint8_t (i.e., 8 bits), I could pack all bits of "x" and "M" into arrays of uint64_t of much smaller size (see the sketch after this list), e.g., ap and Mp, where

ap = (size_t) ceil((double) a / 64);
Mp = (size_t) ceil((double) (a * b) / 64);

  • using vector intrinsics.
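A plain C version of that packing might look like the sketch below (pack_bits is just an illustrative name; I pack left aligned, so element 0 lands in the most significant bit, and each column of M is packed separately so that every column starts on a word boundary):

#include <stddef.h>
#include <stdint.h>

/* Pack n 0/1 bytes into ceil(n / 64) words, left aligned:
 * src[0] ends up in the most significant bit of dst[0], and any
 * unused bits of the last word stay zero (padding). */
static void pack_bits(const uint8_t *src, size_t n, uint64_t *dst)
{
    size_t nw = (n + 63) / 64;
    for (size_t w = 0; w < nw; w++) {
        uint64_t v = 0;
        for (size_t k = 0; k < 64 && w * 64 + k < n; k++)
            v |= (uint64_t)(src[w * 64 + k] & 1) << (63 - k);
        dst[w] = v;
    }
}

Column i of M would then be packed as pack_bits(M + i * a, a, Mw + i * ap), where Mw is a placeholder name for the packed matrix buffer.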

So far, I have accomplished the (left-aligned) packing of M along those lines (with proper word alignment) and the multiplication as

typedef uint64_t word;
#define WORD_BITS      (CHAR_BIT * sizeof (word))

void mul_fast(word *M, size_t Mlen, word *x, size_t xlen, size_t b, word *y) {
    /* XOR-accumulate the AND of every matrix word with the matching
     * vector word into the output word for its column
     * (y must be zeroed by the caller). */
    for(size_t i = 0; i < Mlen; i++) {
        y[i/xlen] ^= (M[i] & x[i % xlen]);
    }
    /* Reduce each accumulated word to a single parity bit. */
    for(size_t i = 0; i < b; i++) {
        y[i] = __builtin_popcountll(y[i]) & 1;
    }
}
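Here xlen = ap = ceil(a / 64), and as far as I can tell the i % xlen indexing requires Mlen = b * xlen, i.e., every column padded out to a whole number of words; y also has to be zeroed before the call, since the first loop accumulates into it. A call then looks roughly like this (buffer names are just placeholders):

size_t xlen = (a + 63) / 64;             /* words per packed column */
size_t Mlen = b * xlen;                  /* each column padded to xlen words */
word *Mw = malloc(Mlen * sizeof(word));  /* packed M, column by column */
word *xw = malloc(xlen * sizeof(word));  /* packed x */
word *yw = malloc(b * sizeof(word));     /* one parity word per output bit */
/* ... pack M and x into Mw and xw ... */
memset(yw, 0, b * sizeof(word));         /* needs <string.h>; mul_fast accumulates */
mul_fast(Mw, Mlen, xw, xlen, b, yw);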

However, it turns out that the above is much slower than the straightforward implementation of mul().

Do you have any ideas or references? I am not an assembler expert, so comparing the output of gcc -S does not tell me much :/

Thank you, best regards, Tom.

tomseidel1
  • I am not a super expert but whenever I see a `%` or `/` in code that is supposed to be over-optimized I get suspicious. Division and modulus are terribly slow, at a more fundamental level than the language as far as I know. – kabanus Jan 03 '19 at 19:20
  • @kabanus: if `xlen` is a compile-time constant power of 2, it's fine. (`size_t` is an unsigned type so it really is just a shift and AND. Signed division / remainder has different rounding semantics than shift and mask, so it takes a few extra instructions). But if it's a runtime variable then yeah, you're really screwed. – Peter Cordes Jan 03 '19 at 19:24
  • @PeterCordes I see, thanks for enlightening me, I didn't even think about that, and good point. – kabanus Jan 03 '19 at 19:26
  • related: [Large (0,1) matrix multiplication using bitwise AND and popcount instead of actual int or float multiplies?](https://stackoverflow.com/q/45376006) maybe a duplicate, I think it's doing exactly the same problem. Packing from bytes to bitmaps with SSE2 or AVX2 should be done with `_mm_movemask_epi8` as a building block. – Peter Cordes Jan 03 '19 at 19:27
  • Thanks for the references! I think I understand how I can use _mm_movemask_epi8 for creating the bitmaps from the respective uint8_t arrays. However, I am still stuck with the actual matrix-vector multiplication: contrary to the other solutions, I would prefer to have my matrix simply as a one-dimensional vector of bitmaps. Shouldn't this already be more efficient? Besides, the binary matrices aren't that large, e.g., typically 200x1000 bits, i.e., ceil(200*1000/64) = 3125 uint64_t integers. Can you see what the problem is with the code above? – tomseidel1 Jan 03 '19 at 21:45
  • Is your `mul` able to inline into a function where `xlen` is a compile-time-constant power of 2? If not, then kabanus is right and `/` and `%` are the main problems. Use a nested loop instead of trying to play modulo tricks to get wrap-around in a single iteration. Also, compilers might not do a great job at hoisting `y[i/xlen] ^= ...` into a register for `xlen` iterations instead of using memory. **You really need the asm to have nested loops for efficiency, so you should write your C that way.** You can certainly do 2d indexing into a 1d array in those loops, though. – Peter Cordes Jan 03 '19 at 21:59
  • You are compiling with optimization enabled, right? – Peter Cordes Jan 03 '19 at 22:02
  • Wow! Changing the xlen (temporarily) to 2 (which is the right value for the current case) reduced the running time (with -O3) by almost a factor of 2! – tomseidel1 Jan 03 '19 at 22:59

1 Answer


The relevant difference in the assembler output is:

.L26:
-       movq    %r10, %rax
-       xorl    %edx, %edx
-       divq    %rcx
-       movq    (%r11,%rdx,8), %rdx
-       andq    (%rdi,%r10,8), %rdx
-       addq    $1, %r10
-       xorq    %rdx, (%r9,%rax,8)
-       cmpq    %r10, %rsi
+       movq    %rax, %rcx
+       movq    %rax, %r10
+       andl    $1, %ecx
+       shrq    %r10
+       movq    (%rdx,%rcx,8), %rcx
+       andq    (%rdi,%rax,8), %rcx
+       addq    $1, %rax
+       xorq    %rcx, (%r9,%r10,8)
+       cmpq    %rax, %rsi

Can you see what the culprit was?
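For the record, the fix suggested in the comments (an explicit nested loop, so the compiler never has to emit a runtime division or modulo) might look like this sketch; mul_fast2 is just an illustrative name, assuming Mlen = b * xlen as in the question:

void mul_fast2(const word *M, const word *x, size_t xlen, size_t b, word *y) {
    for (size_t i = 0; i < b; i++) {
        word acc = 0;                          /* stays in a register */
        const word *col = M + i * xlen;        /* column i of the packed matrix */
        for (size_t j = 0; j < xlen; j++)
            acc ^= col[j] & x[j];              /* AND, then XOR-accumulate */
        y[i] = __builtin_popcountll(acc) & 1;  /* parity = GF(2) dot product */
    }
}

If xlen is additionally a small compile-time constant, the inner loop can be fully unrolled or vectorized by the compiler.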

tomseidel1
  • Replacing `div` with shift/AND from a compile-time-constant `xlen` makes it about 30 to 90 times faster. Integer division is a *massive* bottleneck, especially for 64-bit integers (instead of 32-bit). [C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](https://stackoverflow.com/q/40354978). – Peter Cordes Jan 04 '19 at 02:25