
The question has changed since its initial posting as I've chased down a few leads. At this point, I'd say that I'm really looking for the following answers:

  1. Can a significant amount of time be saved by replacing addition/multiplication followed by a modulo 2 operation with and/logical_xor (assuming that the total number of such operations is kept the same)? If not, then why not? ANSWER: some time can indeed be saved, but it's arguable whether that amount is "significant".

  2. Where can I read more about the specific approach taken by the BLAS matrix multiplication underlying numpy? Ideally, I'd like a source that doesn't require deciphering the FORTRAN code forged by the sages of the past. ANSWER: The original paper proposing the BLAS matrix multiplication algorithms used today can be found here.

I've left my question in its original form below for posterity.


The following are two algorithms for multiplying binary matrices (i.e. taking the "dot" product) modulo 2. The first ("default") approach just uses numpy matrix multiplication, then reduces modulo 2. The second ("alternative") approach attempts to speed things up by replacing multiplication with AND and addition with XOR.

import timeit
import numpy as np
import matplotlib.pyplot as plt

def mat_mult_1(A,B):
    return A@B%2

def mat_mult_2(A,B):
    return np.logical_xor.reduce(A[:,:,None]&B[None,:,:],axis = 1)
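The replacement relies on the fact that, for single bits, addition mod 2 is XOR and multiplication mod 2 is AND; a quick sanity check along these lines confirms that the two functions agree on random inputs:

    # For bits a, b: (a + b) % 2 == a ^ b and (a * b) % 2 == a & b, so summing
    # with XOR and multiplying with AND computes the matrix product mod 2.
    A = np.random.randint(2, size=(50, 50))
    B = np.random.randint(2, size=(50, 50))
    assert np.array_equal(mat_mult_1(A, B), mat_mult_2(A, B))  # boolean output compares equal to 0/1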

Contrary to my expectations, the alternative approach seems to take about 4 times longer than the default for products of larger binary matrices. Why is that? Is there some way I could speed up my alternative approach?

Here's the script I used to test the above two methods:

n_vals = np.arange(5, 205, 5)
times = []

for n in n_vals:
    s_1 = f"mat_mult_1(np.random.randint(2, size=({n},{n})), np.random.randint(2, size=({n},{n})))"
    s_2 = f"mat_mult_2(np.random.randint(2, size=({n},{n})), np.random.randint(2, size=({n},{n})))"

    times.append((timeit.timeit(s_1, globals=globals(), number=100),
                  timeit.timeit(s_2, globals=globals(), number=100)))

and here are two plots of the results.

[Plots: computation times of mat_mult_1 and mat_mult_2 vs. matrix size n]
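The plotting code isn't shown; a minimal sketch that produces a linear and a log-log view of `times` (one possible version, not necessarily what generated the figures here) is:

    # Sketch: one linear-scale and one log-log plot of the timing data.
    times = np.array(times)
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
    ax1.plot(n_vals, times[:, 0], label="mat_mult_1 (@ then %2)")
    ax1.plot(n_vals, times[:, 1], label="mat_mult_2 (AND/XOR)")
    ax1.set_xlabel("n"); ax1.set_ylabel("time for 100 runs (s)"); ax1.legend()
    ax2.loglog(n_vals, times[:, 0], label="mat_mult_1")
    ax2.loglog(n_vals, times[:, 1], label="mat_mult_2")
    ax2.set_xlabel("n"); ax2.set_ylabel("time (log scale)"); ax2.legend()
    plt.show()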


Minor updates:

I was able to test these out for larger matrices (up to 1000x1000) and get a better sense of the asymptotics here. It indeed seems to be the case that the "default" algorithm here is O(n^2.7), whereas the alternative is the expected O(n^3) (the observed slopes were 2.703 and 3.133, actually).
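Slopes like these can be estimated with a least-squares fit on the log-log data; a sketch, assuming `n_vals` and `times` collected as in the script above (extended to the larger sizes):

    # Fit log(time) ~ slope * log(n) + c; the slope estimates the exponent in O(n^slope).
    times = np.array(times)                      # shape (len(n_vals), 2)
    log_n = np.log(n_vals)
    for label, col in [("default (@ then %2)", 0), ("alternative (AND/XOR)", 1)]:
        slope, intercept = np.polyfit(log_n, np.log(times[:, col]), 1)
        print(f"{label}: observed slope = {slope:.3f}")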

[Plot: timings for matrices up to 1000x1000]

I also checked how the alternative algorithm compared to the following implementation of "schoolbook" matrix multiplication followed by a mod operation.

def mat_mult_3(A,B):
    return np.sum(A[:,:,None]*B[None,:,:],axis = 1)%2

I was very surprised to find that this also does better than the and/xor based method!

[Plot: timings of mat_mult_2 vs. mat_mult_3]

In response to Michael's comment, I replaced mat_mult_2 with the following:

def mat_mult_2(A,B):
    return np.logical_xor.reduce(A.astype(bool)[:,:,None]
                                 & B.astype(bool)[None,:,:], axis=1).astype(int)

This arguably still places an undue burden of type conversion on the method, but sticking to multiplication between boolean matrices didn't significantly change performance. The result is that mat_mult_2 now (marginally) outperforms mat_mult_3, as expected.

[Plot: timings with the boolean version of mat_mult_2 vs. mat_mult_3]
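For completeness, the conversions can be taken out of the timed code entirely by generating boolean inputs up front; a sketch (the name `mat_mult_2_bool` is just for illustration):

    # Variant that assumes the inputs are already boolean arrays, so no
    # int/bool conversions happen inside the function itself.
    def mat_mult_2_bool(A, B):
        return np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)

    A = np.random.randint(2, size=(200, 200)).astype(bool)
    B = np.random.randint(2, size=(200, 200)).astype(bool)
    C = mat_mult_2_bool(A, B)   # boolean result; use C.astype(int) if 0/1 ints are needed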

In response to Harold's comment: another attempt to get the asymptotics of the @ method. My device doesn't seem to be able to handle multiplication with n much greater than 2000.

The observed slope here is 2.93.

[Plot: timings of the @ method for larger n]

Ben Grossmann
  • I did find [this post](https://stackoverflow.com/q/10442365/2476977), which is certainly relevant. Some factors that come into play here: numpy uses existing codes for [BLAS routines](https://netlib.org/blas/) from ATLAS. At the very least, it seems that numpy isn't using the "schoolbook" algorithm for matrix multiplication; rather it is using something with better asymptotics; that at least explains why the computation time ratio seems worse for larger matrices. What matrix multiplication algorithm is it using, though? – Ben Grossmann Oct 27 '22 at 16:44
  • 1
    As you note `@` is using highly optimized BLAS routines - at least where possible. Equivalents using broadcasted element-multiply and sum aren't close in speed. And don't assume that boolean operations like `logical_or/and` are faster than addition/multiplication. – hpaulj Oct 27 '22 at 17:01
  • @hpaulj Thanks for the input. Do you have any idea about the specifics of the multiplication algorithm used in numpy, or where I could find out more about it? My suspicion is that they're using the [SGEMM method documented here](https://netlib.org/lapack/explore-html/db/dc9/group__single__blas__level3_gafe51bacb54592ff5de056acabd83c260.html). I have no experience coding in FORTRAN though, so I've been looking for a more human-readable explanation of what's under the hood there. – Ben Grossmann Oct 27 '22 at 17:09
  • @hpaulj And yes, I did assume that the boolean operations would be significantly faster than addition/multiplication in addition to avoiding the `%2` operation (which I suppose could also have been done bitwise...). It's surprising to hear otherwise. – Ben Grossmann Oct 27 '22 at 17:12
  • In my benchmarks with shape (200,200), *mat_mult_2* is ~4x faster if arrays are cast to `bool`. *mat_mult_1* is ~5x faster when cast to `np.float32`. – Michael Szczesny Oct 27 '22 at 18:07
  • @Michael Good catch! So I guess it's the implicit casting between int and bool that might be eating up the time in mat_mult_2, then – Ben Grossmann Oct 27 '22 at 18:12
  • @Michael I just realized I missed a part of your comment: is floating point multiplication really that much faster than integer multiplication? – Ben Grossmann Oct 28 '22 at 01:31
  • AFAIK, `numpy` doesn't use BLAS calls for integer matmul, but there has been a lot of work to optimize known bottlenecks lately. [Google colab](https://colab.research.google.com/drive/1SMsxGx_-nquVYcS9tMQ9ymMbtdwWacUw?usp=sharing) still uses `numpy 1.21.6` and `OpenBLAS`. – Michael Szczesny Oct 28 '22 at 03:44
  • As an actual speed tip, perhaps the `galois` package helps. It claims to be faster, but I never tried that. – harold Nov 04 '22 at 13:10
  • @harold This is especially interesting since this whole line of investigation started when I was thinking about [implementing operations in `GF(2**8)`](https://stackoverflow.com/a/74214835/2476977). Thanks for the tip! – Ben Grossmann Nov 04 '22 at 13:23

2 Answers


For a modest n = 10, let's compare some alternatives:

Using @ and modulus:

In [15]: timeit A@A%2
8.1 µs ± 118 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

Your alternative:

In [16]: timeit np.logical_xor.reduce(A[:,:,None]&A[None,:,:],axis = 1)
25 µs ± 1.05 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

The @ equivalent:

In [17]: timeit np.sum(A[:,:,None]&A[None,:,:], axis=1)%2
33.2 µs ± 65.7 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

So the logical operations are somewhat faster, but not drastically so.

And to get an idea of how much time the modulus step takes (about 4 µs):

In [18]: timeit np.sum(A[:,:,None]&A[None,:,:], axis=1)
29.6 µs ± 113 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

In [19]: timeit A@A
4.52 µs ± 11.6 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

So in [15], the @ and modulus take about the same time.

edit

In [27]: timeit np.sum(A[:,:,None]*A[None,:,:], axis=1)
28.9 µs ± 81.5 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
hpaulj
  • A nitpick is that your `@` equivalent should have `A[:,:,None]*A[None,:,:]` instead of &, but as I've been finding out that probably won't make much of a difference. The fact that `@` and `%` would take the same amount of time is very surprising! Thanks for the answer. – Ben Grossmann Oct 27 '22 at 17:47

It looks like I mostly answered my own question. Here's a summary of what I found.

  • One way the method I proposed falls short of the numpy method is in its asymptotic complexity. Whereas my method follows the naive (AKA "schoolbook") algorithm of matrix multiplication, numpy pulls its approach from the BLAS routines. My best guess is that numpy is using the SGEMM method, which (to my limited understanding, based on some quick googling and article skimming) seems to be a variant of the Strassen algorithm for matrix multiplication. So, where my method does O(n^3) operations (for a product of two binary n×n matrices), numpy's method does O(n^2.8) (which is roughly borne out by my observations).

  • Another way my method falls short is in the repeated implicit type conversions that occur when calling boolean methods on an array of integers. This can be avoided by using boolean arrays as the algorithm input.

  • The result, accounting for these discrepancies, is this: if the schoolbook algorithm is applied but addition and multiplication are replaced by XOR and AND, then (according to my trials) the computation time is reduced by about 20%. This isn't nothing, but it's less than I was expecting; a sketch of the comparison follows the list.
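A sketch of the kind of comparison behind that ~20% figure (boolean inputs for both versions so the conversions from the second bullet don't interfere; function names are just for this sketch, and exact numbers will vary by machine):

    import timeit
    import numpy as np

    def schoolbook_mod2(A, B):
        # schoolbook: elementwise products, integer sum over the middle axis, then % 2
        return np.sum(A[:, :, None] * B[None, :, :], axis=1) % 2

    def schoolbook_xor_and(A, B):
        # same loop structure, but + -> XOR and * -> AND
        return np.logical_xor.reduce(A[:, :, None] & B[None, :, :], axis=1)

    n = 300
    A = np.random.randint(2, size=(n, n)).astype(bool)
    B = np.random.randint(2, size=(n, n)).astype(bool)

    t_plain = timeit.timeit(lambda: schoolbook_mod2(A, B), number=10)
    t_xor = timeit.timeit(lambda: schoolbook_xor_and(A, B), number=10)
    print(f"+/* then %2: {t_plain:.3f} s    XOR/AND: {t_xor:.3f} s")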

Ben Grossmann
  • Most BLAS implementations don't use Strassen, though there are some that do. Even when it would be faster (which isn't always), it has poor numerical guarantees (unless some extra tricks are used that cost extra time too). Strassen is completely safe for finite fields, so *you* can use it for your own implementation. – harold Nov 03 '22 at 02:18
  • @harold Interesting. Do you know what those BLAS implementation use, if not Strassen, to get under O(n^3)? – Ben Grossmann Nov 03 '22 at 02:23
  • They don't get under O(n^3), but they approach it, which requires significant engineering effort already (I recommend reading "Anatomy of High-Performance Matrix Multiplication", by the same person who wrote GotoBLAS). I've seen your benchmarks that hint at some O(n^2.7) thing going on, but since the matrixes only went up to 1kx1k that might also be explained by the relative overhead going down as n goes up. Maybe you are really using a BLAS that uses Strassen, but that would surprise me. They exist, but it would be a strange default. – harold Nov 03 '22 at 03:02
  • @harold Thanks for the recommendation! I found that paper, but I haven't gotten the chance to dig into it; I suppose your comment is an excuse to prioritize it a bit. It shouldn't be too difficult to time out some larger matrices if I let my computer run overnight... I'll update my question accordingly if I get around to that. – Ben Grossmann Nov 03 '22 at 03:22