
Doing something like

import numpy as np
a = np.random.rand(10**4, 10**4)
b = np.dot(a, a)

uses multiple cores, and it runs nicely.

The elements in a, though, are 64-bit floats (or 32-bit floats on 32-bit platforms?), and I'd like to multiply 8-bit integer arrays instead. However, trying the following:

a = np.random.randint(2, size=(n, n)).astype(np.int8)

results in the dot product not using multiple cores, and thus running ~1000x slower on my PC (a minimal script to reproduce the timings below follows the table).

array: np.random.randint(2, size=shape).astype(dtype)

dtype    shape          %time (average)

float32 (2000, 2000)    62.5 ms
float32 (3000, 3000)    219 ms
float32 (4000, 4000)    328 ms
float32 (10000, 10000)  4.09 s

int8    (2000, 2000)    13 seconds
int8    (3000, 3000)    3min 26s
int8    (4000, 4000)    12min 20s
int8    (10000, 10000)  It didn't finish in 6 hours

float16 (2000, 2000)    2min 25s
float16 (3000, 3000)    Not tested
float16 (4000, 4000)    Not tested
float16 (10000, 10000)  Not tested
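
For reference, a minimal script to reproduce the comparison above, using time.perf_counter instead of IPython's %time; the exact numbers are machine- and BLAS-dependent, and the int8/float16 runs take minutes even at n=2000:

import numpy as np
from time import perf_counter

def bench(dtype, n=2000):
    # build a random 0/1 matrix of the requested dtype and time a single np.dot
    a = np.random.randint(2, size=(n, n)).astype(dtype)
    start = perf_counter()
    np.dot(a, a)
    return perf_counter() - start

for dtype in (np.float32, np.int8, np.float16):
    print(dtype.__name__, bench(dtype))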

I understand NumPy uses BLAS, which doesn't support integers, but if I use the SciPy BLAS wrappers, i.e.

import scipy.linalg.blas as blas
a = np.random.randint(2, size=(n, n)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)

the computation is multi-threaded. Now, blas.sgemm runs with exactly the same timing as np.dot for float32 arrays, but for non-float inputs it converts everything to float32 and outputs float32, which is something np.dot doesn't do. (In addition, b is now in F_CONTIGUOUS order, which is a lesser issue.)
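
A quick way to see both effects (dtype and flags are standard ndarray attributes; np.ascontiguousarray copies back to C order if that matters downstream):

import numpy as np
import scipy.linalg.blas as blas

a = np.random.randint(2, size=(1000, 1000)).astype(np.int8)
b = blas.sgemm(alpha=1.0, a=a, b=a)
print(b.dtype)                  # float32, even though a is int8
print(b.flags['F_CONTIGUOUS'])  # True: sgemm returns the result in Fortran order
c = np.ascontiguousarray(b)     # copy back to C order if needed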

So, if I want to do integer matrix multiplication, I have to do one of the following:

  1. Use NumPy's painfully slow np.dot and be glad I get to keep the 8-bit integers.
  2. Use SciPy's sgemm and use up 4x the memory (a small sketch of this, combined with the mod-2 reduction I ultimately need, follows this list).
  3. Use NumPy's np.float16 and only use up 2x the memory, with the caveat that np.dot is much slower on float16 arrays than on float32 arrays, even more so than on int8 arrays.
  4. Find an optimized library for multi-threaded integer matrix multiplication (actually, Mathematica does this, but I'd prefer a Python solution), ideally supporting 1-bit arrays, although 8-bit arrays would also be fine... (I'm actually aiming to do multiplication of matrices over the finite field Z/2Z, and I know I can do this with Sage, which is quite Pythonic, but, again, is there something strictly Python?)
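
For completeness, here is the sketch mentioned in option 2, assuming 0/1 entries: compute the product in float32 (exact as long as n stays below 2**24, since float32 has a 24-bit significand) and reduce mod 2 afterwards for the Z/2Z case.

import numpy as np

n = 4000
a = np.random.randint(2, size=(n, n)).astype(np.int8)

# multi-threaded BLAS path, at the cost of ~4x the memory of the int8 input
prod = np.dot(a.astype(np.float32), a.astype(np.float32))
result = (prod % 2).astype(np.int8)  # entries back in {0, 1}, i.e. over Z/2Z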

Can I follow option 4? Does such a library exist?

Disclaimer: I'm actually running NumPy + MKL, but I've tried a similar test on vanilla NumPy, with similar results.

étale-cohomology
  • About your option n°4, maybe you could have a look at [PyCuda](https://pypi.python.org/pypi/pycuda) or at [Theano](http://deeplearning.net/software/theano/)? They allow large operations to be done on the GPU (with an easy interface to numpy) and give quite good performance. – mgc Jan 30 '16 at 12:04
  • Oh yeah, I'm using both! While PyCUDA doesn't support float16 (yet?), they both do the job. I'm looking for a CPU implementation, though. – étale-cohomology Jan 30 '16 at 12:09
  • As a possible answer to option 4, https://bitbucket.org/malb/m4ri looks interesting. "M4RI is a library for fast arithmetic with dense matrices over F2." I guess that's what Sage is already using, but I don't see any reason why you couldn't use it directly from Python, with a suitable Cython wrapper. (In fact, you might be able to find such a wrapper already in the Sage sources.) – Mark Dickinson Jan 31 '16 at 10:21
  • Nobody mentioned `numpy.einsum` yet, but that might be a good option 5. – Jan 31 '16 at 22:22
  • Note that you will need to cast the result to something bigger if you want to avoid integer overflow. If each element is either 0 or 1, you need an integer format that can hold values up to at least `n` in order to guarantee no overflow. For your example where `n=10000`, (u)int16 ought to be enough. Are your real matrices sparse, by any chance? If so, you would be much better off using `scipy.sparse.csr_matrix`. – ali_m Jan 31 '16 at 22:44
  • Hmm. NumPy should fix this. Maybe not relevant for int8 since you will get overflow, but also int32 is slow. – Jens Munk Feb 01 '16 at 03:50
  • @ali_m The overflow situation is rather serious, and I hadn't even thought about it. That means uint8's are not enough! Also, the matrices are not sparse. (A small casting sketch follows this comment thread.) – étale-cohomology Feb 01 '16 at 22:57
  • @morningsun It took me a while, but I managed to (kind of) understand `np.einsum`. Alas, `np.einsum('ij, jk', a, a)` (matrix multiplication) is not multi-threaded. Still, `np.einsum` turns out to be a cool function. I'm gonna start using it for other stuff. – étale-cohomology Feb 01 '16 at 23:00
  • Could you give some more context for the overall problem you are trying to solve? Multiplying big integer matrices together is a rather unusual thing to do. It would be particularly useful to know more about the properties of these matrices. Are the values always either 0 or 1? If they can be larger then you may well find yourself ultimately constrained by the largest integer that can be represented using uint64. How are the matrices generated? Do they have any special structure (e.g. symmetry, blocks, bands etc.)? – ali_m Feb 01 '16 at 23:49
  • @MarkDickinson M4RI looks nice, but I couldn't build it on Windows. Still, I read the paper on the algorithm they implement for matrix multiplication over F2 (i.e. Z/2Z), and apparently it's faster than Magma's, which is quite remarkable. However, since I'd prefer not to be bothered with wrapping C code for Python, I'd rather use Sage. – étale-cohomology Feb 02 '16 at 01:34
  • Multi-threading isn't the real issue here; your machine does not have enough cores to give you a 1000x speedup from multi-threading. The performance differences you're seeing have to do with optimizations in cache management, CPU instructions, scheduling and the like (and maybe a small gain from multi-threading). As @ali_m said, more context would help us understand if maybe there is a better approach to your overall problem. – Bi Rico Feb 09 '16 at 17:40
  • @BiRico You're absolutely right. Case in point: `np.einsum` also runs single-threaded, but it multiplies two `(10,000, 10,000)` integer matrices in 800 seconds on a 2.2 GHz core. Now, if it only ran on multiple cores... then we'd be cooking. – étale-cohomology Feb 10 '16 at 10:13
  • @ali_m I'm just hand-coding solutions to group theory exercises. The more general the matrix multiplication algorithm, the more representations over various finite fields we can explore. The more efficient the algorithm, the larger groups/representations we can explore on a given hardware platform. (I'm not using Sage since I'm coding this by hand for learning purposes, hoping to use an efficient matrix multiplication algorithm as my only 'black box'.) The "holy grail" would be, of course, being able to do multiplication with certain `(196882, 196882)` matrices... – étale-cohomology Feb 10 '16 at 10:24
  • Have you tried using [dask](http://dask.pydata.org/en/latest/)? It exposes blocked algorithms over the numpy array interface and returns an ndarray. – Zeugma Feb 13 '16 at 02:58
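
To make ali_m's overflow point concrete, the small casting sketch referred to above: with int8 inputs, np.dot also produces an int8 result, so any entry larger than 127 silently wraps around. Casting to a wider type first avoids this.

import numpy as np

n = 300
a = np.ones((n, n), dtype=np.int8)

wrapped = np.dot(a, a)                                  # int8 output: 300 wraps to 44
safe = np.dot(a.astype(np.int32), a.astype(np.int32))   # wide enough to hold 300
print(wrapped[0, 0], safe[0, 0])                        # 44 300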

2 Answers


Note that as this answer ages, NumPy may gain optimized integer support. Please verify whether this approach is still faster on your setup.

  • Option 5 - Roll a custom solution: Partition the matrix product into a few sub-products and perform these in parallel. This can be implemented relatively easily with standard Python modules. The sub-products are computed with numpy.dot, which releases the global interpreter lock. Thus, it is possible to use threads, which are relatively lightweight and can access the arrays from the main thread for memory efficiency.

Implementation:

import numpy as np
from numpy.testing import assert_array_equal
import threading
from time import time


def blockshaped(arr, nrows, ncols):
    """
    Return an array of shape (nrows, ncols, n, m) where
    n * nrows, m * ncols = arr.shape.
    This should be a view of the original array.
    """
    h, w = arr.shape
    n, m = h // nrows, w // ncols
    return arr.reshape(nrows, n, ncols, m).swapaxes(1, 2)


def do_dot(a, b, out):
    # np.dot(a, b, out) does not work here: the out= argument must be a C-contiguous
    # array with the exact result dtype, and these blocks are non-contiguous views
    out[:] = np.dot(a, b)  # less efficient: the result goes through a temporary array


def pardot(a, b, nblocks, mblocks, dot_func=do_dot):
    """
    Return the matrix product a * b.
    The product is split into nblocks * mblocks partitions that are performed
    in parallel threads.
    """
    n_jobs = nblocks * mblocks
    print('running {} jobs in parallel'.format(n_jobs))

    out = np.empty((a.shape[0], b.shape[1]), dtype=a.dtype)

    out_blocks = blockshaped(out, nblocks, mblocks)
    a_blocks = blockshaped(a, nblocks, 1)
    b_blocks = blockshaped(b, 1, mblocks)

    threads = []
    for i in range(nblocks):
        for j in range(mblocks):
            th = threading.Thread(target=dot_func, 
                                  args=(a_blocks[i, 0, :, :], 
                                        b_blocks[0, j, :, :], 
                                        out_blocks[i, j, :, :]))
            th.start()
            threads.append(th)

    for th in threads:
        th.join()

    return out


if __name__ == '__main__':
    a = np.ones((4, 3), dtype=int)
    b = np.arange(18, dtype=int).reshape(3, 6)
    assert_array_equal(pardot(a, b, 2, 2), np.dot(a, b))

    a = np.random.randn(1500, 1500).astype(int)

    start = time()
    pardot(a, a, 2, 4)
    time_par = time() - start
    print('pardot: {:.2f} seconds taken'.format(time_par))

    start = time()
    np.dot(a, a)
    time_dot = time() - start
    print('np.dot: {:.2f} seconds taken'.format(time_dot))
    

With this implementation I get a speedup of approximately 4x, which matches the number of physical cores in my machine:

running 8 jobs in parallel
pardot: 5.45 seconds taken
np.dot: 22.30 seconds taken
MB-F
  • It works! This is the `O(n**3)` matrix product, which does exactly `n**2` dot products, correct? – étale-cohomology Mar 03 '16 at 15:39
  • It splits the matrix product into a number of smaller matrix products. In the extreme case this can be vector dot products. – MB-F Mar 03 '16 at 18:24
  • pardot is slower than np.dot when the types are float: `running 4 jobs in parallel / running 8 jobs in parallel / pardot: 0.13 seconds taken / np.dot: 0.07 seconds taken` – kory Oct 31 '20 at 19:30
  • even worse when the dataset is 10x the size: `pardot: 1212.89 seconds taken / np.dot: 73.11 seconds taken` – kory Oct 31 '20 at 20:02
  • @kory This is to be expected. Please use `np.dot` for floating point multiplication. – MB-F Nov 01 '20 at 07:54
  • @kazemakase yeah it would be nice to find a solution for both cases, I was just pointing out how bad it is if someone was to use this for floating point values as well. Thanks! – kory Nov 01 '20 at 13:02

"Why is it faster to perform float by float matrix multiplication compared to int by int?" explains why integers are so slow: First, the CPUs have high-throughput floating point pipelines. Second, BLAS has no integer-type.

Workaround: Converting the matrices to float32 values gets big speedups. How about a 90x speedup on a 2015 MacBook Pro? (Using float64 is half as good.)

import numpy as np
import time

def timeit(func):
    # wall-clock time for a single call
    start = time.time()
    func()
    end = time.time()
    return end - start

# np.random.random_integers is deprecated; randint(0, 10) gives the same 0..9 range
a = np.random.randint(0, 10, size=(1000, 1000)).astype(np.int8)

timeit(lambda: a.dot(a))  # ≈0.9 sec
# note: casting back to int8 overflows for entries larger than 127; use a wider dtype in practice
timeit(lambda: a.astype(np.float32).dot(a.astype(np.float32)).astype(np.int8))  # ≈0.01 sec
Jerry101