1

I have a scenario where I need to multiply a small vector a with a huge, highly sparse matrix b. Here's a simplified version of the code:

import numpy as np

B = 32
M = 10000000

a = np.random.rand(B)
b = np.random.rand(B, M)
b = b > 0.9

result = a @ b

In my actual use case, the b matrix is loaded from an np.memmap file due to its large size. Importantly, b remains unchanged throughout the process and is multiplied with a different vector a each time, so any preprocessing of b that leverages its sparse nature is allowed.
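
For reference, the load looks roughly like this (the filename, dtype and shape below are placeholders, not my real values):

import numpy as np

B = 32
M = 10000000

# read-only view of the matrix stored on disk; 'b.dat' stands in for the real file
b = np.memmap('b.dat', dtype=np.bool_, mode='r', shape=(B, M))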

I'm seeking suggestions on how to optimize this matrix multiplication speed. Any insights or code examples would be greatly appreciated.

Reinderien
Garvey
  • Prior to loading the file, do you have knowledge of the matrix dimensions and sparsity ratio (at least an upper bound)? – Reinderien Aug 13 '23 at 17:19
  • 32 * 10e6 * 8 is "only" 2.6 GB. Does this really not fit in memory? – Reinderien Aug 13 '23 at 17:29
  • I think you can reuse the `sparse_compute` function of [this post](https://stackoverflow.com/a/74921059/12939557) so to get a fast implementation (certainly faster than the provided answer, especially since the Numba code is parallel and quite optimized). – Jérôme Richard Aug 13 '23 at 20:42
  • By the way, note that `b` is a boolean array here, and such arrays behave quite differently from floating-point ones. Datatypes matter a lot for performance! If you use such an array in practice, note that it can be compacted by storing the boolean values as bits rather than bytes. This makes the array 8x smaller, so only 38 MiB for `b` here! Not to mention this can also make computations significantly faster (if properly optimized), possibly even faster than sparse matrices (unless the matrix is very, very sparse: >99% of zeros). A sketch of the bit-packing mechanics is shown after these comments. – Jérôme Richard Aug 13 '23 at 20:47
  • @Reinderien The upper bound of B is 100k and that of M is 10 million. I use `memmap` as it can load these huge matrices in seconds, and I find that using `memmap` or not does not affect the efficiency significantly. I am not sure whether memory matters in this case. – Garvey Aug 14 '23 at 02:19
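
For reference, a minimal sketch of the bit-packing idea from the comment above; np.packbits/np.unpackbits only show the mechanics here, while the speedups mentioned would need a kernel that works on the packed bits directly:

import numpy as np

B, M = 32, 10000          # small M just to keep the demo light
a = np.random.rand(B)
b = np.random.rand(B, M) > 0.9

# pack each row's booleans into bits: 8x smaller than the byte-per-bool array
b_bits = np.packbits(b, axis=1)

# naive use: unpack back to booleans (a chunk at a time in practice) and multiply
b_back = np.unpackbits(b_bits, axis=1, count=M).astype(bool)
assert np.array_equal(b, b_back)
result = a @ b_back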

2 Answers

2

Testing with a smaller M - I don't want to hang my machine with memory errors or long calcs:

In [340]: B = 32
     ...: M = 10000
     ...: 
     ...: a = np.random.rand(B)
     ...: b = np.random.rand(B, M)
     ...: b = b > 0.9
     ...: 
     ...: 
In [341]: timeit a@b
1.23 ms ± 21.4 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [342]: timeit np.einsum('i,ij',a,b)
590 µs ± 3.83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)
In [343]: timeit np.einsum('i,ij',a,b, optimize=True)
1.57 ms ± 83 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

I'm a little surprised that einsum is so much faster. Often einsum ends up using the same BLAS functions as matmul, especially if optimize is turned on. The big difference between the B and M dimensions probably has something to do with it; BLAS-type code was probably written with 'squarish' matrices in mind.
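
A quick way to check how much the boolean dtype (and the implicit cast it forces in the multiply) contributes is to time against a pre-cast float copy; a sketch, not timed here:

import numpy as np
from timeit import timeit

B, M = 32, 10000
a = np.random.rand(B)
b = np.random.rand(B, M) > 0.9
bf = b.astype(np.float64)   # pre-cast copy, so the multiply sees only float data

print(timeit(lambda: a @ b, number=1000))
print(timeit(lambda: a @ bf, number=1000))
print(timeit(lambda: np.einsum('i,ij', a, b), number=1000))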

Trying scipy.sparse:

In [344]: from scipy import sparse
In [345]: bM = sparse.csr_matrix(b)
In [346]: timeit bM = sparse.csr_matrix(b)
3.19 ms ± 13.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [347]: bM
Out[347]: 
<32x10000 sparse matrix of type '<class 'numpy.bool_'>'
    with 32241 stored elements in Compressed Sparse Row format>
In [348]: bM.indptr
Out[348]: 
array([    0,  1017,  2008,  3028,  4049,  5041,  5971,  6955,  7930,
        8957,  9986, 10978, 12031, 13050, 14067, 15072, 16142, 17140,
       18093, 19106, 20074, 21096, 22122, 23150, 24152, 25170, 26197,
       27232, 28271, 29233, 30246, 31226, 32241], dtype=int32)

Creating the sparse matrix takes time. In effect it has to do an np.nonzero(b) on b to find the nonzero elements.
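
For illustration, the same sparse matrix can be built directly from those coordinates; csr_matrix(b) does roughly the equivalent of this for you:

from scipy import sparse

rows, cols = np.nonzero(b)                    # the expensive scan over b
data = np.ones(rows.size, dtype=bool)         # every stored value is True
bM2 = sparse.csr_matrix((data, (rows, cols)), shape=b.shape)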

Having made it, the multiplication proceeds as before, delegating to bM's own version of matrix multiplication (don't use np.matmul, np.dot, or np.einsum with bM):

In [349]: res = a@bM
In [350]: type(res)
Out[350]: numpy.ndarray
In [351]: np.allclose(res, a@b)
Out[351]: True
In [352]: timeit a@bM
268 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

and the time is better.

I'm not sure how the csr_matrix() step would work with a memmap array. But once created, that matrix can probably be held in memory, saving the disk access time.
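
If converting straight from the memmap is awkward, one option (a sketch; the filename and chunk size are made up) is to convert column blocks and save the CSR result once, so later runs only pay for a load_npz:

import numpy as np
from scipy import sparse

B, M = 32, 10000000
# read-only memmap of the on-disk matrix; 'b.dat' is a placeholder name
b_mm = np.memmap('b.dat', dtype=np.bool_, mode='r', shape=(B, M))

chunk = 1_000_000
blocks = [sparse.csr_matrix(np.asarray(b_mm[:, i:i + chunk]))
          for i in range(0, M, chunk)]
bM = sparse.hstack(blocks, format='csr')

sparse.save_npz('b_csr.npz', bM)              # one-time preprocessing
# later runs: bM = sparse.load_npz('b_csr.npz'); res = a @ bM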

hpaulj
  • Thanks for the answer. Now I use a csr array to store the binary matrix `b`; it offers a reduced storage footprint and lower RAM usage compared to the `memmap` array. – Garvey Aug 14 '23 at 15:10
1

I think you can use np.einsum.

import numpy as np
import time

B = 32
M = 10000000

a = np.random.rand(B)
b = np.random.rand(B, M)
b = b > 0.9

start = time.time()
result = a @ b
end = time.time() - start
print(end)

start = time.time()
a = np.expand_dims(a, 0)
result1 = np.einsum('ab,bm->am', a, b)
end = time.time() - start
print(end)
print(np.allclose(result, result1))

And the result

0.7853567600250244
0.4943866729736328
True
GenZ
  • I wonder if/why your solution is faster than doing `np.einsum('b,bm->m', a, b)` without using `np.expand_dims`? – Alexander S. Brunmayr Aug 13 '23 at 03:16
  • 2
    Also, `np.allclose(result, result)` always returns `True`... you should probably use different names for the results, like `result1` and `result2` – Alexander S. Brunmayr Aug 13 '23 at 03:18
  • 1
    @AlexanderS.Brunmayr Hi, I have just tested `np.einsum('b,bm->m', a, b)`, which has similar performance to `np.einsum('ab,bm->am', a, b)` – Garvey Aug 13 '23 at 03:24
  • 1
    Thank you for your suggestion @Alexander S. Brunmayr. The `result` should be `result1`. And `np.einsum('b,bm->m', a, b)` also gives a slightly better result. – GenZ Aug 13 '23 at 03:27
  • @GenZ Thank you for suggesting the np.einsum solution. I've tested it, and it consistently offers around a 20% improvement in efficiency on my system, which is promising. While I was going through its documentation, I found myself unsure about how `np.einsum` works to speed up and whether it inherently takes advantage of the sparsity. Do you have any insight on this? – Garvey Aug 13 '23 at 03:29
  • 2
    @Garvey, I'm not sure how `np.einsum` works but I don't think `np.einsum` can take advantage of the sparsity. – GenZ Aug 13 '23 at 03:56
  • 1
    `einsum` is no faster than matrix multiply. The main issues are that (1) your matrix isn't that sparse, and (2) most algorithms for sparse matrices keep the matrix in some special format that takes up less space. You've still got the entire matrix, and the entire matrix has to be read. Look at `scipy.sparse`; a rough density check is sketched below. – Frank Yellin Aug 13 '23 at 06:18
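
To put a rough number on the "not that sparse" point: with the question's 0.9 threshold the density is about 10%, and CSR stores a 4-byte column index per nonzero, so the savings over a 1-byte-per-element boolean array are far smaller than against a float64 array. A quick check (sketch):

import numpy as np
from scipy import sparse

b = np.random.rand(32, 10000) > 0.9
density = b.mean()                            # ~0.10 with a 0.9 threshold

bM = sparse.csr_matrix(b)
dense_bytes = b.nbytes                        # 1 byte per stored boolean
csr_bytes = bM.data.nbytes + bM.indices.nbytes + bM.indptr.nbytes
print(density, dense_bytes, csr_bytes)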