
I am trying to speed up an algorithm. The bottleneck is computing "Ax", where A is a large sparse matrix of dimension n × m and x is a dense vector of length m. My algorithm selects d specific columns of A out of the m columns, where d << m, along with the corresponding d elements of x. Call these sub_A and sub_x; I then only need to compute the product of sub_A and sub_x.

However, I found that this kind of multiplication in SciPy shows no clear speed-up. Even if I make d < m/100, the speed-up is only about 2x, which is strange given how much the second dimension of A has shrunk. I tried similar code in MATLAB and got a much clearer speed-up: with d < m/100, the computation is almost 50-100 times faster.

I checked on the internet and noticed that there is some strange bottleneck that makes sparse matrix multiplication with a dense NumPy vector extremely slow. People suggest alternative modules, but those stopped being updated several years ago.

Is there any other method in Python that can solve this problem? Otherwise I will have to move my whole project to MATLAB.

I have tried the computation in both SciPy and MATLAB, with a 99% sparse A and a dense x.

import scipy.sparse as sp
import numpy as np
import time
m = 10000      # number of columns of A / length of x
n = 100        # number of rows of A
d = 100        # number of selected columns (d = m/100)
times = 100    # repetitions for timing
x = np.ones((m, 1))                               # dense vector

A = sp.random(n, m, density=0.01, format='csr')   # ~99% sparse matrix

start_time = time.time()
for i in range(times):
    c = A.dot(x)
end_time = time.time()

print("Ax cost:", end_time - start_time)

row_indices = np.random.choice(m, d, replace=False)  # pick d random columns
sub_x = x[row_indices]                                # matching entries of x

sub_A = A[:, row_indices]                             # sub-matrix with only d columns

start_time = time.time()
for i in range(times):
    c = sub_A.dot(sub_x)
end_time = time.time()

print("sub_A x cost:", end_time - start_time)

The output is

Ax cost: 0.002000093460083008
sub_A dot sub_x cost: 0.0010018348693847656

Even with d = m/100, there is no big difference in computation time.

Xun S
  • What changes with the "shrink"? Does the number of nonzero terms change? – hpaulj Mar 26 '23 at 05:50
  • You really need to show us some code if you want meaningful improvements on it – Homer512 Mar 26 '23 at 08:04
  • Which sparse matrix format are you using? What are typical values of `d` and `m`? – Warren Weckesser Mar 26 '23 at 12:29
  • Use mkl and a python wrapper, that's way more optimized for sparse matrix vector ops. – CJR Mar 26 '23 at 13:58
  • A quick set of timings indicates that while dense `A@x` is quite sensitive to the `d` dimension, the sparse `M@x` is more sensitive to the `nnz`. From `M._mul_vector` you can see that it does `fn(M, N, self.indptr, self.indices, self.data, other, result)`, where `fn` a compiled (cython?) function for this format (`csr/csc`), and `result` is a `np.zeros()` array. – hpaulj Mar 26 '23 at 16:19
  • Sorry for my previous vague statement, and I'm happy that so many people are willing to help me! My speed-up method shrinks the dimension of A by selecting only the necessary columns: the new matrix sub_A is a collection of some columns of A, and there is a corresponding sub_x; the number of columns is d << m. I have added the specific code to the main part. – Xun S Mar 26 '23 at 17:38
  • With `%timeit` I get `17.7 µs` for `sub_A@sub_x`. In my experience times like that with sparse code are dominated by calling overhead, and relatively insensitive to matrix dimensions. The dense equivalent drops from 1ms to 21 µs, benefiting much more from a change in dimensions. – hpaulj Mar 27 '23 at 22:55
  • I previously thought you were dropping from a (m,m) matrix to a (d,m). Where as you are actually dropping to a (m,d). But even so, once the times get down to the 30 µs range the overhead seems to dominate. – hpaulj Mar 27 '23 at 23:01
  • Thank you for your reply! Yes, the overhead dominates the process, and I have no idea how to cut it down in SciPy; is there any method that can help me? I tried a similar experiment in MATLAB, and there seems to be no such overhead there, so my algorithm does improve performance. In Python, however, this kind of improvement is negligible because of the overhead. – Xun S Mar 28 '23 at 10:24
  • @hpaulj, I apologize for bothering you. I am new to Stack Overflow and I am not sure if my reply has been noticed by you as I forgot to use @ in my previous message. If you have already seen my reply, please forgive my rudeness. – Xun S Apr 01 '23 at 07:43
  • @Homer512 I apologize for bothering you. I am new to Stack Overflow and I forgot to use @ in my previous message after adding the experiments code. If you have already seen my reply, please forgive my rudeness. – Xun S Apr 01 '23 at 07:45
  • @Warren Weckesser I am sorry that I made mistakes in the notation; I have updated the experiment code. I believe this problem is connected to the overhead, as hpaulj said, but I have no idea how to deal with it. – Xun S Apr 01 '23 at 07:47

1 Answer


If you are running against the interpreter overhead, which seems to be the consensus, then there is precious little you can do.

I find some improvement by using the @ operator instead of the .dot method, since operators have faster code paths. Other than that, you may want to give Numba a try, but I couldn't test this right now.
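A rough, untested sketch of the Numba idea: Numba cannot compile scipy.sparse objects directly, so the hypothetical csr_matvec helper below works on the raw CSR arrays (indptr, indices, data) instead, which also sidesteps most of the Python-level call overhead.

import numpy as np
from numba import njit

@njit(cache=True)
def csr_matvec(n_rows, indptr, indices, data, x):
    # y[i] = sum of data[k] * x[indices[k]] over the stored entries of row i
    y = np.zeros(n_rows)
    for i in range(n_rows):
        acc = 0.0
        for k in range(indptr[i], indptr[i + 1]):
            acc += data[k] * x[indices[k]]
        y[i] = acc
    return y

# usage with the arrays from the question (sub_x is (d, 1), so flatten it):
# c = csr_matvec(sub_A.shape[0], sub_A.indptr, sub_A.indices, sub_A.data, sub_x.ravel())

The first call includes JIT compilation time, so this only pays off if the kernel is reused many times.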

Other interpreters may also be worth trying. Python 3.11 came with some improvements.

Beyond that, and in general, the best way to reduce overhead is to do more with fewer calls. Try to aggregate multiple matrix-vector products into one matrix-matrix product.
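For illustration only (the variable names here are made up for this sketch), aggregating could look like the following: instead of `times` separate products, stack the vectors into one dense matrix and pay the call overhead once.

import numpy as np
import scipy.sparse as sp

m, n, times = 10000, 100, 100
A = sp.random(n, m, density=0.01, format='csr')
xs = [np.random.rand(m) for _ in range(times)]

# one matrix-vector product per call: `times` trips through the Python layer
ys = [A @ x for x in xs]

# one matrix-matrix product: a single call, column i of Y equals ys[i]
X = np.column_stack(xs)   # shape (m, times)
Y = A @ X                 # shape (n, times)

This only helps if the vectors are available at the same time, i.e. the products do not depend on each other's results.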

Homer512