If your matrices have fixed dimensions, the most reliable way to get an answer is simply trial and error. However, if you do not know the dimensions of your matrices/vectors in advance, the rules of thumb are:

Your sparse vectors should have an effectively constant number of nonzero entries,

which for matrices implies

Your N x N sparse matrix should have <= c * N nonzero entries, where c is a constant "much less" than N.
Let's give a pseudo-theoretical explanation of this rule. Consider the fairly simple task of computing the scalar (or dot) product of two vectors with double-valued coordinates. If you have two dense vectors of the same length N, your code will look like
// vector and wector are double arrays of length N
double sum = 0;
for (int i = 0; i < N; i++)
{
    sum += vector[i] * wector[i];
}
This amounts to N additions, N multiplications and N conditional branches (loop iterations). The most costly operation here is the conditional branch: so costly that we may neglect the multiplications, and all the more so the additions. The reason why it is so expensive is explained in an answer to this question.
UPD: In fact, in a for loop you risk choosing a wrong branch only once, at the end of the loop, since by default the predicted branch is the one that stays in the loop. This amounts to at most one pipeline restart per scalar product.
Let's now have a look at how sparse vectors are realized in BLAS. There, each vector is encoded by two arrays, one of values and one of the corresponding indices, something like
1.7 -0.8 3.6
171 83 215
(plus one integer giving the supposed length N). The BLAS documentation states that the ordering of the indices plays no role here, so the data
-0.8 3.6 1.7
83 215 171
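To make the encoding concrete, here is a minimal Java sketch of such a value/index representation; the names (SparseVector, n, indices, values) are made up for this illustration and are not part of any BLAS API:

```java
// Illustrative BLAS-style sparse vector: nonzero values plus their positions.
class SparseVector {
    final int n;            // the supposed (logical) length N
    final int[] indices;    // positions of the nonzero entries, in any order
    final double[] values;  // values[k] is the entry at position indices[k]

    SparseVector(int n, int[] indices, double[] values) {
        this.n = n;
        this.indices = indices;
        this.values = values;
    }

    // Expand into a dense array; useful to check that reordering the
    // (index, value) pairs does not change the encoded vector.
    double[] toDense() {
        double[] dense = new double[n];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }
}
```

With this sketch, the two orderings shown above expand to the same dense vector for any N > 215.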
encodes the same vector. This remark gives enough information to reconstruct the algorithm for the scalar product. Given two sparse vectors encoded by the data int[] indices, double[] values and int[] jndices, double[] walues, one would calculate their scalar product along the lines of this "code":
double sum = 0;
for (int i = 0; i < indices.length; i++)
{
    for (int j = 0; j < jndices.length; j++)
    {
        if (indices[i] == jndices[j])
        {
            sum += values[i] * walues[j];
        }
    }
}
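For reference, here is the same quadratic algorithm wrapped into a runnable method (the method and class names are my own, chosen for this sketch):

```java
// Quadratic sparse-sparse scalar product, exactly as sketched above:
// compare every index of the first vector against every index of the second.
class SparseSparse {
    static double sparseDot(int[] indices, double[] values,
                            int[] jndices, double[] walues) {
        double sum = 0;
        for (int i = 0; i < indices.length; i++) {
            for (int j = 0; j < jndices.length; j++) {
                if (indices[i] == jndices[j]) {
                    sum += values[i] * walues[j];
                }
            }
        }
        return sum;
    }
}
```

For example, vectors with nonzeros {0: 2.0, 5: 3.0} and {5: 4.0, 7: 1.0} overlap only at index 5, so their product is 3.0 * 4.0 = 12.0.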
which gives a total of indices.length * jndices.length * 2 + indices.length conditional branches. This means that merely to keep up with the dense algorithm, your vectors should have at most on the order of sqrt(N) nonzero entries. The point here is that the dependency on N is already nonlinear, so there is no sense in asking whether you need 1%, 10% or 25% filling. A 10% filling is perfect for vectors of length 10, still sort of OK for length 50, and already a total ruin for length 100.
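The crossover can be checked with a bit of arithmetic. Using the counts derived above (about N branches for the dense product and 2*k*k + k for the sparse one, where k is the number of nonzeros per vector), a 10% filling gives:

```java
// Compare the branch counts derived in the text: dense ~ N branches,
// sparse ~ 2*k^2 + k branches, with k nonzero entries per vector.
class BranchCounts {
    static long dense(int n) { return n; }
    static long sparse(int k) { return 2L * k * k + k; }

    public static void main(String[] args) {
        for (int n : new int[]{10, 50, 100}) {
            int k = n / 10;   // 10% filling
            System.out.println("N=" + n + ": dense=" + dense(n)
                    + ", sparse=" + sparse(k));
        }
    }
}
```

This reproduces the 10/50/100 pattern above: 3 vs 10 branches at length 10, 55 vs 50 at length 50, and 210 vs 100 at length 100.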
UPD. In this code snippet you have an if branch, and the probability of taking the wrong path is about 50%. Thus, a scalar product of two sparse vectors will amount to somewhere between 0.5 and 1 times the average number of nonzero entries per sparse vector in pipeline restarts, depending on how sparse your vectors are. The numbers are to be adjusted: in an if statement without an else, the shorter branch ("do nothing") will be predicted first, but still.
Note that the most efficient operation is a scalar product of a sparse and a dense vector. Given a sparse vector encoded by indices and values and a dense vector dense, your code will look like
double sum = 0;
for (int i = 0; i < indices.length; i++)
{
    sum += values[i] * dense[indices[i]];
}
i.e. you'll have indices.length conditional branches, which is good.
UPD. Once again, I bet you'll have at most one pipeline restart per operation. Note also that, as far as I know, modern processors speculatively execute along the predicted branch, so a correctly predicted branch costs next to nothing and you mostly pay for the mispredictions.
Now, when multiplying a matrix by a vector, you basically take #rows scalar products of vectors. Multiplying a matrix by a matrix amounts to taking #((nonzero) columns in the second matrix) matrix-by-vector multiplications. You are welcome to figure out the complexity by yourself.
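As a sketch of the matrix-by-vector case, assume the matrix is stored as a dense array of sparse rows, with each row in the index/value format used for sparse vectors above (the class and parameter names here are illustrative). Each output entry is then one sparse-times-dense scalar product:

```java
// Matrix-by-vector product for a matrix stored as a dense array of sparse
// rows: rowIndices[r] and rowValues[r] encode row r in the index/value
// format. One sparse-times-dense scalar product per row.
class SparseRowMatVec {
    static double[] multiply(int[][] rowIndices, double[][] rowValues,
                             double[] dense) {
        double[] result = new double[rowIndices.length];
        for (int r = 0; r < rowIndices.length; r++) {
            double sum = 0;
            for (int i = 0; i < rowIndices[r].length; i++) {
                sum += rowValues[r][i] * dense[rowIndices[r][i]];
            }
            result[r] = sum;
        }
        return result;
    }
}
```

For example, the matrix [[0, 2, 0], [1, 0, 3]] stored this way and multiplied by the dense vector [1, 1, 1] yields [2, 4].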
And here is where all the black-magic deep theory of different matrix storage begins. You may store your sparse matrix as a dense array of sparse rows, as a sparse array of dense rows, or as a sparse array of sparse rows. The same goes for columns. All the funny abbreviations from Scipy cited in the question have to do with exactly that.
You will "always" have an advantage in speed if you multiply a matrix built of sparse rows by a dense matrix, or by a matrix of dense columns. You may want to store your sparse matrix data as dense vectors of diagonals - as is the case for convolutional neural networks - and then you'll need completely different algorithms. You may want to make your matrix a block matrix - BLAS does - and get a reasonable computation boost. You may want to store your data as two matrices - say, a diagonal one and a sparse one, which is the case for the finite element method. You could make use of sparsity for general neural networks (like feed-forward, extreme learning machine or echo state networks) if you always multiply a row-stored matrix by a column vector, but avoid multiplying matrices. And you will "always" get an advantage from using sparse matrices if you follow the rule of thumb above - it holds for finite elements and convolutional networks, but fails for reservoir computing.