If your matrices have fixed dimensions, the most reliable way to get an answer is simply trial and error. However, if you do not know the dimensions of your matrices/vectors in advance, the rules of thumb are:

Your sparse vectors should have an effectively constant number of nonzero entries,

which for matrices implies

Your N x N sparse matrix should have <= c * N nonzero entries, where c is a constant "much less" than N.
Let's give a pseudo-theoretical explanation of this rule. Consider the fairly simple task of computing the scalar (or dot) product of two vectors with double-valued coordinates. If you have two dense vectors of the same length N, your code will look like
// vector and wector are double arrays of length N
double sum = 0;
for (int i = 0; i < N; i++)
{
    sum += vector[i] * wector[i];
}
This amounts to N additions, N multiplications and N conditional branches (loop iterations). The most costly operation here is the conditional branch: so costly that we may neglect the multiplications, and all the more so the additions. The reason why it is so expensive is explained in an answer to this question.
UPD: In fact, in a for loop you risk choosing a wrong branch only once, at the end of the loop, since by default the predicted branch is the one that stays in the loop. This amounts to at most one pipeline restart per scalar product.
Let's now have a look at how sparse vectors are realized in BLAS. There, each vector is encoded by two arrays, one of values and one of the corresponding indices, something like
1.7 -0.8 3.6
171 83 215
(plus one integer giving the supposed length N). The BLAS documentation states that the ordering of the indices plays no role here, so the data
-0.8 3.6 1.7
83 215 171
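To make the encoding concrete, here is a minimal Java sketch of such a value/index representation; the names (SparseVector, n, indices, values) are made up for this illustration and are not part of any BLAS API:

```java
// Illustrative BLAS-style sparse vector: nonzero values plus their positions.
class SparseVector {
    final int n;            // the supposed (logical) length N
    final int[] indices;    // positions of the nonzero entries, in any order
    final double[] values;  // values[k] is the entry at position indices[k]

    SparseVector(int n, int[] indices, double[] values) {
        this.n = n;
        this.indices = indices;
        this.values = values;
    }

    // Expand into a dense array; useful to check that reordering the
    // (index, value) pairs does not change the encoded vector.
    double[] toDense() {
        double[] dense = new double[n];
        for (int k = 0; k < indices.length; k++) {
            dense[indices[k]] = values[k];
        }
        return dense;
    }
}
```

With this sketch, the two orderings shown above expand to the same dense vector for any N > 215.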
encodes the same vector. This remark gives enough information to reconstruct the algorithm for the scalar product. Given two sparse vectors encoded by the data int[] indices, double[] values and int[] jndices, double[] walues, one would calculate their scalar product along the lines of this "code":
double sum = 0;
for (int i = 0; i < indices.length; i++)
{
    for (int j = 0; j < jndices.length; j++)
    {
        if (indices[i] == jndices[j])
        {
            sum += values[i] * walues[j];
        }
    }
}
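For reference, here is the same quadratic algorithm wrapped into a runnable method (the method and class names are my own, chosen for this sketch):

```java
// Quadratic sparse-sparse scalar product, exactly as sketched above:
// compare every index of the first vector against every index of the second.
class SparseSparse {
    static double sparseDot(int[] indices, double[] values,
                            int[] jndices, double[] walues) {
        double sum = 0;
        for (int i = 0; i < indices.length; i++) {
            for (int j = 0; j < jndices.length; j++) {
                if (indices[i] == jndices[j]) {
                    sum += values[i] * walues[j];
                }
            }
        }
        return sum;
    }
}
```

For example, vectors with nonzeros {0: 2.0, 5: 3.0} and {5: 4.0, 7: 1.0} overlap only at index 5, so their product is 3.0 * 4.0 = 12.0.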
which gives a total of indices.length * jndices.length * 2 + indices.length conditional branches. This means that merely to keep up with the dense algorithm, your vectors should have at most on the order of sqrt(N) nonzero entries. The point here is that the dependency on N is already nonlinear, so there is no sense in asking whether you need 1%, 10% or 25% filling. A 10% filling is perfect for vectors of length 10, still sort of OK for length 50, and already a total ruin for length 100.
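The crossover can be checked with a bit of arithmetic. Using the counts derived above (about N branches for the dense product and 2*k*k + k for the sparse one, where k is the number of nonzeros per vector), a 10% filling gives:

```java
// Compare the branch counts derived in the text: dense ~ N branches,
// sparse ~ 2*k^2 + k branches, with k nonzero entries per vector.
class BranchCounts {
    static long dense(int n) { return n; }
    static long sparse(int k) { return 2L * k * k + k; }

    public static void main(String[] args) {
        for (int n : new int[]{10, 50, 100}) {
            int k = n / 10;   // 10% filling
            System.out.println("N=" + n + ": dense=" + dense(n)
                    + ", sparse=" + sparse(k));
        }
    }
}
```

This reproduces the 10/50/100 pattern above: 3 vs 10 branches at length 10, 55 vs 50 at length 50, and 210 vs 100 at length 100.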
UPD. In this code snippet you have an if branch, and the probability of taking the wrong path is about 50%. Thus, a scalar product of two sparse vectors will amount to somewhere between 0.5 and 1 times the average number of nonzero entries per sparse vector in pipeline restarts, depending on how sparse your vectors are. The numbers are to be adjusted: in an if statement without an else, the shorter branch ("do nothing") will be predicted first, but still.
Note that the most efficient operation is a scalar product of a sparse and a dense vector. Given a sparse vector encoded by indices and values and a dense vector dense, your code will look like
double sum = 0;
for (int i = 0; i < indices.length; i++)
{
    sum += values[i] * dense[indices[i]];
}
i.e. you'll have indices.length conditional branches, which is good.
UPD. Once again, I bet you'll have at most one pipeline restart per operation. Note also that, as far as I know, modern processors speculatively execute along the predicted branch, so a correctly predicted branch costs next to nothing and you mostly pay for the mispredictions.
Now, when multiplying a matrix by a vector, you basically take #rows scalar products of vectors. Multiplying a matrix by a matrix amounts to taking #((nonzero) columns in the second matrix) matrix-by-vector multiplications. You are welcome to figure out the complexity by yourself.
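As a sketch of the matrix-by-vector case, assume the matrix is stored as a dense array of sparse rows, with each row in the index/value format used for sparse vectors above (the class and parameter names here are illustrative). Each output entry is then one sparse-times-dense scalar product:

```java
// Matrix-by-vector product for a matrix stored as a dense array of sparse
// rows: rowIndices[r] and rowValues[r] encode row r in the index/value
// format. One sparse-times-dense scalar product per row.
class SparseRowMatVec {
    static double[] multiply(int[][] rowIndices, double[][] rowValues,
                             double[] dense) {
        double[] result = new double[rowIndices.length];
        for (int r = 0; r < rowIndices.length; r++) {
            double sum = 0;
            for (int i = 0; i < rowIndices[r].length; i++) {
                sum += rowValues[r][i] * dense[rowIndices[r][i]];
            }
            result[r] = sum;
        }
        return result;
    }
}
```

For example, the matrix [[0, 2, 0], [1, 0, 3]] stored this way and multiplied by the dense vector [1, 1, 1] yields [2, 4].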
And here is where all the black-magic deep theory of different matrix storage begins. You may store your sparse matrix as a dense array of sparse rows, as a sparse array of dense rows, or as a sparse array of sparse rows. The same goes for columns. All the funny abbreviations from Scipy cited in the question have to do with exactly that.
You will "always" have an advantage in speed if you multiply a matrix built of sparse rows by a dense matrix, or by a matrix of dense columns. You may want to store your sparse matrix data as dense vectors of diagonals - as is the case for convolutional neural networks - and then you'll need completely different algorithms. You may want to make your matrix a block matrix - BLAS does - and get a reasonable computation boost. You may want to store your data as two matrices - say, a diagonal one and a sparse one, which is the case for the finite element method. You could make use of sparsity for general neural networks (like feed-forward, extreme learning machine or echo state networks) if you always multiply a row-stored matrix by a column vector, but avoid multiplying matrices. And you will "always" get an advantage from using sparse matrices if you follow the rule of thumb above - it holds for finite elements and convolutional networks, but fails for reservoir computing.