
I am performing several matrix multiplications of an NxN sparse (~1-2%) matrix, let's call it B, with an NxM dense matrix, let's call it A (where M < N). Both N and M are large, on the order of several thousand. I am running Matlab 2013a.

Now, usually, matrix multiplications and most other matrix operations are implicitly parallelized in Matlab, i.e. they make use of multiple threads automatically. This appears NOT to be the case if either of the matrices is sparse (see e.g. this StackOverflow discussion - with no answer for the intended question - and this largely unanswered MathWorks thread). This is a rather unhappy surprise for me.

We can verify that multithreading has no effect for sparse matrix operations with the following code:

clc; clear all; 

N = 5000;         % set matrix sizes
M = 3000;       
A = randn(N,M);   % create dense random matrices
B = sprand(N,N,0.015); % create sparse random matrix
Bf = full(B);     % create a dense form of the otherwise sparse matrix B

for i=1:3 % test for 1, 2, and 4 threads
  m(i) = 2^(i-1);
  maxNumCompThreads(m(i)); % set the thread count available to Matlab
  tic                      % starts timer
    y = B*A; 
  walltime(i) = toc;       % wall clock time
  speedup(i) = walltime(1)/walltime(i);
end

% display number of threads vs. speed up relative to just a single thread
[m',speedup']

This produces the following output, which illustrates that there is no difference between using 1, 2, and 4 threads for sparse operations:

threads   speedup
1.0000    1.0000
2.0000    0.9950
4.0000    1.0155

If, on the other hand, I replace B by its dense form, referred to as Bf above, I get significant speedup:

threads   speedup
1.0000    1.0000
2.0000    1.8894
4.0000    3.4841

(illustrating that matrix operations for dense matrices in Matlab are indeed implicitly parallelized)

So, my question: is there any way at all to access a parallelized/threaded version of matrix operations for sparse matrices (in Matlab) without converting them to dense form? I found one old suggestion involving .mex files at MathWorks, but the links seem to be dead, and the approach is not well documented and has received no feedback. Any alternatives?

It seems to be a rather severe restriction of the implicit parallelism functionality, since sparse matrices abound in computationally heavy problems, and multithreaded functionality is highly desirable in these cases.

  • http://www.mathworks.com/help/matlab/math/sparse-matrix-operations.html and http://www.mathworks.com/matlabcentral/answers/105015-how-can-i-do-a-memory-efficient-sparse-matrix-multiplication – Yvon Aug 15 '14 at 16:01
  • @Yvon In the links I see a general description of how things work, yet I can't make out how it is relevant for the question. – Dennis Jaheruddin Nov 24 '14 at 15:01
  • Just a silly afterthought: Does it help to make the full matrix sparse? – Dennis Jaheruddin Nov 24 '14 at 15:10
  • 1
    @DennisJaheruddin it helps in terms of speedup, but it is not practical in terms of memory. that's the reason for the question. – Daniyar Nov 24 '14 at 15:47
  • @Daniyar The information that M was large was hidden by a formatting problem, have edited the question to fix this. -- Still, going from full to sparse should only make the matrix twice as large to store, so unless you are close to the memory limit it may be an interesting approach. – Dennis Jaheruddin Nov 24 '14 at 16:03
  • @DennisJaheruddin I have a very similar problem, but for me, going dense is 1000x more than sparse, and not possible at all. :( – Daniyar Nov 24 '14 at 16:12
  • @Daniyar I actually suggested going from dense to sparse, rather than from sparse to dense. Not sure if it will help but at least then you would be multiplying two matrices of the same type. – Dennis Jaheruddin Nov 24 '14 at 16:14
  • @DennisJaheruddin Oh. Same issue though. – Daniyar Nov 24 '14 at 16:25
  • Have you tried [SuiteSparse](http://faculty.cse.tamu.edu/davis/suitesparse.html)? In C there's also [CSparse](http://people.sc.fsu.edu/~jburkardt/c_src/csparse/csparse.html). – horchler Nov 24 '14 at 17:57

3 Answers


MATLAB already uses SuiteSparse by Tim Davis for many of its operations on sparse matrices (for example see here), but none of those routines, I believe, are multithreaded.

Usually computations on sparse matrices are memory-bound rather than CPU-bound. So even if you use a multithreaded library, I doubt you will see huge benefits in terms of performance, at least not comparable to those of libraries specialized in dense matrices...

After all, sparse matrix formats are designed with different goals in mind than regular dense matrices: efficient memory storage is often more important than raw speed.
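A rough sketch of that storage trade-off, using the same sizes as in the question (exact byte counts reported by `whos` depend on the MATLAB version and platform):

```matlab
% Compare the memory footprint of sparse vs. dense storage
N = 5000;
B  = sprand(N,N,0.015);  % ~1.5% nonzeros, stored in CSC format
Bf = full(B);            % dense copy: 8*N^2 bytes regardless of sparsity
whos B Bf                % B needs roughly 12-16 bytes per nonzero only
```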


I did a quick search online, and found a few implementations out there:

Amro
  • The libraries that support sparse matrices do not necessarily support multithreaded multiplication of sparse matrices. – Daniyar Nov 24 '14 at 19:16
  • 1
    @Daniyar: maybe not the reference implementations, but optimized ones like those in Intel MKL are multithreaded for both dense and sparse routines: https://software.intel.com/en-us/articles/parallelism-in-the-intel-math-kernel-library. Not to mention the GPU libraries, which are inherently parallel. – Amro Nov 24 '14 at 19:50
  • @Daniyar: by the way, MATLAB does have support for parallel sparse matrices in [distributed-memory](https://en.wikipedia.org/wiki/Distributed_memory) scenarios (as opposed to shared-memory multi-threads) with the Parallel Computing Toolbox http://www.mathworks.com/help/distcomp/sparse.html. This is based on [ScaLAPACK](http://www.netlib.org/scalapack/index.html) library. – Amro Nov 24 '14 at 20:24
  • Wouldn't the parallel computation allow multithreading if used on a single computer? (As I suggested in my answer) – Dennis Jaheruddin Nov 27 '14 at 13:48
  • 1
    @DennisJaheruddin: which implementation are you talking about? Intel MKL, then yes. But MATLAB does not use the sparse routines from MKL, rather the SuiteSparse library. As for the PCT toolbox `distributed` arrays, those are parallelized with multi-processes not multi-threads (think [MPI](https://en.wikipedia.org/wiki/Message_Passing_Interface)) – Amro Nov 27 '14 at 13:54
  • I am clearly out of my depth as I don't know half the words from your reply!! My line of thought: `parfor` attempts to calculate things in parallel, and if you just have 1 computer, I assumed it would result in use of multiple threads. I suppose they are in fact multiple processes, but I think that should not matter, as I suggested doing `A*B` parallel to `C*D` where there is no memory overlap anyway. – Dennis Jaheruddin Nov 27 '14 at 14:07
  • @DennisJaheruddin: `parfor` will run in parallel (with multiple background processes either locally or even on a remote computer cluster) when you have independent loop iterations, but it cannot divide a matrix in an expression inside the loop block. If you want that kind of parallelism, you'll have to explicitly use `distributed` arrays: http://www.mathworks.com/help/distcomp/distributed-arrays-and-spmd.html – Amro Nov 27 '14 at 14:25
  • @DennisJaheruddin: If you are interested in seeing an example, I've answered a question in the past using `smpd` with different types of "distributed" arrays from the PCT Toolbox: http://stackoverflow.com/a/21130157/97160 – Amro Nov 27 '14 at 14:35

I ended up writing my own mex file with OpenMP for multithreading. Code as follows. Don't forget to use -largeArrayDims and /openmp (or -fopenmp) flags when compiling.

#include <omp.h>
#include "mex.h"
#include "matrix.h"

#define ll long long

/* C = S*B, where S is an m-by-p sparse matrix in CSC format
 * (values A, row indices irs, column pointers jcs) and B is a
 * p-by-n dense matrix; C is m-by-n, all column-major. */
void omp_smm(double* A, double* B, double* C, ll m, ll p, ll n, ll* irs, ll* jcs)
{
    for (ll j=0; j<p; ++j)
    {
        ll istart = jcs[j];
        ll iend = jcs[j+1];
        #pragma omp parallel for
        for (ll ii=istart; ii<iend; ++ii)
        {
            ll i = irs[ii];
            double aa = A[ii];
            for (ll k=0; k<n; ++k)
            {
                C[i+k*m] += B[j+k*p]*aa;
            }
        }
    }
}


void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    double *A, *B, *C; /* pointers to input & output matrices*/
    size_t m,n,p;      /* matrix dimensions */

    A = mxGetPr(prhs[0]); /* first sparse matrix */
    B = mxGetPr(prhs[1]); /* second full matrix */

    mwIndex * irs = mxGetIr(prhs[0]);
    mwIndex * jcs = mxGetJc(prhs[0]);

    m = mxGetM(prhs[0]);  
    p = mxGetN(prhs[0]);
    n = mxGetN(prhs[1]);

    /* create output matrix C */
    plhs[0] = mxCreateDoubleMatrix(m, n, mxREAL);
    C = mxGetPr(plhs[0]);

    omp_smm(A,B,C, m, p, n, (ll*)irs, (ll*)jcs);
}
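Assuming the file is saved as, say, `smm.c` (the filename here is hypothetical), it can be compiled and sanity-checked from MATLAB roughly as follows. The `COMPFLAGS` syntax is for the MSVC compiler on Windows; with gcc you would pass `-fopenmp` via `CFLAGS`/`LDFLAGS` instead:

```matlab
% Compile the mex file (MSVC shown; with gcc use:
%   mex -largeArrayDims CFLAGS="\$CFLAGS -fopenmp" LDFLAGS="\$LDFLAGS -fopenmp" smm.c )
mex -largeArrayDims COMPFLAGS="$COMPFLAGS /openmp" smm.c

% Check the result against the built-in sparse-dense product
N = 5000; M = 3000;
B = sprand(N,N,0.015);   % sparse N-by-N
A = randn(N,M);          % dense  N-by-M
C    = smm(B, A);        % multithreaded mex version
Cref = B*A;              % single-threaded built-in
norm(C - Cref, 'fro')    % should be ~0 up to floating-point rounding
```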
Daniyar
  • + 1 for providing the code, even though this [naive algorithm](https://en.wikipedia.org/wiki/Matrix_multiplication#Algorithms_for_efficient_matrix_multiplication) is inefficient, with cubic running time `O(m*p*n)`. It would be interesting to see how this compares against the optimized (and multithreaded) implementation in Intel MKL, namely the routine: [mkl_dcsrmm](https://software.intel.com/en-us/node/468594) (matrix-matrix product `C=A*B` with `A` general sparse matrix in CSR format and `B` and `C` dense matrices) – Amro Dec 03 '14 at 17:37
  • @Amro this is naive, since it uses matlab's data structure. The non-naive version may perform better, however it depends on the sparsity of the matrix. – Daniyar Dec 04 '14 at 13:46
  • 1
    @Daniyar Could you include a benchmark? Preferably one showing a typical case where this is faster, and one where the basic functionality is faster? – Dennis Jaheruddin Apr 17 '15 at 08:13
  • @DennisJaheruddin this should be at least as fast as the default matlab multiplication just because of multi-threading. – Daniyar Apr 27 '15 at 10:46
  • @Daniyar Perhaps it 'should be', but it would be nice to see that it really is (and how much faster). – Dennis Jaheruddin Apr 28 '15 at 09:38

On MATLAB Central the same question was asked, and this answer was given:

I believe the sparse matrix code is implemented by a few specialized TMW engineers rather than an external library like BLAS/LAPACK/LINPACK/etc... 

Which basically means that you are out of luck.


However I can think of some tricks to achieve faster computations:

  1. If you need to do several multiplications: do multiple multiplications at once and process them in parallel.
  2. If you just want to do one multiplication: cut the matrix into pieces (for example a top half and a bottom half), do the calculations on the parts in parallel, and combine the results afterwards.

Probably these solutions will not turn out to be as fast as properly implemented multithreading, but hopefully you can still get a speedup.
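A minimal sketch of trick 2, splitting the sparse matrix into row blocks and multiplying the pieces in parallel (this assumes the Parallel Computing Toolbox is available; the two-way split is just for illustration):

```matlab
% Split B (N-by-N sparse) into top and bottom halves by rows,
% multiply each block by the dense A in parallel, then reassemble.
h = floor(N/2);
parts = {B(1:h,:), B(h+1:end,:)};
Y = cell(1,2);
parfor k = 1:2
    Y{k} = parts{k}*A;   % each worker computes one row block of the product
end
y = vertcat(Y{:});       % full N-by-M result, equal to B*A
```

Since each block's rows of the output are independent, no communication between workers is needed beyond the final concatenation.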

Dennis Jaheruddin