
I need to multiply big matrices of size 5000x5000 up to 20000x20000. I'm having trouble finding a library that supports sparse matrices and can still do fast multiplication.

First of all, I've read the previous question on the performance of Java matrix libraries (Performance of Java matrix math libraries?). Based on the top answer there, I decided to go with JBLAS since it was one of the fastest. In my case, it took roughly 50 seconds to multiply two 5000x5000 matrices, which is quite a lot slower than Matlab but still tolerable.

The problem is that the matrices can be quite large (up to 20k by 20k, or more), but they're generally sparse: only 30% of the elements are non-zero. JBLAS doesn't provide any sparse matrix implementation, so the memory footprint required to store a large dense matrix can get quite prohibitive. I tried switching to MTJ/Netlib, since it's supposed to be one of the better libraries in the benchmark that have sparse matrix support. The note here (https://github.com/fommil/netlib-java/) says that to get the best performance, I have to compile a native BLAS on my machine. So I downloaded OpenBLAS, compiled, and installed it. I also ran a few commands to set up the OpenBLAS library on Ubuntu 13.10:

$ cd ~/build/OpenBLAS
$ make
$ sudo make install PREFIX=/usr/local/openblas
$ sudo sh -c 'echo "/usr/local/openblas/lib" > /etc/ld.so.conf.d/openblas.conf'
$ sudo ldconfig
$ sudo update-alternatives --install /usr/lib/libblas.so.3 libblas.so.3 /usr/local/openblas/lib/libopenblas.so 90
$ sudo update-alternatives --config libblas.so.3

I selected my compiled OpenBLAS library in the last update-alternatives step. I assume that after this, Netlib picks up my compiled OpenBLAS library and uses it. I also ran the benchmark from http://r.research.att.com/benchmarks/R-benchmark-25.R and observed some speed-up between the before case (using the default BLAS from Ubuntu) and the after case (using my compiled OpenBLAS).

However, matrix-matrix multiplication performance in MTJ is still very slow. For example, I have two matrices A = 5824x5824, W = 5824x4782. I multiply them like this in Java:

Matrix AW = new FlexCompRowMatrix(A.numRows(), W.numColumns());
A.mult(W, AW);

The code has been running for more than 45 minutes now (long enough to type this whole post), and it still hasn't finished. Using JBLAS, the same matrix multiplication takes less than a minute. Is there anything I missed?

Thanks!

  • I also tried the precompiled multithreaded OpenBLAS library (http://www.personal.psu.edu/mar36/blogs/the_ubuntu_r_blog/2013/08/multi-threaded-openblas-backported-to-recent-ubuntu-releases.html), thinking that I screwed up during the compilation process. But nothing changed; matrix-matrix multiplication using MTJ/Netlib is still slow. Perhaps the question is, how come JBLAS can do it so (relatively) fast without me having to configure anything at all? – x112341 Nov 15 '13 at 00:08
  • May I ask what the purpose is of multiplying 20000x20000 matrixes? I don't see any applications with my knowledge (yet). – Martijn Courteaux Nov 15 '13 at 00:11
  • Well, 20k by 20k is quite a stretch. Still, a 5k by 5k matrix is quite reasonable I think, and it's running forever ... – x112341 Nov 15 '13 at 00:18
  • And what information do you get by multiplying a 5k by 5k matrix? Does it solve a mathematical or physical problem? Is this meant to compute something related to real life? – Martijn Courteaux Nov 15 '13 at 00:21
  • @MartijnCourteaux, there are thousands of problems that require multiplying matrices *much, much* larger than 20k elements. Everything ranging from PDE solutions through chemical reaction modelling, to DNA sequencing. Not to mention basic matrix decompositions. Not only is it a common problem, it's incredibly important in many many applications. – davin Nov 15 '13 at 23:26
  • @jwd, sparse matrix storage, retrieval & computation is a beautiful subject. Not every out-of-the-box solution will be appropriate for every situation. If you have knowledge of the matrix structure, you can make a much more informed decision about what type of data structures you use (especially symmetries). Sometimes it is easier to write your own library functions very quickly that can perform very well. 30% non-zero isn't particularly sparse. That means you should expect slow runtimes, although it also means that if you're using a sparse representation you should take that into account. – davin Nov 15 '13 at 23:35
  • How about using OpenBlas with JBlas? – Aleksandr Dubinsky Dec 29 '13 at 15:19
  • The BLAS specification in general only defines formats and operations for dense and very special banded matrices. You should use a library for sparse matrices. -- Often one can solve the tasks involving sparse matrices by using (approximative, iterative) algorithms using only matrix-vector products. – Lutz Lehmann Dec 29 '13 at 16:25
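The last comment's point (iterative methods need only matrix-vector products, for which sparse storage works well) can be sketched in plain Java with a CSR (compressed sparse row) layout. This is a generic illustration, not MTJ's internal format; the array names are made up:

```java
import java.util.Arrays;

// Minimal CSR (compressed sparse row) matrix-vector product, y = A*x.
// The non-zeros of row i live at positions rowPtr[i]..rowPtr[i+1]-1
// of val/colIdx, so only non-zero entries are stored or touched.
public class CsrDemo {
    static double[] multiply(int n, int[] rowPtr, int[] colIdx, double[] val, double[] x) {
        double[] y = new double[n];
        for (int i = 0; i < n; i++) {
            double sum = 0.0;
            for (int k = rowPtr[i]; k < rowPtr[i + 1]; k++) {
                sum += val[k] * x[colIdx[k]];
            }
            y[i] = sum;
        }
        return y;
    }

    public static void main(String[] args) {
        // A = [[2, 0, 1],
        //      [0, 3, 0],
        //      [4, 0, 5]], stored row by row with zeros omitted.
        int[] rowPtr = {0, 2, 3, 5};
        int[] colIdx = {0, 2, 1, 0, 2};
        double[] val = {2, 1, 3, 4, 5};
        double[] y = multiply(3, rowPtr, colIdx, val, new double[]{1, 1, 1});
        System.out.println(Arrays.toString(y)); // [3.0, 3.0, 9.0]
    }
}
```

Each matrix-vector product costs O(nnz) rather than O(n^2), which is why iterative solvers built on this primitive stay cheap even at 20k by 20k.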

2 Answers


JBLAS does dense matrix operations. MTJ does both dense and sparse. Using "sparse" matrices in a dense way is slow, and FlexCompRowMatrix creates a sparse matrix.

What you want to do, to compare directly to JBLAS, is:

Matrix a = new DenseMatrix(5000,5000);
Matrix b = new DenseMatrix(5000,5000);
Matrix c = new DenseMatrix(5000,5000);

a.multAdd(b, c);  // c += a * b, on the dense BLAS path

The performance using MTJ+OpenBLAS should be about the same as Matlab.
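As a back-of-envelope check on the storage side (a comment above notes that 30% non-zero isn't particularly sparse), here is a rough estimate assuming 8-byte doubles and a CSR-style layout with 4-byte column indices and row pointers; exact overhead varies by library:

```java
// Rough storage estimate for a 20000x20000 matrix at 30% density:
// dense = one double per entry; CSR = one double + one int per non-zero,
// plus one int per row pointer.
public class StorageEstimate {
    public static void main(String[] args) {
        long n = 20_000;
        long dense = n * n * 8;                   // 8-byte doubles
        long nnz = n * n * 30 / 100;              // 30% non-zero
        long csr = nnz * (8 + 4) + (n + 1) * 4;   // values + col indices + row pointers
        System.out.println("dense: " + dense / (1 << 20) + " MiB"); // dense: 3051 MiB
        System.out.println("csr:   " + csr / (1 << 20) + " MiB");   // csr:   1373 MiB
    }
}
```

So even a sparse format only cuts the footprint by about half at 30% density, while adding indexed-access overhead to every operation, which is why the dense route above is the fairer comparison here.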

Aleksandr Dubinsky

See http://jeshua.me/blog/NetlibJavaJNI, and note that you may have to update the native package names in the test to demonstrate its use.

For example, you may need to change:

Class javaBlasClass = Class.forName("org.netlib.blas.JBLAS");

to:

Class javaBlasClass = com.github.fommil.netlib.BLAS.class;
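A quick way to see which backend netlib-java actually picked up is to print the class of the loaded BLAS instance. This sketch uses reflection so it compiles even without netlib-java on the classpath; the class name `com.github.fommil.netlib.BLAS` and its `getInstance()` method are from the netlib-java README:

```java
// Sketch: report which BLAS implementation netlib-java loaded.
public class BlasCheck {
    public static void main(String[] args) {
        try {
            Class<?> blas = Class.forName("com.github.fommil.netlib.BLAS");
            Object impl = blas.getMethod("getInstance").invoke(null);
            // Typically prints a native wrapper class when a system BLAS was
            // found, or the pure-Java fallback class otherwise.
            System.out.println(impl.getClass().getSimpleName());
        } catch (ReflectiveOperationException e) {
            System.out.println("netlib-java not on classpath");
        }
    }
}
```

If this reports the pure-Java fallback rather than a native implementation, the update-alternatives setup above never reached the JVM.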

nichole