4

Is there a way to improve the boost ublas product performance?

I have two matrices A, B which I want to multiply/add/subtract/...

In MATLAB vs. C++ I get the following times [s] for 2000x2000 matrix operations:

OPERATION | MATLAB | C++ (MSVC10)
A + B     |  0.04  |  0.04
A - B     |  0.04  |  0.04
AB        |  1.0   | 62.66
A'B'      |  1.0   | 54.35

Why is there such a huge performance loss here?

The matrices are plain real doubles, but I also need products of positive definite, symmetric, and rectangular matrices.

EDIT: The code is trivial

matrix<double> A( 2000, 2000 );
// Fill matrix A
matrix<double> B = A;

matrix<double> C = A + B;
matrix<double> D = A - B;
matrix<double> E = prod( A, B );
matrix<double> F = prod( trans( A ), trans( B ) );

EDIT 2: The results are mean values of 10 tries. The stddev was less than 0.005.

I would expect a factor of maybe 2-3, but not 50 (!)

EDIT 3: Everything was benchmarked in Release mode ( NDEBUG / move semantics / ... ).

EDIT 4: Preallocating matrices for the product results did not affect the runtime.

  • Be sure to do a clean Matlab rerun, it tends to cache... well, everything. Completely unrelated, but you should be able to get decent performance, and an easy syntax from [Eigen](http://eigen.tuxfamily.org/index.php?title=Main_Page). (I'm interested how it relates to your little benchmark, hint hint :-) – rubenvb Oct 17 '11 at 19:15
  • 2
    I would expect ~2000x time for multiplication vs addition.. – ruslik Oct 17 '11 at 19:38
  • 2
    Did you remember to turn on release mode for uBLAS? See the FAQ at the end of [this page](http://www.boost.org/doc/libs/1_47_0/libs/numeric/ublas/doc/index.htm), which indicates that you need '-DNDEBUG' or some other flags to make uBLAS compile for release. – Dave S Oct 17 '11 at 19:56
  • You don't know what role memory-management is playing here. `prod` is having to allocate a 32 MB matrix, and so is `trans`, twice, and then you're doing all that 10 times. Take a few [stackshots](http://stackoverflow.com/questions/375913/what-can-i-use-to-profile-c-code-in-linux/378024#378024) and see what it's *really* doing. My dumb guess is if you pre-allocate the matrices you get a better result. – Mike Dunlavey Oct 17 '11 at 20:40
  • @Mike - `ublas` uses expression templates, so preallocating is not likely to make things a lot faster unless coder is wilfully wasting matrix copies (not the case here). – Steve Townsend Oct 17 '11 at 21:42
  • I found some nice benchmarks from the folks at Eigen [here](http://eigen.tuxfamily.org/index.php?title=Benchmark) – Kyle Heuton Mar 26 '13 at 20:39
  • See the related question (and better answers) [here](http://stackoverflow.com/questions/7798285/boost-ublas-matrix-product-extremely-slow/11382800#comment33373246_11382800). In particular, try using axpy_prod. – John Doe Feb 25 '14 at 16:36

4 Answers

4

Post your C++ code for advice on any possible optimizations.

You should be aware, however, that Matlab is highly specialized for its designed task, and you are unlikely to be able to match it using Boost. On the other hand, Boost is free, while Matlab decidedly is not.

I believe the best Boost performance can be had by binding the uBLAS code to an underlying LAPACK implementation.

Steve Townsend
  • 4
    +1 "*I believe that best Boost performance can be had by binding the uBlas code to an underlying LAPACK implementation.*" This is the real answer. – ildjarn Oct 17 '11 at 19:17
3

You should use noalias on the left-hand side of matrix multiplications in order to get rid of unnecessary copies.

Instead of E = prod(A,B); use noalias(E) = prod(A,B);

From documentation:

If you know for sure that the left hand expression and the right hand expression have no common storage, then assignment has no aliasing. A more efficient assignment can be specified in this case: noalias(C) = prod(A, B); This avoids the creation of a temporary matrix that is required in a normal assignment. 'noalias' assignment requires that the left and right hand side be size conformant.

Stephan
tunc
1

There are many efficient BLAS implementations, such as ATLAS, GotoBLAS, and MKL; use one of them instead.

I haven't inspected the code, but I'd guess ublas::prod(A, B) uses a naive triple loop with no blocking, which is not cache friendly. If that's true, prod(A, trans(B)) will be much faster than the others.

If cblas is available, use cblas_dgemm to do the calculation. If not, you can simply rearrange the data, i.e., use prod(A, trans(B)) instead.

changsheng
0

You don't know what role memory-management is playing here. prod is having to allocate a 32 MB matrix, and so is trans, twice, and then you're doing all that 10 times. Take a few stackshots and see what it's really doing. My dumb guess is if you pre-allocate the matrices you get a better result.

Other ways matrix multiplication could be sped up are:

  • pre-transposing the left-hand matrix, to be cache-friendly, and

  • skipping over zeros. Only if A(i,k) and B(k,j) are both non-zero is any value contributed.

Whether this is done in uBlas is anybody's guess.

Mike Dunlavey