
The Eigen version uses "true" fixed-size matrices and vectors, a better algorithm (LDLT versus LU in uBlas), and it uses SIMD instructions internally. So why is it slower than uBlas in the following example?

I am sure I am doing something wrong - Eigen MUST be faster, or at least comparable.

#include <boost/numeric/ublas/matrix.hpp>
#include <boost/numeric/ublas/vector.hpp>
#include <boost/numeric/ublas/lu.hpp>
#include <boost/numeric/ublas/symmetric.hpp>
#include <boost/progress.hpp>
#include <Eigen/Dense>
#include <iostream>

using namespace boost;
using namespace std;
const int n=9;
const int total=100000;

void test_ublas()
{
    using namespace boost::numeric::ublas;
    cout << "Boost.ublas ";
    double r=1.0;
    {
        boost::progress_timer t;
        for(int j=0;j!=total;++j)
        {
            //symmetric_matrix< double,lower,row_major,bounded_array<double,(1+n)*n/2> > A(n,n);
            matrix<double,row_major,bounded_array<double,n*n> > A(n,n);
            permutation_matrix< unsigned char,bounded_array<unsigned char,n> > P(n);
            bounded_vector<double,n> v;
            for(int i=0;i!=n;++i)
                for(int k=0;k!=n;++k)
                    A(i,k)=0.0;
            for(int i=0;i!=n;++i)
            {
                A(i,i)=1.0+i;
                v[i]=i;
            }
            lu_factorize(A,P);
            lu_substitute(A,P,v);
            r+=inner_prod(v,v);
        }
    }
    cout << r << endl;
}

void test_eigen()
{
    using namespace Eigen;
    cout << "Eigen ";
    double r=1.0;
    {
        boost::progress_timer t;
        for(int j=0;j!=total;++j)
        {
            Matrix<double,n,n> A;
            Matrix<double,n,1> b;
            A.setZero(); // zero the off-diagonal entries, as the uBlas version does
            for(int i=0;i!=n;++i)
            {
                A(i,i)=1.0+i;
                b[i]=i;
            }
            Matrix<double,n,1> x=A.ldlt().solve(b);
            r+=x.dot(x);
        }
    }
    cout << r << endl;
}

int main()
{
    test_ublas();
    test_eigen();
}

Output is:

Boost.ublas 0.50 s
488184
Eigen 2.66 s
488184

(Visual Studio 2010 x64 Release)


EDIT:

For

const int n=4;
const int total=1000000;

Output is:

Boost.ublas 0.67 s
1.25695e+006
Eigen 0.40 s
5.4e+007

I guess such behaviour occurs because the uBlas version computes the factorization in place, while the Eigen version creates a COPY of the matrix (LDLT) - so it fits the cache worse.

Is there any way to do in-place computation in Eigen? Or maybe there are other ways to improve it?


EDIT:

Following Fezvez's advice and using LLT instead of LDLT, I get:

Eigen 0.16 s
488184

That is good, but it still does an unnecessary stack allocation of the matrix:

sizeof(A.llt()) == 656

I would prefer to avoid it - it should be even faster.


EDIT:

I have removed the allocation by subclassing LDLT (its internal matrix is protected) and filling it directly. Now the result for LDLT is:

Eigen 0.26 s
488209

It works, but it is a workaround, not a real solution...

Subclassing LLT also works, but does not have as great an effect.

qble

2 Answers


Your benchmark is not fair, because the uBlas version solves in place, while the Eigen version could easily be adjusted to do the same:

b=A.ldlt().solve(b);
r+=b.dot(b);

Compiling with g++-4.6 -O2 -DNDEBUG, I get (on a 2.3GHz CPU):

Boost.ublas 0.15 s
488184

Eigen 0.08 s
488184

Also make sure you compiled with optimizations enabled, and with SSE2 enabled if you are on a 32-bit system (or using a 32-bit toolchain).

EDIT: I also tried to avoid the matrix copy, but this resulted in no gain at all.

Also, increasing n increases the speed difference (here n=90):

Boost.ublas 0.47 s
Eigen 0.13 s
ggael
  • "version could be easily adjusted to do so" - there is a better alternative to this - solveInPlace. But it is all about the vectors in the equations, not about the matrix - so it has no effect at all. "Also make sure you compiled with optimization enabled" - yes, I checked: 1) I am running x64 - SSE is enabled by default, 2) I even tried to explicitly enable SSE - same result. "Compiling with g++" - the question was about MSVC 2010. – qble Nov 15 '12 at 09:03
  • "I also tried to avoid the matrix copy, but this results in zero gain at all." - how exactly did you do that? By subclassing? "Also, increasing n, increase the speed difference (here n=90):" - such an increase is pointless in the context of this question. When you increase N to such rather high numbers, the O(N^3) complexity of the solution itself starts to dominate. While at N=9 I am hitting some cache boundaries in the Eigen case, and that hurts performance. – qble Nov 15 '12 at 09:05
  • Well, 2.66s with MSVC and 0.08s with GCC for the same code means the problem is in the compiler, not the Eigen code. In general such overhead is caused by bad inlining. The huge speed-up you obtained when subclassing LDLT shows that your modifications helped MSVC produce better code. If you could send me the ASM generated for the initial code, we could identify the precise problem and fix it upstream. Your modifications might be interesting too. – ggael Nov 15 '12 at 10:27
  • "Well 2.66s with MSVC and 0.08s with GCC for the same code means the problem is on the compiler, not Eigen code" - some compilers can optimize more aggressively than others. However, in this particular case, the library has all the tools to avoid the unnecessary copy without relying on a mighty optimizer. What is needed is just a factorization operation that is allowed to modify the original matrix - that's all. Boost.uBlas does exactly this, and my modifications do exactly the same thing, while Eigen makes a copy of the matrix: "m_matrix = a;" - that's the problem, and it is identifiable without inspecting the ASM. – qble Nov 15 '12 at 11:05
  • Sorry to insist, but no, this copy is not the source of the performance drop: 1) I can see that this copy is not optimized away by GCC, and there is no slowdown with gcc; 2) the LLT solver also performs a copy, but you did not observe any slowdown in the LLT case; 3) you are arguing that an additional O(n^2) op could explain a 10x slowdown of an O(n^3) operation, which makes no sense; 4) 9x9 matrices allocated on the stack are too small to incriminate cache misses. – ggael Nov 15 '12 at 13:02
  • 1) some other things can be optimized; it may use a better stack layout, etc. 2) LLT uses less data - just the copy of the matrix - but LDLT uses additional data and its size is bigger; just check the source. When I modified the LLT struct by adding additional data, it also slowed down. 3) I am not arguing that the copy operation itself is the issue; I am saying that the memory layout is the issue, not the copy itself (when I increased the size of the LLT struct, there was no copy operation, just a raw data size change). 4) my LDLT change that made it much faster was about doing in-place calculations, i.e. removing one allocation. – qble Nov 15 '12 at 18:15

I followed your hint about in-place computation:

Using the exact same code, but with the functions A.llt().solveInPlace(b); and A.ldlt().solveInPlace(b); (which replace b with x), I got:

There were  100000  Eigen standard LDLT linear solvers applied in  12.658  seconds 
There were  100000  Eigen standard LLT linear solvers applied in  4.652  seconds 

There were  100000  Eigen in place LDLT linear solvers applied in  12.7  seconds 
There were  100000  Eigen in place LLT linear solvers applied in  4.396  seconds 

Perhaps an LLT solver is just more appropriate than an LDLT solver for this kind of problem? (I can see that you're dealing with matrices of size 9.)

(By the way, I answered a bit of that previous question you had about your linear solver in dimension 9, and I am very sad to see that Eigen's implementation of LDLT has so much overhead...)

B. Decoster
  • solveInPlace just reuses the vector in place, not the matrix. The matrix is still allocated. – qble Nov 14 '12 at 13:19
  • Yup, but what I wanted to tell you is that I followed your hint about solving in place. Of course, I was disappointed to read that it just replaced b instead of doing some matrix in-place stuff, but I still tried. And I stumbled upon another result: LLT seems to be better than LDLT for this particular problem =) – B. Decoster Nov 14 '12 at 13:21
  • "I am very sad to see that Eigen's implementation of LDLT has a lot of overhead" - I am also sad to see that; I have recommended Eigen to many people. – qble Nov 14 '12 at 13:31
  • I worked around the stack allocation and am now doing in-place LDLT - now Eigen is 2x faster than uBlas; check the edit at the bottom of the question. – qble Nov 14 '12 at 13:59