
I'm studying simple multiplication of two big matrices using the Eigen library. This multiplication appears to be noticeably slower than both Matlab and Python for the same size matrices.

Is there anything to be done to make the Eigen operation faster?

Problem Details

X : random 1000 x 50000 matrix

Y : random 50000 x 300 matrix

Timing experiments (on my late 2011 Macbook Pro)

Using Matlab: X*Y takes ~1.3 sec

Using Enthought Python: numpy.dot( X, Y) takes ~ 2.2 sec

Using Eigen: X*Y takes ~2.7 sec
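For reference, the NumPy timing can be reproduced with a few lines like the following. This is a scaled-down sketch (dimensions reduced from 1000 x 50000 x 300 so it finishes quickly); scale the sizes back up to match the experiment above:

```python
# Scaled-down sketch of the numpy.dot timing experiment.
# Dimensions are reduced from 1000 x 50000 x 300 to keep it fast;
# restore the original sizes to reproduce the ~2.2 sec measurement.
import time
import numpy as np

n, k, m = 200, 5000, 60  # reduced from 1000, 50000, 300

X = np.random.rand(n, k)
Y = np.random.rand(k, m)

t0 = time.time()
Z = np.dot(X, Y)
elapsed = time.time() - t0

print("numpy.dot took %.3f sec, result shape %s" % (elapsed, Z.shape))
```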

Eigen Details

You can get my Eigen code (as a MEX function): https://gist.github.com/michaelchughes/4742878

This MEX function reads in two matrices from Matlab, and returns their product.

Running this MEX function without the matrix product operation (i.e. just doing the IO) produces negligible overhead, so the IO between the function and Matlab doesn't explain the big performance gap. It's clearly the actual matrix product operation.

I'm compiling with g++, with these optimization flags: "-O3 -DNDEBUG"

I'm using the latest stable Eigen header files (3.1.2).
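For concreteness, the full compile line for the standalone version presumably looks something like this (the Eigen include path is an assumption; point `-I` at wherever your 3.1.2 headers actually live):

```shell
# Hypothetical compile line for the standalone benchmark.
# The include path is an assumption -- adjust to your Eigen 3.1.2 install.
g++ -O3 -DNDEBUG -I/usr/local/include/eigen3 MatProdEigen.cpp -o MatProdEigen
```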

Any suggestions on how to improve Eigen's performance? Can anybody replicate the gap I'm seeing?

UPDATE: The compiler really seems to matter. The original Eigen timing was done using Apple XCode's version of g++: llvm-g++-4.2.

When I use g++-4.7 downloaded via MacPorts (same CXXOPTIMFLAGS), I get 2.4 sec instead of 2.7.

Any other suggestions of how to compile better would be much appreciated.

You can also get raw C++ code for this experiment: https://gist.github.com/michaelchughes/4747789

./MatProdEigen 1000 50000 300

reports 2.4 seconds under g++-4.7

Mike Hughes
  • Do you know what algorithm it implements? It looks like it may just be using a naive matrix multiplication algorithm. One other thing to try is enabling auto-vectorization: http://gcc.gnu.org/projects/tree-ssa/vectorization.html (not on by default, I don't think... well, maybe, not sure). If you're on an Intel machine, try the Intel compiler; I've noticed that it always outperforms everyone else in optimization. Also see http://eigen.tuxfamily.org/index.php?title=FAQ#Vectorization – thang Feb 09 '13 at 07:48
  • @thang: Eigen was designed for linear algebra, so I'd be surprised if the algorithm used is that bad. tree vectorization is enabled by default with the "-O3" optimization flag I'm using according to your link, so that's not the issue AFAIK. I might try Intel compiler if no other suggestions crop up. – Mike Hughes Feb 10 '13 at 00:05
  • @MikeHughes, you could also try plotting the growth rate as the matrix size increases, which may give some hints as to what's going on. That should indicate which algorithm it uses. Or dig into their source or documentation. – thang Feb 10 '13 at 01:45
  • Hi, it took about 260 sec to run the C++ test code on my machine (VS2012 on Windows, Core i5-4570 processor), while the matrix multiply in Matlab also took about 1.3 sec. That's quite weird. – user978112 Apr 06 '15 at 03:59

3 Answers


First of all, when doing performance comparisons, make sure you disable turbo-boost (TB). On my system, using gcc 4.5 from MacPorts and with turbo-boost disabled, I get 3.5 s, which corresponds to 8.4 GFLOPS, while the theoretical peak of my 2.3 GHz Core i7 is 9.2 GFLOPS, so that's not too bad.

Matlab is based on Intel MKL, and judging by the reported performance, it is clearly using a multithreaded version. It is unlikely that a small library such as Eigen can beat Intel on its own CPU!

Numpy can use any BLAS library: ATLAS, MKL, OpenBLAS, eigen-blas, etc. I guess that in your case it was using ATLAS, which is also fast.

Finally, here is how you can get better performance: enable multi-threading in Eigen by compiling with -fopenmp. By default, Eigen uses the number of threads defined by OpenMP. Unfortunately, that number corresponds to the number of logical cores, not physical cores, so make sure hyper-threading is disabled or set the OMP_NUM_THREADS environment variable to the number of physical cores. Here I get 1.25 s (without TB), and 0.95 s with TB.
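The build and run steps described above would look roughly like this (the source filename, include path, and core count of 4 are assumptions; adjust for your machine):

```shell
# Sketch: build with OpenMP enabled, then pin the thread count to the
# number of physical cores. Filename, include path, and the value 4
# are assumptions -- adjust for your setup.
g++ -O3 -DNDEBUG -fopenmp -I/usr/local/include/eigen3 \
    MatProdEigen.cpp -o MatProdEigen
OMP_NUM_THREADS=4 ./MatProdEigen 1000 50000 300
```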

ggael
  • Good call on the multithreading: Matlab multithreading seems to explain most of the difference. When I use the "-singleCompThread" command line option, Matlab clocks in at ~2.4 sec, same as Eigen. – Mike Hughes Feb 11 '13 at 03:49
  • I can't think of any good reason why you would set OMP_NUM_THREADS to anything but the number of logical cores in this case. If you set it to the number of physical cores with HT enabled you'll just be idling 50% of the CPU... – quant Oct 30 '13 at 23:24
  • 1
    Basically, HT allows hiding memory latencies and improving instruction pipelining by running two threads on the same physical core. However, a well-implemented matrix product routine like Eigen's already occupies nearly 100% of the arithmetic units, meaning pipelining is already perfect and memory latencies are already well hidden. In this context, HT can only hurt performance. For instance, the two threads will concurrently access the same L1 cache, cancelling its benefit. – ggael Oct 31 '13 at 12:37
  • Last I checked, Eigen only used SSE while MKL used AVX. That's a factor-of-two loss. Since then, Haswell has come out with FMA. I don't know if MKL supports FMA, but if it does, that's potentially another factor of two (four total). My own GEMM code is over twice as fast as Eigen using OpenMP and FMA. – Z boson Feb 19 '14 at 14:08

The reason Matlab is faster is that it uses the Intel MKL. Eigen can use it too (see here), but of course you need to buy it.

That being said, there are a number of reasons Eigen can be slower. To compare Python vs Matlab vs Eigen, you'd really need to code three equivalent versions of an operation in the respective languages. Also note that Matlab caches results, so you'd really need to start from a fresh Matlab session to be sure its magic isn't fooling you.

Also, Matlab's MEX overhead is not nonexistent. The OP there reports that newer versions "fix" the problem, but I'd be surprised if all overhead has been eliminated completely.

rubenvb
  • For my particular case (using R2011b), the overhead of the MEX call is *not* the primary cause. To verify, I wrote a [pure C++ version](https://gist.github.com/michaelchughes/4747789) of this test, and it gave the same timings as what I clocked the MEX at (~2.4 sec). I also ran my [MEX version](https://gist.github.com/michaelchughes/4742878) so that it only did the IO, by commenting out the line that does the matrix product. This IO-only bit (all the overhead) ran in ~.001 sec. – Mike Hughes Feb 11 '13 at 03:56
  • Also the Matlab "caching results" bit doesn't seem to explain things either. I ran many fresh starts (using the command line interface) and all clock in at ~1.3 sec. It's clearly the multithreading (see @ggael's post below). – Mike Hughes Feb 11 '13 at 04:01

Eigen doesn't take advantage of the AVX instructions that were introduced by Intel with the Sandy Bridge architecture. This probably explains most of the performance difference between Eigen and MATLAB. I found a branch that adds support for AVX at https://bitbucket.org/benoitsteiner/eigen but as far as I can tell it has not been merged into the Eigen trunk yet.