
I am trying to do numerical linear algebra computations in C++. I used Python NumPy for quick prototyping, and I would like to find a C++ linear algebra package for a further speedup. Eigen seems like a good place to start.

I wrote a small performance test using a large dense matrix multiplication to compare processing speed. In NumPy I was doing this:

import numpy as np
import time

# two 5000x5000 float64 matrices drawn from U[0, 1)
a = np.random.uniform(size=(5000, 5000))
b = np.random.uniform(size=(5000, 5000))
start = time.time()
c = np.dot(a, b)
print((time.time() - start) * 1000, 'ms')

In C++ with Eigen I was doing this:

#include <time.h>
#include <iostream>
#include "Eigen/Dense"

using namespace std;
using namespace Eigen;

int main() {
    // Note: MatrixXf is single precision, while NumPy's uniform() yields float64.
    MatrixXf a = MatrixXf::Random(5000, 5000);
    MatrixXf b = MatrixXf::Random(5000, 5000);
    clock_t start = clock();
    MatrixXf c = a * b;
    cout << (double)(clock() - start) / CLOCKS_PER_SEC * 1000 << "ms" << endl;
    return 0;
}
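One caveat I should note about this timing: clock() measures CPU time summed across threads, so with a multithreaded build it can report more than the actual wall time. A wall-clock variant of the same benchmark, as a minimal sketch using std::chrono, would be:

#include <chrono>
#include <iostream>
#include "Eigen/Dense"

int main() {
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(5000, 5000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(5000, 5000);
    auto start = std::chrono::steady_clock::now();
    Eigen::MatrixXf c = a * b;
    auto end = std::chrono::steady_clock::now();
    // elapsed wall time in milliseconds, independent of thread count
    std::chrono::duration<double, std::milli> elapsed = end - start;
    std::cout << elapsed.count() << "ms" << std::endl;
    return 0;
}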

I searched the documentation and Stack Overflow for compiler optimization flags. I tried to compile the program using this command:

g++ -g test.cpp -o test -Ofast -msse2

The C++ executable compiled with -Ofast runs about 30x or more faster than a plain unoptimized build. It returns the result in roughly 10000ms on my 2015 MacBook Pro.

Meanwhile, NumPy returns the result in about 1800ms.

I was expecting a performance boost from Eigen compared with NumPy, but it fell short of my expectation.

Are there any compile flags I missed that would further boost Eigen's performance here? Or is there a multithreading switch that can be turned on for an extra performance gain? I am just curious about this.

Thank you very much!

Edit on April 17, 2016:

After doing some searching based on @ggael's answer, I have come up with the answer to this question.

The best solution is to compile with Intel MKL linked as a backend for Eigen. For an OS X system the library can be found here. With MKL installed, I used the Intel MKL link line advisor to enable MKL backend support for Eigen.

I compile like this for full MKL enablement:

g++ -DEIGEN_USE_MKL_ALL -L${MKLROOT}/lib -lmkl_intel_lp64 -lmkl_core -lmkl_intel_thread -liomp5 -lpthread -lm -ldl -m64 -I${MKLROOT}/include -I. -Ofast -DNDEBUG test.cpp -o test

If there is any environment variable error for MKLROOT, just run the environment setup script provided in the MKL package, which is installed by default at /opt/intel/mkl/bin on my device.
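On my install that setup script is named mklvars.sh (the name may vary between MKL versions), so sourcing it looks roughly like:

source /opt/intel/mkl/bin/mklvars.sh intel64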

With MKL as the Eigen backend, the multiplication of two 5000x5000 matrices finishes in about 900ms on my 2.5GHz MacBook Pro. This is much faster than Python NumPy on my device.
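For reference, the EIGEN_USE_MKL_ALL macro can also be defined in the source file instead of on the command line, as long as it appears before any Eigen header is included:

// enable the MKL backend for all supported Eigen operations
#define EIGEN_USE_MKL_ALL
#include "Eigen/Dense"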

yc2986
  • Are you sure you are running the above test cases? For 500x500 matrices I get benchmarks of C++: 20ms, Python/NumPy: 310ms; for 5000x5000 matrices C++ is also an order of magnitude faster (with -Ofast). – Charles Pehlivanian Apr 16 '16 at 02:43
  • @CharlesPehlivanian I am using Python NumPy to calculate the 500x500 matrices and that gives me a 3ms running time, where Eigen is about 10ms. Still cannot get faster than NumPy. – yc2986 Apr 16 '16 at 05:42

2 Answers


To answer on the OS X side: first of all, recall that on OS X g++ is actually an alias for clang++, and the current Apple version of clang does not support OpenMP. Nonetheless, using Eigen 3.3-beta-1 and the default clang++, I get on a 2.6GHz MacBook Pro:

$ clang++ -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG  &&  ./a.out
2954.91ms

Then, to get support for multithreading, you need a recent clang or gcc compiler, for instance from Homebrew or MacPorts. Here, using gcc 5 from MacPorts, I get:

$ g++-mp-5 -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG -fopenmp -Wa,-q && ./a.out
804.939ms

and with clang 3.9:

$ clang++-mp-3.9 -mfma -I ../eigen so_gemm_perf.cpp  -O3 -DNDEBUG -fopenmp  && ./a.out
806.16ms

Remark that gcc on OS X does not know how to properly assemble AVX/FMA instructions, so you need to tell it to use the native assembler with the -Wa,-q flag.
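With OpenMP enabled, Eigen parallelizes its own matrix product, and the thread count can be capped through OMP_NUM_THREADS or at runtime. A minimal sketch (assuming an -fopenmp build):

#include <iostream>
#include "Eigen/Dense"

int main() {
    Eigen::setNbThreads(4);  // cap Eigen's parallel GEMM at 4 threads
    std::cout << "using " << Eigen::nbThreads() << " threads" << std::endl;
    Eigen::MatrixXf a = Eigen::MatrixXf::Random(5000, 5000);
    Eigen::MatrixXf b = Eigen::MatrixXf::Random(5000, 5000);
    Eigen::MatrixXf c = a * b;  // runs multithreaded
    return 0;
}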

Finally, with the devel branch, you can also tell Eigen to use any BLAS as a backend, for instance the one from Apple's Accelerate framework, as follows:

$ g++ -framework Accelerate -DEIGEN_USE_BLAS -O3 -DNDEBUG so_gemm_perf.cpp  -I ../eigen  && ./a.out
802.837ms
ggael
  • The -fopenmp flag really improved the performance: it is about 1200ms on my 2.5GHz MacBook Pro. As for using BLAS as the backend, I was having trouble including mkl.h in the compile. Thanks for the solution. Really helpful! – yc2986 Apr 17 '16 at 18:58
  • For AVX and the generic BLAS backend you need the devel branch (close to being the next release). – ggael Apr 24 '16 at 09:12

Compiling your little program with VC2013:

  • /fp:precise - 10.5s
  • /fp:strict - 10.4s
  • /fp:fast - 10.3s
  • /fp:fast /arch:AVX2 - 6.6s
  • /fp:fast /arch:AVX2 /openmp - 2.7s

So using AVX/AVX2 and enabling OpenMP is going to help a lot. You can also try linking against MKL (http://eigen.tuxfamily.org/dox/TopicUsingIntelMKL.html).
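For reference, a full VC2013 command line along those lines might look like this (assuming the Eigen headers sit in an eigen subdirectory; adjust /I to your layout):

cl /EHsc /O2 /fp:fast /arch:AVX2 /openmp /I eigen test.cpp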

kjpus
  • Are there any g++ or clang flags for OpenMP compilation in this case? What's more, does that mean Eigen is slower than Python NumPy? Thanks. – yc2986 Apr 16 '16 at 05:23
  • @yc2986: Try `g++ -Wall -Wextra -ffast-math -O3 -march=native -fopenmp`. You have to use openMP pragmas for `-fopenmp` to do anything, but maybe Eigen already does use them. See also [a recent SO answer about compiler flags](http://stackoverflow.com/questions/36605576/which-gcc-optimization-flags-should-i-use/36610637#36610637). – Peter Cordes Apr 16 '16 at 05:52
  • Not an expert on NumPy, but my guess is that it uses highly optimized libraries (BLAS/LAPACK) under the hood, so there shouldn't be much difference in terms of speed. – kjpus Apr 16 '16 at 05:55
  • @PeterCordes I am using g++ under OS X. When I try to compile the cpp using the -fopenmp flag, it tells me there is no such flag. Is there anything I should add to the source code to enable this compile flag? – yc2986 Apr 16 '16 at 06:13
  • @yc2986: Maybe gcc needs to be compiled with support for it? Either that or you're using an ancient gcc version, in which case you should upgrade because newer gcc makes better code. – Peter Cordes Apr 16 '16 at 06:23
  • Searching this site yields plenty of results: http://stackoverflow.com/questions/29057437/compile-openmp-programs-with-gcc-compiler-on-os-x-yosemite – kjpus Apr 16 '16 at 16:40