
I'm starting with C++ at the moment and want to work with matrices and speed things up in general. I worked with Python + NumPy + OpenBLAS before, and thought C++ + Eigen + MKL might be faster, or at least not slower.

My C++ code:

#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>

using namespace std;
using namespace Eigen;

int main()
{
    int n = Eigen::nbThreads( );
    cout << "#Threads: " << n << endl;

    uint16_t size = 4000;
    MatrixXd a = MatrixXd::Random(size,size);

    clock_t start = clock ();
    PartialPivLU<MatrixXd> lu = PartialPivLU<MatrixXd>(a);

    double timeElapsed = double( clock() - start ) / CLOCKS_PER_SEC;
    cout << "Elapsed time is " << timeElapsed << " seconds." << endl;
}

My Python code:

import numpy as np
from time import time
from scipy import linalg as la

size = 4000

A = np.random.random((size, size))

t = time()
LU, piv = la.lu_factor(A)
print(time()-t)

My timings:

C++     2.4s
Python  1.2s

Why is C++ slower than Python?

I am compiling the C++ code using:

g++ main.cpp -o main -lopenblas -O3 -fopenmp  -DMKL_LP64 -I/usr/local/include/mkl/include

MKL is definitely working: if I disable it, the running time is around 13s.

I also tried C++ with OpenBLAS, which gives me around 2.4s as well.

Any ideas why C++ and Eigen are slower than numpy/scipy?

Wikunia
  • The code doing the actual work for NumPy is written in C. If you want to understand the speed difference, you need to look at the algorithms the different underlying libraries use. What language you write the program making the library call in is completely irrelevant. – Sven Marnach Sep 13 '17 at 20:51
  • Yes @SvenMarnach, I don't expect C++ to be faster; I'd just expect them to have similar speed, because both should use MKL or OpenBLAS in the back end. – Wikunia Sep 13 '17 at 20:53
  • What's the timing when compiling with `-DNDEBUG`? – Henri Menke Sep 13 '17 at 20:57
  • @HenriMenke it's 2.4s as well – Wikunia Sep 13 '17 at 20:58
  • What's your CPU? Here, on a 2.6GHz Core i7 (Haswell), I get 1.4s using Eigen only and no OpenMP, and 0.7s if using OpenMP. Of course you need to compile with -march=native to fully exploit your CPU. – ggael Sep 13 '17 at 21:20
  • @ggael Intel(R) Core(TM) i5-3320M CPU @ 2.60GHz. Looks like Henri Menke found the bug ;) – Wikunia Sep 13 '17 at 21:24
  • Well, the speed ratio between Eigen and MKL is also too much, make sure to compile with `-march=native` and Eigen 3.x to leverage the AVX instructions of your CPU. Since I get 0.7s here, you should get something between 1s and max 2s. – ggael Sep 13 '17 at 21:37

2 Answers


The timing is just measured wrong. That's a typical symptom of wall-clock time vs. CPU time: clock() reports CPU time summed over all threads, so a multi-threaded run looks slower than it actually is. When I use the system_clock from the <chrono> header, which measures wall-clock time, it "magically" becomes faster.

#define EIGEN_USE_MKL_ALL
#include <iostream>
#include <Eigen/Dense>
#include <Eigen/LU>
#include <chrono>

int main()
{
    int const n = Eigen::nbThreads( );
    std::cout << "#Threads: " << n << std::endl;

    int const size = 4000;
    Eigen::MatrixXd a = Eigen::MatrixXd::Random(size,size);

    auto start = std::chrono::system_clock::now();

    Eigen::PartialPivLU<Eigen::MatrixXd> lu(a);

    auto stop = std::chrono::system_clock::now();

    std::cout << "Elapsed time is "
              << std::chrono::duration<double>{stop - start}.count()
              << " seconds." << std::endl;
}

I compile with

icc -O3 -mkl -std=c++11 -DNDEBUG -I/usr/include/eigen3/ test.cpp

and get the output

#Threads: 1
Elapsed time is 0.295782 seconds.

Your Python version reports 0.399146080017 on my machine.


Alternatively, to obtain comparable timings you could use time.clock() (CPU time) in Python instead of time.time() (wall-clock time).
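Note that time.clock() was removed in Python 3.8; time.process_time() is the modern CPU-time counterpart. A minimal sketch of how the two clocks diverge (here using a sleep, which consumes wall-clock time but almost no CPU time):

```python
import time

wall_start = time.time()           # wall-clock time
cpu_start = time.process_time()    # CPU time (successor of time.clock())

time.sleep(0.5)                    # burns no CPU; only wall time passes

wall = time.time() - wall_start
cpu = time.process_time() - cpu_start

print(f"wall: {wall:.2f}s  cpu: {cpu:.2f}s")  # wall ≈ 0.50, cpu ≈ 0.00
```

In the multi-threaded LU case the effect goes the other way: the CPU clock accumulates time across every worker thread, so it can exceed the wall-clock time severalfold.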

Henri Menke

This is not a fair comparison. The Python routine is operating on float precision while the C++ code needs to crunch doubles. This exactly doubles the computation time.

>>> type(np.random.random_sample())
<type 'float'>

You should compare with MatrixXf instead of MatrixXd and your MKL code should be equally fast.
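As a sanity check, one can inspect which precision NumPy is actually using, and explicitly force single precision to match Eigen's MatrixXf:

```python
import numpy as np

A = np.random.random((4, 4))
print(A.dtype)               # float64: NumPy samples double precision by default

A32 = A.astype(np.float32)   # single precision, matching Eigen's MatrixXf
print(A32.dtype)             # float32
```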

Kaveh Vahedipour