
In a program I am working on, I need to multiply two matrices repeatedly. Because of the size of one of the matrices, this operation takes some time and I wanted to see which method would be the most efficient. The matrices have dimensions (m x n)*(n x p) where m = n = 3 and 10^5 < p < 10^6.

With the exception of Numpy, which I assume works with an optimized algorithm, every test consists of a simple implementation of the matrix multiplication:

C[i,j] = Σ_k A[i,k] * B[k,j]

Below are my various implementations:

Python

import numpy as np

def dot_py(A, B):
    m, n = A.shape
    p = B.shape[1]

    C = np.zeros((m, p))

    # naive triple loop: C[i,j] = sum over k of A[i,k]*B[k,j]
    for i in range(0, m):
        for j in range(0, p):
            for k in range(0, n):
                C[i,j] += A[i,k]*B[k,j]
    return C

Numpy

def dot_np(A,B):
    C = np.dot(A,B)
    return C

Numba

The code is the same as the Python one, but it is compiled just in time before being used:

import numba as nb

dot_nb = nb.jit(nb.float64[:,:](nb.float64[:,:], nb.float64[:,:]), nopython=True)(dot_py)

So far, each method call has been timed using the timeit module 10 times. The best result is kept. The matrices are created using np.random.rand(n,m).
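
For reference, here is a minimal sketch of how such a benchmark could be wired up (the harness itself is not shown in the question; the best-of-10 scheme and np.random.rand follow the description above, and the shapes match the (m x n)*(n x p) dimensions, with p = 10^5 as one example size):

import timeit
import numpy as np

m = n = 3
p = 100000  # one example size from the 10^5 < p < 10^6 range

A = np.random.rand(m, n)
B = np.random.rand(n, p)

# run each implementation 10 times and keep the best result
best_py = min(timeit.repeat(lambda: dot_py(A, B), repeat=10, number=1))
best_np = min(timeit.repeat(lambda: dot_np(A, B), repeat=10, number=1))
best_nb = min(timeit.repeat(lambda: dot_nb(A, B), repeat=10, number=1))

print(best_py, best_np, best_nb)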

C++

mat2 dot(const mat2& m1, const mat2& m2)
{
    int m = m1.rows_;
    int n = m1.cols_;
    int p = m2.cols_;

    mat2 m3(m,p);
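    // data_ is stored row-major: element (r, c) of an R x C matrix sits at data_[C*r + c]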

    for (int row = 0; row < m; row++) {
        for (int col = 0; col < p; col++) {
            for (int k = 0; k < n; k++) {
                m3.data_[p*row + col] += m1.data_[n*row + k]*m2.data_[p*k + col];
            }
        }
    }

    return m3;
}

Here, mat2 is a custom class that I defined and dot(const mat2& m1, const mat2& m2) is a friend function of this class. It is timed using QPF and QPC (QueryPerformanceFrequency and QueryPerformanceCounter) from Windows.h, and the program is compiled using MinGW with the g++ command. Again, the best time obtained from 10 executions is kept.

Results

[Plot: best execution time of each implementation as a function of p]

As expected, the simple Python code is slower but it still beats Numpy for very small matrices. Numba turns out to be about 30% faster than Numpy for the largest cases.

I am surprised with the C++ results, where the multiplication takes almost an order of magnitude more time than with Numba. In fact, I expected these to take a similar amount of time.

This leads to my main question: is this normal and, if not, why is C++ slower than Numba? I just started learning C++ so I might be doing something wrong. If so, what would be my mistake, or what could I do to improve the efficiency of my code (other than choosing a better algorithm)?

EDIT 1

Here is the header of the mat2 class.

#ifndef MAT2_H
#define MAT2_H

#include <iostream>

class mat2
{
private:
    int rows_, cols_;
    float* data_;

public: 
    mat2() {}                                   // (default) constructor
    mat2(int rows, int cols, float value = 0);  // constructor
    mat2(const mat2& other);                    // copy constructor
    ~mat2();                                    // destructor

    // Operators
    mat2& operator=(mat2 other);                // assignment operator

    float operator()(int row, int col) const;
    float& operator() (int row, int col);

    mat2 operator*(const mat2& other);

    // Operations
    friend mat2 dot(const mat2& m1, const mat2& m2);

    // Other
    friend void swap(mat2& first, mat2& second);
    friend std::ostream& operator<<(std::ostream& os, const mat2& M);
};

#endif

EDIT 2

As many suggested, using the -O3 optimization flag was the missing element to match Numba. Below are the new curves compared to the previous ones. The curve tagged v2 was obtained by switching the two inner loops and shows another 30% to 50% improvement.

[Plot: timings with -O3 (and the v2 curve with swapped inner loops) compared to the previous results]

JD80121
  • That is surprising...I can't imagine you will see extremely massive speedups but have you tried using compiler optimization flags such as `-O3`? Basic usage is `g++ *.cpp -std=c++11 -O3` – MS-DDOS Apr 10 '16 at 06:46
  • You could optimise the C++ a little. For example, in the inner loop you compute `p*row + col` on every iteration of `k`, even though its result does not change. This could be moved to the loop above. Check all the calculations in this way. Also ensure you are compiling with full optimisation (compiler dependent, but often `-O2`). It could be rewritten to use pointers instead, but optimisers usually do that for you. – cdarke Apr 10 '16 at 06:46
  • Also are you calling this c++ function _from_ python in any way or are you directly invoking a compiled program? – MS-DDOS Apr 10 '16 at 06:47
  • where's the code for mat2? does it have a move constructor? Not sure if this would speed up things a little. As @cdarke suggested there are a couple of other optimisations you can do. You should also try this with a couple of different compilers, too - with full optimisations. This is not relevant here, but generally you might want to consider `++a` instead of `a++` - the latter needs additional memory. – Marinos K Apr 10 '16 at 06:49
  • @TylerS I am calling the c++ function from the compiled program. I have not used the compiler flags, I will try them and let you know the result. – JD80121 Apr 10 '16 at 06:51
  • @cdarke: I'd hope the compiler would be able to optimize those repeated calculations away – Eric Apr 10 '16 at 06:55
  • @Eric: that's a hope, but no excuse for writing code in that way. A bit like expecting your wife to tidy-up after you :-) – cdarke Apr 10 '16 at 06:56
  • Expecting compiler optimizations is totally a reason to write code in a more readable way at the expense of unoptimized performance – Eric Apr 10 '16 at 06:58
  • @MarinosK There is no move constructor. I will try the optimization suggested by @cdarke (and others if possible) and see if there is any difference. I have already tried using `++a` but there was no noticeable change in execution time. – JD80121 Apr 10 '16 at 06:58
  • You have a trivial approach to the matrix multiplication, while the specialized libraries likely have a more optimized solution. See: https://en.wikipedia.org/wiki/Matrix_multiplication_algorithm –  Apr 10 '16 at 06:59
  • @JD80121 Could you also post the header or something for the `mat2` class/struct/etc, I'm wondering if you're not doing something funky there. – MS-DDOS Apr 10 '16 at 07:02
  • Look up cache miss, this is likely one of the places where your C++ fails. – Reblochon Masque Apr 10 '16 at 07:07
  • @cdarke try with a move constructor.. The copy in the end is almost certainly optimised anyway, but it's better to be explicit. – Marinos K Apr 10 '16 at 08:51
  • @reblochon-masque: for a 3x3 matrix like here, I don't think locality is a problem, and the Numba & C++ codes deal with indices in the same order. – B. M. Apr 10 '16 at 09:11
  • @BM: dimensions are `(m x n)*(n x p) where m = n = 3 and 10^5 < p < 10^6` – Reblochon Masque Apr 10 '16 at 09:37
  • C++ falls behind for sizes around `[3x3]x[3x100]`; * 4 or 8 bytes for float or long... does that match the size of L1 cache? – Reblochon Masque Apr 10 '16 at 09:43
  • @JD80121 Any update on using compiler optimizations? – MS-DDOS Apr 12 '16 at 04:40
  • @TylerS I updated my question (see the second edit) with the results using `-O3`. Is this what you are looking for? – JD80121 Apr 12 '16 at 18:04

4 Answers


Definitely use -O3 for optimization. This turns on vectorization, which should speed your code up significantly.

Numba is supposed to do that already.

zmbq

What I would recommend

If you want maximum efficiency, you should use a dedicated linear algebra library, the classic of which is the BLAS/LAPACK family. There are a number of implementations, e.g. Intel MKL. What you write yourself is NOT going to outperform these hyper-optimized libraries.

Matrix-matrix multiply is handled by the dgemm routine: d stands for double, ge for general, and mm for matrix-matrix multiply. If your problem has additional structure, a more specific function may be called for additional speedup.

Note that Numpy dot ALREADY calls dgemm! You're probably not going to do better.
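
As a quick illustration (not part of the original answer; it assumes SciPy is installed), you can call the BLAS dgemm routine directly through scipy.linalg.blas and check that it agrees with np.dot:

import numpy as np
from scipy.linalg.blas import dgemm  # thin wrapper around the BLAS dgemm routine

A = np.random.rand(3, 3)
B = np.random.rand(3, 100000)

C_np = np.dot(A, B)                  # dispatches to the optimized BLAS under the hood
C_blas = dgemm(alpha=1.0, a=A, b=B)  # explicit dgemm call: C = alpha * A @ B

print(np.allclose(C_np, C_blas))     # True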

Why your c++ is slow

Your classic, intuitive algorithm for matrix-matrix multiplication turns out to be slow compared to what's possible. Writing code that takes advantage of how processors cache data, etc., yields important performance gains. The point is, tons of smart people have devoted their lives to making matrix-matrix multiply extremely fast, and you should use their work rather than reinvent the wheel.

Matthew Gunn
  • Thanks for your answer! I knew that Numpy was using `dgemm` (in fact I have already taken a look at the Fortran code). I expected it to perform better for this reason. I used the O(n^3) algorithm for simplicity since I was already getting better results with it than with Numpy. Eventually, my code will contain more custom functions with nested loops that are not available in optimized libraries, and I now have a better idea of how I should implement them. – JD80121 Apr 10 '16 at 18:58
  • I think the optimized `dgemm` routines outperform naive implementations largely due to caching and other techniques that take advantage of how processors actually work, rather than the O(n^3) bit. I'm really not an expert on the details though. – Matthew Gunn Apr 10 '16 at 19:15

In your current implementation, the compiler is most likely unable to auto-vectorize the innermost loop because its size is 3. Also, m2 is accessed in a "jumpy" way. Swapping the loops so that iterating over p happens in the innermost loop will make it run faster (col will no longer cause "jumpy" data access) and the compiler should be able to do a better job (auto-vectorize).

for (int row = 0; row < m; row++) {
    for (int k = 0; k < n; k++) {
        for (int col = 0; col < p; col++) {
            m3.data_[p*row + col] += m1.data_[n*row + k] * m2.data_[p*k + col];
        }
    }
}

On my machine, the original C++ implementation for p=10^6 elements, built with the g++ dot.cpp -std=c++11 -O3 -o dot flags, takes 12 ms, and the above implementation with swapped loops takes 7 ms.

doqtor

You can still optimize these loops by improving memory access; your function could look like this (assuming the matrices are 1000x1000):

import numpy as np

CS = 10        # size of one block (chunk)
NCHUNKS = 100  # number of blocks per dimension (CS * NCHUNKS = 1000)

def dot_chunked(A, B):
    C = np.zeros((1000, 1000))

    # work on CS x CS blocks so that the active parts of A and B stay in cache
    for i in range(NCHUNKS):
        for j in range(NCHUNKS):
            for k in range(NCHUNKS):
                for ii in range(i*CS, (i+1)*CS):
                    for jj in range(j*CS, (j+1)*CS):
                        for kk in range(k*CS, (k+1)*CS):
                            C[ii, jj] += A[ii, kk] * B[kk, jj]
    return C

Explanation: the loops i and ii together obviously perform the same way as the single i loop did before, and the same holds for j and k. But this time, regions of A and B of size CS x CS can be kept in the cache (I guess) and used more than once.

You can play around with CS and NCHUNKS. For me, CS=10 and NCHUNKS=100 worked well. When using numba.jit, it accelerates the code from 7 s to 850 ms (note that I use 1000x1000 matrices, while the graphs above were run with (3x3)x(3x10^5), so it is a somewhat different scenario).
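
As a rough sketch of the numba.jit step mentioned above (the exact call is an assumption; it reuses the dot_chunked, CS and NCHUNKS definitions from the code block above):

import numba as nb
import numpy as np

# lazy nopython compilation; CS and NCHUNKS are captured as compile-time constants
dot_chunked_nb = nb.njit(dot_chunked)

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
C = dot_chunked_nb(A, B)  # first call triggers compilation, later calls are fast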

Ply