How to speed up matrix multiplication in C++?

Question

I'm performing matrix multiplication with this simple algorithm. To be more flexible I used objects for the matricies which contain dynamicly created arrays.

Comparing this solution to my first one with static arrays it is 4 times slower. What can I do to speed up the data access? I don't want to change the algorithm.

 matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int j = 0; j < a.dim(); j++) {
   int sum = 0;
   for (int k = 0; k < a.dim(); k++)
    sum += a(i,k) * b(k,j);
   c(i,j) = sum;
  }

 return c;
}

EDIT

I corrected my Question avove! I added the full source code below and tried some of your advices:

swapped k and j loop iterations -> performance improvement
declared dim() and operator()() as inline -> performance improvement
passing arguments by const reference -> performance loss! why? so I don't use it.

The performance is now nearly the same as it was in the old porgram. Maybe there should be a bit more improvement.

But I have another problem: I get a memory error in the function mult_strassen(...). Why?
terminate called after throwing an instance of 'std::bad_alloc' what(): std::bad_alloc

OLD PROGRAM
main.c http://pastebin.com/qPgDWGpW

c99 main.c -o matrix -O3

NEW PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr

g++ main.cpp matrix.cpp -o matrix -O3.

EDIT

Here are some results. Comparison between standard algorithm (std), swapped order of j and k loop (swap) and blocked algortihm with block size 13 (block). alt text

Did you mean to write a matrix multiply that only works on square matrices? Multiply is defined as long as the inner dimensions are equal. — Ben Voigt, Nov 29 '10 at 03:46
You are passing a and b by reference, right? You're not copying two matrices just to call this function? — chrisaycock, Nov 29 '10 at 03:53
You could also use [Eigen](http://eigen.tuxfamily.org/), which is particularly well-tuned. (Don't let the LGPL license scare you - it's a header-only library, and the "viral" terms of the LGPL don't hold. See [the FAQ](http://eigen.tuxfamily.org/index.php?title=Licensing_FAQ).) — greyfade, Nov 29 '10 at 04:37
@chrisaycock: the copies are elided by the compiler, passing by value is as fast or faster than by reference. — Inverse, Nov 29 '10 at 08:35
@Inverse: That isn't universally true. Have you seen the copy constructor code for `class matrix`? If not, you're just making a wild guess. — Ben Voigt, Nov 29 '10 at 12:26
I don't want to invent the whell a second time. It's a project to test some matrix multiplication algorithm. I startet implementing it in C and got these performance loss now in C++. I don't want to use this algorithm, I just want to measure the performance. So I don't need a libray. — multiholle, Nov 29 '10 at 13:37
@multiholle: Can you past the code into the question. The links provided are no longer working for me. — Martin York, Nov 29 '10 at 17:37
@multiholle: OK. I have a copy of the code. That's like comparing apples to oranges. 1) Put the code in the same test harness. 2) Do the same operations (In the C++ version you are not just multiplying you are also allocating memory). — Martin York, Nov 29 '10 at 18:52
@multiholle: I found the main problem. Your C code has a huge buffer overrun. This is resetting the dim global variable causing your multiplication to exit early. — Martin York, Nov 29 '10 at 19:04
@Martin York: Sorry this was an error an made, because of testing different ranges. — multiholle, Nov 30 '10 at 11:00
You've got memory leaks all over the place in that code, which is a big part of the reason pass-by-value is faster than pass-by-const-reference. Is `mult_strassen` your implementation of the "block access pattern" I suggested, or have you not posted that code? — Ben Voigt, Dec 02 '10 at 04:28

score 39 · Answer 1 · answered Nov 29 '10 at 03:57

Speaking of speed-up, your function will be more cache-friendly if you swap the order of the k and j loop iterations:

matrix mult_std(matrix a, matrix b) {
 matrix c(a.dim(), false, false);
 for (int i = 0; i < a.dim(); i++)
  for (int k = 0; k < a.dim(); k++)
   for (int j = 0; j < a.dim(); j++)  // swapped order
    c(i,j) += a(i,k) * b(k,j);

 return c;
}

That's because a k index on the inner-most loop will cause a cache miss in b on every iteration. With j as the inner-most index, both c and b are accessed contiguously, while a stays put.

score 5 · Answer 2 · answered Nov 29 '10 at 03:47

Make sure that the members dim() and operator()() are declared inline, and that compiler optimization is turned on. Then play with options like -funroll-loops (on gcc).

How big is a.dim() anyway? If a row of the matrix doesn't fit in just a couple cache lines, you'd be better off with a block access pattern instead of a full row at-a-time.

score 4 · Answer 3 · answered Nov 29 '10 at 15:49

You say you don't want to modify the algorithm, but what does that mean exactly?

Does unrolling the loop count as "modifying the algorithm"? What about using SSE/VMX whichever SIMD instructions are available on your CPU? What about employing some form of blocking to improve cache locality?

If you don't want to restructure your code at all, I doubt there's more you can do than the changes you've already made. Everything else becomes a trade-off of minor changes to the algorithm to achieve a performance boost.

Of course, you should still take a look at the asm generated by the compiler. That'll tell you much more about what can be done to speed up the code.

Using blocking speeds up the algorithm, great! – multiholle Nov 30 '10 at 10:25 — multiholle, Nov 30 '10 at 10:25

EboMike · Answer 4 · 2010-11-29T18:14:32.840

3

Use SIMD if you can. You absolutely have to use something like VMX registers if you do extensive vector math assuming you are using a platform that is capable of doing so, otherwise you will incur a huge performance hit.
Don't pass complex types like matrix by value - use a const reference.
Don't call a function in each iteration - cache dim() outside your loops.
Although compilers typically optimize this efficiently, it's often a good idea to have the caller provide a matrix reference for your function to fill out rather than returning a matrix by type. In some cases, this may result in an expensive copy operation.

edited Nov 29 '10 at 18:14

answered Nov 29 '10 at 03:58

EboMike

76,846
14
164
167

VMX? Do we know that he's running on PowerPC? – jalf Nov 29 '10 at 15:43
Hence my comment, "assuming you are using a platform that is capable that is doing so". My vision is a bit clouded - I'm almost exclusively programming on PPC these days, and some of my answers here get strange looks. – EboMike Nov 29 '10 at 16:28
Ah, gotcha. I just wasn't sure because first you mentioned SIMD (in general), which made sense, and then you narrowed it down to VMX only. – jalf Nov 29 '10 at 17:36
Yeah, definitely a mistake on my part. I slightly edited the answer to make more sense :) Thanks for pointing that out, I know I'm living in a cave sometimes! – EboMike Nov 29 '10 at 18:15
1

By the way, not sure if you know this (but others who read your answer may not): the x86 "equivalent" of VMX is called SSE – jalf Nov 30 '10 at 12:24

score 2 · Answer 5 · answered Jul 18 '14 at 13:24

Here is my implementation of the fast simple multiplication algorithm for square float matrices (2D arrays). It should be a little faster than chrisaycock code since it spares some increments.

static void fastMatrixMultiply(const int dim, float* dest, const float* srcA, const float* srcB)
{
    memset( dest, 0x0, dim * dim * sizeof(float) );

    for( int i = 0; i < dim; i++ ) {
        for( int k = 0; k < dim; k++ ) 
        {
            const float* a = srcA + i * dim + k;
            const float* b = srcB + k * dim;
            float* c = dest + i * dim;

            float* cMax = c + dim;
            while( c < cMax ) 
            {   
                *c++ += (*a) * (*b++);
            }
        }
    }
}

score 1 · Accepted Answer · edited Jun 20 '20 at 09:12

Pass the parameters by const reference to start with:

matrix mult_std(matrix const& a, matrix const& b) {

To give you more details we need to know the details of the other methods used.
And to answer why the original method is 4 times faster we would need to see the original method.

The problem is undoubtedly yours as this problem has been solved a million times before.

Also when asking this type of question ALWAYS provide compilable source with appropriate inputs so we can actually build and run the code and see what is happening.

Without the code we are just guessing.

Edit

After fixing the main bug in the original C code (a buffer over-run)

I have update the code to run the test side by side in a fair comparison:

 // INCLUDES -------------------------------------------------------------------
 #include <stdlib.h>
 #include <stdio.h>
 #include <sys/time.h>
 #include <time.h>

 // DEFINES -------------------------------------------------------------------
 // The original problem was here. The MAXDIM was 500. But we were using arrays
 // that had a size of 512 in each dimension. This caused a buffer overrun that
 // the dim variable and caused it to be reset to 0. The result of this was causing
 // the multiplication loop to fall out before it had finished (as the loop was
 // controlled by this global variable.
 //
 // Everything now uses the MAXDIM variable directly.
 // This of course gives the C code an advantage as the compiler can optimize the
 // loop explicitly for the fixed size arrays and thus unroll loops more efficiently.
 #define MAXDIM 512
 #define RUNS 10

 // MATRIX FUNCTIONS ----------------------------------------------------------
 class matrix
 {
 public:
 matrix(int dim)
       : dim_(dim)
 {
         data_ = new int[dim_ * dim_];

 }

     inline int dim() const {
                         return dim_;
                 }
                 inline int& operator()(unsigned row, unsigned col) {
                         return data_[dim_*row + col];
                 }

                 inline int operator()(unsigned row, unsigned col) const {
                         return data_[dim_*row + col];
                 }

 private:
     int dim_;
     int* data_;
 };

// ---------------------------------------------------
 void random_matrix(int (&matrix)[MAXDIM][MAXDIM]) {
         for (int r = 0; r < MAXDIM; r++)
                 for (int c = 0; c < MAXDIM; c++)
                         matrix[r][c] = rand() % 100;
 }
 void random_matrix_class(matrix& matrix) {
         for (int r = 0; r < matrix.dim(); r++)
                 for (int c = 0; c < matrix.dim(); c++)
                         matrix(r, c) = rand() % 100;
 }

 template<typename T, typename M>
 float run(T f, M const& a, M const& b, M& c)
 {
         float time = 0;

         for (int i = 0; i < RUNS; i++) {
                 struct timeval start, end;
                 gettimeofday(&start, NULL);
                 f(a,b,c);
                 gettimeofday(&end, NULL);

                 long s = start.tv_sec * 1000 + start.tv_usec / 1000;
                 long e = end.tv_sec * 1000 + end.tv_usec / 1000;

                 time += e - s;
         }
         return time / RUNS;
 }
 // SEQ MULTIPLICATION ----------------------------------------------------------
  int* mult_seq(int const(&a)[MAXDIM][MAXDIM], int const(&b)[MAXDIM][MAXDIM], int (&z)[MAXDIM][MAXDIM]) {
          for (int r = 0; r < MAXDIM; r++) {
                  for (int c = 0; c < MAXDIM; c++) {
                          z[r][c] = 0;
                          for (int i = 0; i < MAXDIM; i++)
                                  z[r][c] += a[r][i] * b[i][c];
                  }
          }
  }
  void mult_std(matrix const& a, matrix const& b, matrix& z) {
          for (int r = 0; r < a.dim(); r++) {
                  for (int c = 0; c < a.dim(); c++) {
                          z(r,c) = 0;
                          for (int i = 0; i < a.dim(); i++)
                                  z(r,c) += a(r,i) * b(i,c);
                  }
          }
  }

  // MAIN ------------------------------------------------------------------------
  using namespace std;
  int main(int argc, char* argv[]) {
          srand(time(NULL));

          int matrix_a[MAXDIM][MAXDIM];
          int matrix_b[MAXDIM][MAXDIM];
          int matrix_c[MAXDIM][MAXDIM];
          random_matrix(matrix_a);
          random_matrix(matrix_b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_seq, matrix_a, matrix_b, matrix_c));

          matrix a(MAXDIM);
          matrix b(MAXDIM);
          matrix c(MAXDIM);
          random_matrix_class(a);
          random_matrix_class(b);
          printf("%d ", MAXDIM);
          printf("%f \n", run(mult_std, a, b, c));

          return 0;
  }

The results now:

$ g++ t1.cpp
$ ./a.exe
512 1270.900000
512 3308.800000

$ g++ -O3 t1.cpp
$ ./a.exe
512 284.900000
512 622.000000

From this we see the C code is about twice as fast as the C++ code when fully optimized. I can not see the reason in the code.

want speed? pass by value: http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/ — Inverse, Nov 29 '10 at 08:37
as the OP says in the edited question, const reference turned out to be slower. It's always worth trying, but it interacts with so many different compiler optimizations that it's far from a safe bet. Sometimes it's faster, sometimes it's slower. — jalf, Nov 29 '10 at 15:44
@jalf: I also agree with your comments in general, but in this specific case it is a problem with the code submitted. — Martin York, Nov 29 '10 at 17:35
@ Inverse: Please read the article more carefully. I totally agree with the article and the use of copy by value is really really really useful especially when the compiler is doing RVO and NRVO. BUT that is not relevant here as we need to make a new array to be returned (we can not optimize this away the result is a NEW array). — Martin York, Nov 29 '10 at 17:36
@Inverse: That article provides a one-line summary: Guideline: Don’t copy your function arguments. Instead, pass them by value and let the compiler do the copying. No copies of the function arguments are being made, therefore the article does NOT prescribe pass-by-value. — Ben Voigt, Nov 29 '10 at 21:06
I think the main problem is the overhead because of using classes. — multiholle, Nov 30 '10 at 19:51
@multiholle: In C++ there is hardly any overhead in using classes. — Entalpi, Aug 03 '16 at 09:59
The link by @Inverse does not work. This should: https://web.archive.org/web/20140205194657/http://cpp-next.com/archive/2009/08/want-speed-pass-by-value/ — Karan Shah, Oct 28 '20 at 14:23

score -1 · Answer 7 · answered Nov 29 '10 at 03:48

I'm taking a wild guess here, but if you dynamically allocating the matrices makes such a huge difference, maybe the problem is fragmentation. Again, I've no idea how the underlying matrix is implemented.

Why don't you allocate the memory for the matrices by hand, ensuring it's contiguous, and build the pointer structure yourself?

Also, does the dim() method have any extra complexity? I would declare it inline, too.

How to speed up matrix multiplication in C++?

7 Answers7

Edit

Linked

Related