
Take the product of two 3x3 matrices, A*B=C. Naively, this requires 27 multiplications using the standard algorithm. If you were clever, you could do it using only 23 multiplications, a result found by Laderman in 1973. The technique involves saving intermediate steps and combining them in the right way.

Now let's fix a language and a type, say C++ with elements of type double. If the Laderman algorithm were hard-coded versus the simple nested-loop implementation, could we expect the performance of a modern compiler to edge out the differences of the algorithms?

Notes about this question: This is a programming site, and the question is asked in the context of the best practice for a time-critical inner loop; premature optimization this is not. Tips on implementation are greatly welcomed as comments.

phs
Hooked
  • "...could we expect the performance of a modern compiler to edge out the differences of the algorithms?" Why not try it out? Code the two up, run them each 1000 times and compare run times. – AndyPerfect May 31 '12 at 03:55
  • The general answer to that question is "no." Clever algorithms are still needed in the world. – phs May 31 '12 at 04:01
  • @phs: Answer to the question in the title, or to the question just above the note? They're opposite. – MSalters May 31 '12 at 08:12
  • There is also a technique due to Makarov which needs only 22 multiplications ([original in Russian, PDF](http://www.mathnet.ru/links/4c474953a5b87c840ad1d468b86738f8/zvmmf4056.pdf); [English translation, behind paywall](http://dx.doi.org/10.1016/0041-5553(86)90203-X)) – Jitse Niesen May 31 '12 at 08:23
  • @AndyPerfect Indeed I did just that, and the results were surprising. See my answer. As usual for benchmarking questions, the answer is, "it depends". – Hooked May 31 '12 at 15:09
  • Late comment: the advantages of Laderman's algorithm only become visible if it is used recursively for larger matrices. This was analyzed experimentally by Paolo D'Alberto on his [FastMMW site](http://www.fastmmw.com/2017/10/29/my-first-23-product-fast-matrix-multiplication-3x3x3/). In terms of multiplication complexity, Laderman's algorithm, O(n^2.854), is better than the naive O(n^3) but inferior to Strassen's algorithm, O(n^2.807). A 3x3 algorithm with 21 multiplications, O(n^2.77), would be superior to Strassen's. – Axel Kemper Jul 02 '18 at 08:48
  • A 21-multiplication algorithm exists now, though it requires commutativity: https://arxiv.org/abs/1904.07683 – Валерий Заподовников Oct 06 '22 at 06:32

4 Answers


The key is mastering the instruction set of your platform, so it depends on your platform. There are several techniques, and when you need the maximum possible performance, your compiler will come with profiling tools, some of which have optimization hinting built in. For the finest-grained operations, look at the assembler output and see whether there are any improvements to be made at that level as well.

Single-instruction, multiple-data (SIMD) instructions perform the same operation on several operands in parallel, so you can take

double a,b,c,d;
double w = d + a; 
double x = a + b;
double y = b + c;
double z = c + d;

and replace it with

double256 dabc = pack256(d, a, b, c);
double256 abcd = pack256(a, b, c, d);
double256 wxyz = dabc + abcd;

so that the four values are loaded into a single 256-bit-wide register, on some fictional platform with 256-bit-wide registers (four doubles = 256 bits).
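
On x86 with AVX, the same idea looks roughly like this (a minimal sketch, not part of the original answer; it assumes an AVX-capable CPU and a compiler switch such as -mavx, and the function name add4 is only for illustration):

#include <immintrin.h>  // AVX intrinsics

// Computes w = d+a, x = a+b, y = b+c, z = c+d with a single vector addition.
void add4(double a, double b, double c, double d, double out[4]) {
  // _mm256_set_pd takes its arguments from the highest lane down to the lowest,
  // so these registers hold (d, a, b, c) and (a, b, c, d) in lanes 0..3.
  __m256d dabc = _mm256_set_pd(c, b, a, d);
  __m256d abcd = _mm256_set_pd(d, c, b, a);
  __m256d wxyz = _mm256_add_pd(dabc, abcd);  // four additions in one instruction
  _mm256_storeu_pd(out, wxyz);               // out = {w, x, y, z}
}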

Floating point is an important consideration; some DSPs are significantly faster when operating on integers. GPUs tend to be great at floating point, although some are about twice as fast at single precision as at double precision. The 3x3 case of this problem fits into a single CUDA thread, so you could stream 256 of these calculations simultaneously.

Pick your platform, read the documentation, implement several different methods and profile them.

totowtwo

Timing Tests:

I ran the timing tests myself and the results surprised me (which is why I asked the question in the first place). The short of it is that under a standard compile the Laderman version is ~225% faster, but with the -O3 optimization flag it is 50% slower! When using -O3 I had to add a random element to the matrix on each iteration, or the compiler optimized the simple multiplication away entirely, reporting a time of zero to within clock precision. Since the Laderman algorithm was a pain to check and double-check, I'll post the complete code below for posterity.

Specs: Ubuntu 12.04, Dell Precision T1600, gcc. Relative difference in timings, (t_simple - t_laderman)/t_laderman, three runs each:

  • g++ [2.22, 2.23, 2.27]
  • g++ -O3 [-0.48, -0.49, -0.48]
  • g++ -funroll-loops -O3 [-0.48, -0.48, -0.47] 

Benchmarking code along with Laderman implementation:

#include <iostream>
#include <ctime>
#include <cstdio>
#include <cstdlib>
using namespace std;

void simple_mul(const double a[3][3],
                const double b[3][3],
                double c[3][3]) {
  int i, j, m;
  for (i = 0; i < 3; i++) {
    for (j = 0; j < 3; j++) {
      c[i][j] = 0;
      for (m = 0; m < 3; m++)
        c[i][j] += a[i][m] * b[m][j];
    }
  }
}

void laderman_mul(const double a[3][3], 
           const double b[3][3],
           double c[3][3]) {

   double m[24]; // not off by one, just wanted to match the index from the paper

   m[1 ]= (a[0][0]+a[0][1]+a[0][2]-a[1][0]-a[1][1]-a[2][1]-a[2][2])*b[1][1];
   m[2 ]= (a[0][0]-a[1][0])*(-b[0][1]+b[1][1]);
   m[3 ]= a[1][1]*(-b[0][0]+b[0][1]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][2]);
   m[4 ]= (-a[0][0]+a[1][0]+a[1][1])*(b[0][0]-b[0][1]+b[1][1]);
   m[5 ]= (a[1][0]+a[1][1])*(-b[0][0]+b[0][1]);
   m[6 ]= a[0][0]*b[0][0];
   m[7 ]= (-a[0][0]+a[2][0]+a[2][1])*(b[0][0]-b[0][2]+b[1][2]);
   m[8 ]= (-a[0][0]+a[2][0])*(b[0][2]-b[1][2]);
   m[9 ]= (a[2][0]+a[2][1])*(-b[0][0]+b[0][2]);
   m[10]= (a[0][0]+a[0][1]+a[0][2]-a[1][1]-a[1][2]-a[2][0]-a[2][1])*b[1][2];
   m[11]= a[2][1]*(-b[0][0]+b[0][2]+b[1][0]-b[1][1]-b[1][2]-b[2][0]+b[2][1]);
   m[12]= (-a[0][2]+a[2][1]+a[2][2])*(b[1][1]+b[2][0]-b[2][1]);
   m[13]= (a[0][2]-a[2][2])*(b[1][1]-b[2][1]);
   m[14]= a[0][2]*b[2][0];
   m[15]= (a[2][1]+a[2][2])*(-b[2][0]+b[2][1]);
   m[16]= (-a[0][2]+a[1][1]+a[1][2])*(b[1][2]+b[2][0]-b[2][2]);
   m[17]= (a[0][2]-a[1][2])*(b[1][2]-b[2][2]);
   m[18]= (a[1][1]+a[1][2])*(-b[2][0]+b[2][2]);
   m[19]= a[0][1]*b[1][0];
   m[20]= a[1][2]*b[2][1];
   m[21]= a[1][0]*b[0][2];
   m[22]= a[2][0]*b[0][1];
   m[23]= a[2][2]*b[2][2];

  c[0][0] = m[6]+m[14]+m[19];
  c[0][1] = m[1]+m[4]+m[5]+m[6]+m[12]+m[14]+m[15];
  c[0][2] = m[6]+m[7]+m[9]+m[10]+m[14]+m[16]+m[18];
  c[1][0] = m[2]+m[3]+m[4]+m[6]+m[14]+m[16]+m[17];
  c[1][1] = m[2]+m[4]+m[5]+m[6]+m[20];
  c[1][2] = m[14]+m[16]+m[17]+m[18]+m[21];
  c[2][0] = m[6]+m[7]+m[8]+m[11]+m[12]+m[13]+m[14];
  c[2][1] = m[12]+m[13]+m[14]+m[15]+m[22];
  c[2][2] = m[6]+m[7]+m[8]+m[9]+m[23];    
}

int main() {
  int N = 1000000000;
  double A[3][3], C[3][3];
  std::clock_t t0, t1;

  A[0][0] = 3/5.; A[0][1] = 1/5.; A[0][2] = 2/5.;
  A[1][0] = 3/7.; A[1][1] = 1/7.; A[1][2] = 3/7.;
  A[2][0] = 1/3.; A[2][1] = 1/3.; A[2][2] = 1/3.;

  t0 = std::clock();
  for(int i=0;i<N;i++) {
    // A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3
    simple_mul(A,A,C);
  }
  t1 = std::clock();
  double tdiff_simple = (t1-t0)/1000.;

  cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
  cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
  cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
  cout << tdiff_simple << endl;
  cout << endl;

  t0 = std::clock();
  for(int i=0;i<N;i++) {
    // A[0][0] = double(rand())/RAND_MAX; // Keep this in for -O3
    laderman_mul(A,A,C);
  }
  t1 = std::clock();
  double tdiff_laderman = (t1-t0)/1000.;

  cout << C[0][0] << ' ' << C[0][1] << ' ' << C[0][2] << endl;
  cout << C[1][0] << ' ' << C[1][1] << ' ' << C[1][2] << endl;
  cout << C[2][0] << ' ' << C[2][1] << ' ' << C[2][2] << endl;
  cout << tdiff_laderman << endl;
  cout << endl;

  double speedup = (tdiff_simple-tdiff_laderman)/tdiff_laderman;
  cout << "Approximate speedup: " << speedup << endl;

  return 0;
}
Hooked
  • Have you tried using more optimization options, like -funroll-loops or using sse? – PlasmaHH May 31 '12 at 15:15
  • No, but I'll give them a shot (I'm new to optimizations), any more you suggest? – Hooked May 31 '12 at 15:18
  • Playing with various things like command-line options (just read through the manpage and try things out; gcc also has an attribute to set optimization flags, so you can maybe hack something up that tests multiple such flags in one binary) and the __restrict gcc extension on http://gcc.godbolt.org often gives a nice idea of what is going on. – PlasmaHH May 31 '12 at 15:37
  • @PlasmaHH I've edited the answer to reflect the timings with unroll, no real change. I suspect that whatever flags I implement the simple version may end up being faster, though I'll play around with it a bit. – Hooked May 31 '12 at 15:40
  • 21 multiplications from this paper https://arxiv.org/abs/1904.07683: https://gcc.godbolt.org/z/9oszvM9aG – Валерий Заподовников Oct 06 '22 at 09:21

Although the question mentioned C++, I implemented 3x3 matrix multiplication C=A*B in C# (.NET 4.5) and ran some basic timing tests on my 64-bit Windows 7 machine with optimizations enabled. 10,000,000 multiplications took about

  1. 0.556 seconds with a naive implementation and
  2. 0.874 seconds with the Laderman code from the other answer.

Interestingly, the Laderman code was slower than the naive version. I didn't investigate with a profiler, but my guess is that the extra allocations cost more than the few multiplications saved.

It seems current compilers are smart enough to do those optimizations for us, which is good. Here's the naive code I used, for your interest:

    public static Matrix3D operator *(Matrix3D a, Matrix3D b)
    {
        double c11 = a.M11 * b.M11 + a.M12 * b.M21 + a.M13 * b.M31;
        double c12 = a.M11 * b.M12 + a.M12 * b.M22 + a.M13 * b.M32;
        double c13 = a.M11 * b.M13 + a.M12 * b.M23 + a.M13 * b.M33;
        double c21 = a.M21 * b.M11 + a.M22 * b.M21 + a.M23 * b.M31;
        double c22 = a.M21 * b.M12 + a.M22 * b.M22 + a.M23 * b.M32;
        double c23 = a.M21 * b.M13 + a.M22 * b.M23 + a.M23 * b.M33;
        double c31 = a.M31 * b.M11 + a.M32 * b.M21 + a.M33 * b.M31;
        double c32 = a.M31 * b.M12 + a.M32 * b.M22 + a.M33 * b.M32;
        double c33 = a.M31 * b.M13 + a.M32 * b.M23 + a.M33 * b.M33;
        return new Matrix3D(
            c11, c12, c13,
            c21, c22, c23,
            c31, c32, c33);
    }

where Matrix3D is an immutable struct (readonly double fields).

The tricky thing is to come up with a valid benchmark where you measure your code and not what the compiler did with your code (a debug build with tons of extra stuff, or a build where your actual code was optimized away because the result was never used). I usually try to "touch" the result so that the compiler cannot remove the code under test (e.g. check the matrix elements for equality with 89038.8989384 and throw if they are equal). In the end, though, I'm not even sure whether the compiler optimizes this comparison out of the way :)
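
In C++ terms, applied to the benchmark loop from the answer above, the same trick might look like this (a minimal sketch reusing simple_mul, A, C and N from that code; the checksum variable is only for illustration):

double checksum = 0.0;
for (int i = 0; i < N; i++) {
  simple_mul(A, A, C);
  checksum += C[0][0];  // "touch" the result so the call cannot be optimized away
}
// Use the checksum (e.g. print it) so the whole dependency chain stays observable.
std::cout << "checksum: " << checksum << std::endl;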

Patrick Stalph

I expect the main performance concern would be memory latency. A double[9] is typically 72 bytes. That's already a nontrivial amount, and you're using three of them.

MSalters
  • Is 72 bytes too big to fit into L1 cache? I was under the impression that you could fit kilobytes of data into cache, and that it was extremely fast. I don't know much about the timing of various operations in a CPU; am I missing something here? – Dan May 31 '12 at 08:35
  • @Dan: It will fit easily, but will take multiple cache lines. The main concern is what happens when the matrix isn't in cache. In that case, you'll probably have 2*64 byte loads from main memory (since caches read entire lines). – MSalters May 31 '12 at 08:56