9

I intend to multiply 2 matrices using the cache-friendly method ( that would lead to less number of misses)

I found out that this can be done with a cache friendly transpose function.

But I am not able to find this algorithm. Can I know how to achieve this?

Aakash Anuj
  • 3,773
  • 7
  • 35
  • 47

2 Answers2

7

The word you are looking for is thrashing. Searching for thrashing matrix multiplication in Google yields more results.

A standard multiplication algorithm for c = a*b would look like

void multiply(double[,] a, double[,] b, double[,] c)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++)
                C[i, j] += a[i, k] * b[k, j]; 
}

Basically, navigating the memory fastly in large steps is detrimental to performance. The access pattern for k in B[k, j] is doing exactly that. So instead of jumping around in the memory, we may rearrange the operations such that the most inner loops operate only on the second access index of the matrices:

void multiply(double[,] a, double[,] B, double[,] c)
{  
   for (i = 0; i < n; i++)
   {  
      double t = a[i, 0];
      for (int j = 0; j < n; j++)
         c[i, j] = t * b[0, j];

      for (int k = 1; k < n; k++)
      {
         double s = 0;
         for (int j = 0; j < n; j++ )
            s += a[i, k] * b[k, j];
         c[i, j] = s;
      }
   }
}

This was the example given on that page. However, another option is to copy the contents of B[k, *] into an array beforehand and use this array in the inner loop calculations. This approach is usually much faster than the alternatives, even if it involves copying data around. Even if this might seem counter-intuitive, please feel free to try for yourself.

void multiply(double[,] a, double[,] b, double[,] c)
{
    double[] Bcolj = new double[n];
    for (int j = 0; j < n; j++)
    {
        for (int k = 0; k < n; k++)
            Bcolj[k] = b[k, j];

        for (int i = 0; i < n; i++)
        {
            double s = 0;
            for (int k = 0; k < n; k++)
                s += a[i,k] * Bcolj[k];
            c[j, i] = s;
        }
   }
}
Cesar
  • 2,059
  • 25
  • 30
  • in your second code block, `c[i, j] = s;`, but it seems that `j` is not declared in that scope. – Shihao Xu Apr 18 '17 at 19:52
  • I'm wondering why this is the accepted answer, the inner loop over k is accessing a by column, totally wrong from performance point of view. – greywolf82 Feb 09 '18 at 18:18
  • 1
    The code is assuming a C-like language, where matrices are row-major. When accessing a matrix stored in row-major order using ```a[i,j]``` you should always make sure that ```j``` always changes faster than ```i``` if you want to maximize performance. – Cesar Feb 13 '18 at 19:09
  • the second code snippet is wrong – Karashevich B. Nov 10 '21 at 12:49
1

@Cesar's answer is not correct. For example, the inner loop

for (int k = 0; k < n; k++)
   s += a[i,k] * Bcolj[k];

goes through the i-th column of a.

The following code should ensure we always visit data row by row.

void multiply(const double (&a)[I][K], 
              const double (&b)[K][J], 
              double (&c)[I][J]) 
{
    for (int j=0; j<J; ++j) {
       // iterates the j-th row of c
       for (int i=0; i<I; ++i) {
         c[i][j] = 0;
       } 

       // iterates the j-th row of b
       for (int k=0; k<K; ++k) {
          double t = b[k][j];
          // iterates the j-th row of c
          // iterates the k-th row of a
          for (int i=0; i<I; ++i) {
            c[i][j] += a[i][k] * t;
          } 
       }
    }
}
o11c
  • 15,265
  • 4
  • 50
  • 75
Joe C
  • 2,757
  • 2
  • 26
  • 46
  • 2
    Your code is wrong too. The reset of c[i][j] could be totally optional (it depends if the caller reset the matrix to zero). In addition the loop over k starts from 1 but it should starts from zero. – greywolf82 Feb 09 '18 at 18:16
  • @greywolf82 c[i][j] needs to reset, because the accumulation of "c[i][j] += a[i][k] * t;" needs an initial value. "k starts from 0" is correct. fixed. – Joe C Feb 09 '18 at 22:23
  • Yes, I know but if the caller did a memset to zero for example, the loop is not needed. Add a comment in your code to clarify. – greywolf82 Feb 10 '18 at 06:33