1

I try to implement the Cannon's algorithm of matrix multiplication. I read description on the wikipedia that provides next pseudocode:

   row i of matrix a is circularly shifted by i elements to the left.
   col j of matrix b is circularly shifted by j elements up.
   Repeat n times:
       p[i][j] multiplies its two entries and adds to running total.
       circular shift each row of a 1 element left
       circular shift each col of b 1 element up

and I implemented it on the C# next way:

    public static void ShiftLeft(int[][] matrix, int i, int count)
    {
        int ind = 0;
        while (ind < count)
        {
            int temp = matrix[i][0];
            int indl = matrix[i].Length - 1;
            for (int j = 0; j < indl; j++)
                matrix[i][j] = matrix[i][j + 1];
            matrix[i][indl] = temp;
            ind++;
        }
    }
    public static void ShiftUp(int[][] matrix, int j, int count)
    {
        int ind = 0;
        while (ind < count)
        {
            int temp = matrix[0][j];
            int indl = matrix.Length - 1;
            for (int i = 0; i < indl; i++)
                matrix[i][j] = matrix[i + 1][j];
            matrix[indl][j] = temp;
            ind++;
        }
    }

    public static int[][] Cannon(int[][] A, int[][] B)
    {
        int[][] C = new int[A.Length][];
        for (int i = 0; i < C.Length; i++)
            C[i] = new int[A.Length];
        for (int i = 0; i < A.Length; i++)
            ShiftLeft(A, i, i);

        for (int i = 0; i < B.Length; i++)
            ShiftUp(B, i, i);

        for (int k = 0; k < A.Length; k++)
        {
            for (int i = 0; i < A.Length; i++)
            {
                for (int j = 0; j < B.Length; j++)
                {
                    var m = (i + j + k) % A.Length;
                    C[i][j] += A[i][m] * B[m][j];
                    ShiftLeft(A, i, 1);
                    ShiftUp(B, j, 1);
                }

            }
        };

        return C;

    }

this code return correct result, but do it very slowly. Much slowly even than naive algorithm of matrix multiplication.

For matrix 200x200 I got that result:

00:00:00.0490432 //naive algorithm
00:00:07.1397479 //Cannon's algorithm

What I am doing wrong?


Edit

Thanks SergeySlepov, it was bad attempt to do it parallel. When I back to sequential implementation I got next result:

Count       Naive               Cannon's

200    00:00:00.0492098    00:00:08.0465076

250    00:00:00.0908136    00:00:22.3891375

300    00:00:00.1477764    00:00:58.0640621

350    00:00:00.2639114    00:01:51.5545524

400    00:00:00.4323984    00:04:50.7260942

okay, it's not a parallel implementation, but how can I do it correctly?

FateOri
  • 79
  • 1
  • 9
  • 3
    You don't need to *actually* move the elements. Something like `A[(i + shift) % len]` will do the trick. – Sergey Slepov Apr 16 '17 at 17:38
  • 3
    This algorithm is a **distributed algorithm**. You're very unlikely to see improvements unless you have (a lot of) machines applying it to much larger matrices than 200x200. – IVlad Apr 16 '17 at 17:39
  • @SergeySlepov thanks, I thought about that, but can't got it (It seems it's very hard to do properly) – FateOri Apr 16 '17 at 17:51
  • @IVlad thanks, I know, but, for my opinion, difference is very large. Too large isn't it? – FateOri Apr 16 '17 at 17:55
  • Also, using `new int [N,N]` will be slightly more efficient than `new int [N] [N]` and will remove the need for `C[i] = new int[A.Length];` – Sergey Slepov Apr 16 '17 at 18:06
  • @SergeySlepov I'm not sure about that. [See this question](http://stackoverflow.com/questions/597720/what-are-the-differences-between-a-multidimensional-array-and-an-array-of-arrays) – FateOri Apr 16 '17 at 18:13
  • 1
    @SergeySlepov, yep, the speed decreased in two times. So, array of array win :) – FateOri Apr 16 '17 at 18:36
  • In the original algorithm, it is the two inner for loops that are executed in parallel (on an NxN processor grid), while the outer loop is sequential. Your implementation does the opposite and hence is likely to produce incorrect and unstable results due to race conditions. – Sergey Slepov Apr 16 '17 at 18:37
  • @Sergey Slepov perhaps, I don't know, can you give me a link to the original algorithm? (cause I thought, it was a original algorithm :)) Why do your think that there race conditions in this implementation? – FateOri Apr 16 '17 at 19:35
  • @FateOri: Your code has race conditions because the Parallel.For loop might create multiple thread which will access A, B and C concurrently. – Sergey Slepov Apr 16 '17 at 20:28
  • @SergeySlepov you're right, I understood it yesterday – FateOri Apr 17 '17 at 07:16

1 Answers1

1

Cannon's algorithm was built for a 'Distributed Memory Machine' (a grid of processors, each with its own memory). This is very different to the hardware you're running it on (a few processors with shared memory) and that is why you're not seeing any increase in performance.

The 'circular shifts' in the pseudocode that you quoted actually mimic data transfers between processors. After the initial matrix 'skewing', each processor in the grid keeps track of three numbers (a, b and c) and executes pseudocode similar to this:

c += a * b;
pass 'a' to the processor to your left (wrapping around)
pass 'b' to the processor to 'above' you (wrapping around)
wait for the next iteration of k

We could mimic this behaviour on a PC using NxN threads but the overhead of context switching (or spawning Tasks) would kill all the joy. To make the most of a PC's 4 (or so) CPUs we could make the loop over i parallel. The loop over k needs to be sequential (unlike your solution), otherwise you might face racing conditions as each iteration of k modifies the matrices A, B and C. In a 'distributed memory machine' race conditions are not a problem as processors do not share any memory.

Sergey Slepov
  • 1,861
  • 13
  • 33
  • Thanks for the answer. There something that still not clear for me, I mean this part - "To make the most of a PC's 4 (or so) CPUs we could make the loop over i parallel. The loop over k needs to be sequential (unlike your solution), otherwise you might face racing conditions as each iteration of k modifies the matrices A, B and C. ". Anyway, thanks - I will be test it tomorrow. – FateOri Apr 16 '17 at 20:22
  • What is not clear? If you could shape your thoughts into a question I might be able to help you. – Sergey Slepov Apr 16 '17 at 20:29
  • I mean, if I just change loop for "i" into the "Parallel.For" I'll still getting race conditions, isn't it? – FateOri Apr 17 '17 at 10:16