
I am working on an assignment where I transpose a matrix to reduce cache misses for a matrix multiplication operation. From what I understand from a few classmates, I should get an 8x improvement. However, I am only getting a 2x speedup ... what might I be doing wrong?

Full Source on GitHub

void transpose(int size, matrix m) {
    int i, j;
    for (i = 0; i < size; i++) 
        for (j = 0; j < size; j++) 
            std::swap(m.element[i][j], m.element[j][i]);
}

void mm(matrix a, matrix b, matrix result) {
    int i, j, k;
    int size = a.size;
    long long before, after;

    before = wall_clock_time();
    // Do the multiplication
    transpose(size, b); // transpose the matrix to reduce cache miss
    for (i = 0; i < size; i++)
        for (j = 0; j < size; j++) {
            int tmp = 0; // save memory writes
            for(k = 0; k < size; k++)
                tmp += a.element[i][k] * b.element[j][k];
            result.element[i][j] = tmp;
        }
    after = wall_clock_time();
    fprintf(stderr, "Matrix multiplication took %1.2f seconds\n", ((float)(after - before))/1000000000);
}

Am I doing things right so far?

FYI: The next optimization I need to do is use SIMD/Intel SSE3

Jiew Meng
  • I would also recommend you to consider the option of loop tiling so that the parts of both matrices are essentially cached decreasing the overall execution time for floating point operations. – Recker Oct 03 '12 at 04:45
  • @abhinole - good suggestion, but that's kinda tough to do when a matrix is represented as a ragged array, with rows individually allocated with `malloc`. – David Hammen Oct 03 '12 at 06:24
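The tiling idea Recker mentions can be sketched roughly as follows. This is a minimal illustration, not the assignment's API: it assumes a flat row-major `std::vector<int>` layout (not the question's ragged arrays), and the block size is a guess that you would tune to your cache.

```cpp
#include <algorithm>
#include <vector>

// Multiply two size x size matrices stored as flat row-major arrays,
// working in BLOCK x BLOCK tiles so each tile stays resident in cache.
constexpr int BLOCK = 64;  // assumption: tune this to your L1/L2 cache size

void mm_tiled(int size, const std::vector<int>& a,
              const std::vector<int>& b, std::vector<int>& result) {
    for (int ii = 0; ii < size; ii += BLOCK)
        for (int kk = 0; kk < size; kk += BLOCK)
            for (int jj = 0; jj < size; jj += BLOCK)
                // multiply the (ii,kk) tile of a by the (kk,jj) tile of b,
                // accumulating into the (ii,jj) tile of result
                for (int i = ii; i < std::min(ii + BLOCK, size); i++)
                    for (int k = kk; k < std::min(kk + BLOCK, size); k++) {
                        int tmp = a[i * size + k];  // reused across the j loop
                        for (int j = jj; j < std::min(jj + BLOCK, size); j++)
                            result[i * size + j] += tmp * b[k * size + j];
                    }
}
```

Note that `result` must be zero-initialized before the call, since the loops accumulate into it.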

2 Answers

11

Am I doing things right so far?

No. You have a problem with your transpose. You should have seen this problem before you started worrying about performance. When you are doing any kind of hacking around for optimizations, it is always a good idea to keep the naive but suboptimal implementation around as a test. An optimization that achieves a factor of 100 speedup is worthless if it doesn't yield the right answer.
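One way to do that testing is to keep a naive reference multiply around and compare against it. A minimal sketch, assuming flat row-major `std::vector<int>` storage (the helper names here are made up for illustration):

```cpp
#include <vector>

// Naive reference multiply: slow, but obviously correct.
std::vector<int> mm_naive(int size, const std::vector<int>& a,
                          const std::vector<int>& b) {
    std::vector<int> r(size * size, 0);
    for (int i = 0; i < size; i++)
        for (int j = 0; j < size; j++)
            for (int k = 0; k < size; k++)
                r[i * size + j] += a[i * size + k] * b[k * size + j];
    return r;
}

// Compare the optimized routine's output against the reference
// before trusting any timing numbers.
bool same_result(const std::vector<int>& expected,
                 const std::vector<int>& actual) {
    return expected == actual;
}
```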

Another optimization that will help is to pass by reference. You are passing copies. In fact, your matrix result may never get out because you are passing copies. Once again, you should have tested.
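A sketch of what pass-by-reference looks like here. The simplified matrix type (flat `std::vector` storage) is an assumption for illustration; the point is the `&` on the parameters: `a` and `b` are taken by const reference so no copy is made, and `result` by non-const reference so the product actually reaches the caller.

```cpp
#include <vector>

// Simplified matrix type for this sketch (flat row-major storage).
struct matrix {
    int size;
    std::vector<int> element;  // size * size entries, row-major
};

// const& : read-only, no copy; & : writes are visible to the caller.
void mm_ref(const matrix& a, const matrix& b, matrix& result) {
    int n = a.size;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            int tmp = 0;
            for (int k = 0; k < n; k++)
                tmp += a.element[i * n + k] * b.element[k * n + j];
            result.element[i * n + j] = tmp;
        }
}
```

In C, the equivalent is to pass pointers (`const matrix *a`, `matrix *result`) instead of references.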

Yet another optimization that will help is to cache some pointers. This is still quite slow:

for(k = 0; k < size; k++)
    tmp += a.element[i][k] * b.element[j][k];
result.element[i][j] = tmp;

An optimizer might see a way around the pointer aliasing problem, but probably not, at least not unless you use the nonstandard `__restrict__` keyword to tell the compiler that your matrices don't overlap. Cache row pointers so you don't have to re-evaluate `a.element[i]`, `b.element[j]`, and `result.element[i]` on every pass through the loops.
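A sketch of both suggestions combined, assuming the question's ragged `int**` representation and a `b` that has already been (correctly) transposed. `__restrict__` is the GCC/Clang spelling of the nonstandard extension:

```cpp
// Hoist the row lookups out of the inner loop and mark the rows as
// non-aliasing. Assumes b has already been transposed, so row j of b
// holds column j of the original matrix.
void mm_rows(int size, int** a, int** b, int** result) {
    for (int i = 0; i < size; i++) {
        const int* __restrict__ arow = a[i];      // fetched once per i
        int* __restrict__ rrow = result[i];
        for (int j = 0; j < size; j++) {
            const int* __restrict__ brow = b[j];  // fetched once per j
            int tmp = 0;
            for (int k = 0; k < size; k++)
                tmp += arow[k] * brow[k];
            rrow[j] = tmp;
        }
    }
}
```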

Addendum
After looking over the code, it needs help. A minor comment first. You aren't writing C++. Your code is C with a tiny hint of C++. You're using `struct` rather than `class`, `malloc` rather than `new`, `typedef struct` rather than plain `struct`, C headers rather than C++ headers.

Because of your implementation of your `struct matrix`, my comment on slowness due to copy constructors was incorrect. That it was incorrect is even worse! Using the implicitly-defined copy constructor in conjunction with classes or structs that contain naked pointers is playing with fire. You will get burned very badly if someone calls `mm(a, a, a_squared)` to get the square of matrix `a`. You will get burned even worse if someone expects `mm(a, a, a)` to do an in-place computation of a².

Mathematically, your code only covers a tiny portion of the matrix multiplication problem. What if someone wants to multiply a 100x1000 matrix by a 1000x200 matrix? That's perfectly valid, but your code doesn't handle it because your code only works with square matrices. On the other hand, your code will let someone multiply a 100x100 matrix by a 200x200 matrix, which doesn't make a bit of sense.

Structurally, your code has close to a 100% guarantee that it will be slow because of your use of ragged arrays. malloc can spritz the rows of your matrices all across memory. You'll get much better performance if the matrix is internally represented as a contiguous array but is accessed as if it were a NxM matrix. C++ provides some nice mechanisms for doing just that.
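One sketch of that "contiguous array accessed as if it were an NxM matrix" idea, done in C++ by overloading `operator()`. The class name and interface here are illustrative assumptions, not a prescribed design:

```cpp
#include <vector>

// One allocation for the whole matrix; 2D indexing is computed by hand.
class Matrix {
public:
    Matrix(int rows, int cols) : rows_(rows), cols_(cols), data_(rows * cols) {}
    // m(i, j) indexes into the single contiguous row-major block.
    int& operator()(int i, int j)       { return data_[i * cols_ + j]; }
    int  operator()(int i, int j) const { return data_[i * cols_ + j]; }
    int rows() const { return rows_; }
    int cols() const { return cols_; }
private:
    int rows_, cols_;
    std::vector<int> data_;  // contiguous storage, no per-row allocations
};
```

Because the rows are adjacent in memory, walking the matrix in row-major order touches consecutive cache lines instead of chasing `malloc`'d pointers scattered across the heap.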

David Hammen
  • +1 Nice catch on the transpose problem. I read right past it. – Joel Lee Oct 03 '12 at 04:35
  • Same as @JoelLee, +1 on the catch :) – Dig Oct 03 '12 at 05:37
  • Hmm, I think the confusion with C/C++ and all the low level memory stuff is due me not usually programming in lower level languages ... I'll try to read a bit more on it ... re – Jiew Meng Oct 03 '12 at 06:36
  • Regarding the thing about square matrixes only, I think thats the requirement for this assignment, tho I will verify that again. – Jiew Meng Oct 03 '12 at 06:39
  • About ur last comment on malloc allocating non contiguous memory. How do I make it contiguous then? – Jiew Meng Oct 03 '12 at 06:43
  • @JiewMeng - One allocation per matrix rather than `size+1` allocations. You'll have to implement the 2D indexing yourself rather than letting C do it for you. With C++, one way to do that is to overload `operator()`. – David Hammen Oct 03 '12 at 06:48
  • Oh I think I get what u mean: `allocate_matrix` allocates memory cell by cell? Perhaps I should do something like: `new float[size]` instead? – Jiew Meng Oct 03 '12 at 10:28
  • Cell by cell? No! You want one big chunk of contiguous memory for the whole matrix. The internal representation is `float * element;` and is allocated as `mem->element = malloc(sizeof(float) * size * size);` – David Hammen Oct 03 '12 at 10:35
  • Hmm I mean the existing code allocates space cell by cell (the code is actually given)? And you suggest, I allocate it as a whole? Hmm, the largest test case is `float[4K][4k]` matrix so it will require 256MB of contiguous space? Is that reasonable? – Jiew Meng Oct 03 '12 at 10:52
  • Just saw on http://stackoverflow.com/a/936702/292291, how a 2D array is allocated in C++, is that bad too? Looks very similar to the existing code – Jiew Meng Oct 03 '12 at 11:05
  • That's called a ragged array, or a jagged array. Very handy, but they come at a price, particularly if the matrix is large. There's a huge difference between `float **data` and `float data[100][100]`. Both use `data[i][j]` to access the i,j element. Internally, memory layout and the mechanisms to accomplish that indexing are very different between the two forms. Emulate the latter and your matrix multiplication will be considerably faster. – David Hammen Oct 03 '12 at 11:32
  • Ok will try that ... in the mean time, I do have an attempt to re-write from scratch using "C++" (in quotes because I'm not sure if I am doing it right still ...) https://gist.github.com/3824894#file_matrix2.cpp, its slower in fact (~7s) compared to the old version (C with little C++, ~4s). The provided unoptimized version took ~10s – Jiew Meng Oct 03 '12 at 12:03
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/17501/discussion-between-jiew-meng-and-david-hammen) – Jiew Meng Oct 03 '12 at 12:04
4

If your assignment implies that you MUST transpose, then, of course, you should correct your transpose procedure. As it stands, it swaps every off-diagonal pair TWICE, resulting in no transpose at all. The j-loop should not read

j=0; j<size; j++

but

j=0; j<i; j++
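Putting the fix together, the corrected in-place transpose visits only the elements below the diagonal, so each pair is swapped exactly once. (This sketch takes a bare `int**` rather than the question's `matrix` struct.)

```cpp
#include <utility>

// In-place transpose of a size x size matrix: for each pair (i, j) with
// j < i, swap the element with its mirror across the diagonal.
void transpose(int size, int** element) {
    for (int i = 0; i < size; i++)
        for (int j = 0; j < i; j++)   // strictly below the diagonal
            std::swap(element[i][j], element[j][i]);
}
```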

Transposing is not necessary to avoid processing the elements of one of the factor matrices in the "wrong" order. Just interchange the j-loop and the k-loop. Leaving aside for the moment any (other) performance tuning, the basic loop structure should be as follows (note that `result` must be zero-initialized first, since the inner loop now accumulates into it):

  for (int i=0; i<size; i++)
  {
    for (int k=0; k<size; k++)
    {
      double tmp = a[i][k];
      for (int j=0; j<size; j++)
      {
        result[i][j] += tmp * b[k][j];
      }
    }
  }
Bert te Velde