
Currently I'm working on a program that uses matrices. I came up with this nested loop to multiply two matrices:

// The matrices are 1-dimensional arrays
for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
        for (int k = 0; k < 4; k++)
            result[i * 4 + j] += M1[i * 4 + k] * M2[k * 4 + j];

The loop works. My question is: will this loop be slower compared to writing it all out manually like this:

result[0] = M1[0]*M2[0] + M1[1]*M2[4] + M1[2]*M2[8] + M1[3]*M2[12];
result[1] = M1[0]*M2[1] + M1[1]*M2[5] + M1[2]*M2[9] + M1[3]*M2[13];
result[2] = ... etc.

Because in the nested loop the array positions are calculated at run time, and in the second method they are not.

Thanks.

Hans
  • Not in any way that would justify the resulting code bloat. The compiler will unroll the loop if it judges it appropriate. – T.C. Aug 26 '14 at 19:56
  • I would expect a loop that small to be unrolled anyway, so it shouldn't affect performance. – genisage Aug 26 '14 at 19:56
  • Keep in mind that if you have larger size matrices you can get a significant performance improvement by re-ordering your loops to `i`, `k`, `j`. See http://stackoverflow.com/questions/7395556/why-does-the-order-of-loops-in-a-matrix-multiply-algorithm-affect-performance. This doesn't seem quite to be a dup of that one though. – Mark B Aug 26 '14 at 19:58
  • It will be slower because it has to iterate through the extra for loops. When you do loop unrolling, you save the overhead of those loops; the unrolled version takes less time. – Juniar Aug 26 '14 at 20:14
  • Given that your loops have a fixed, small number of iterations, the compiler is likely to unroll them anyway. Moreover, the code produced by the compiler may take into account spatial coherence and possibly even SSE to improve performance further than your manual loop unrolling. – Michael Aug 26 '14 at 20:26

4 Answers


As with so many things, "it depends", but in this instance I would expect the second, expanded form to perform just about the same. Any modern compiler will unroll appropriate loops for you and take care of it.

Two points perhaps worth making:

  1. The second approach is uglier, more error-prone, and tedious to write and maintain.

  2. This is a nice example of 'premature optimization' (AKA the root of all evil). Do you know that this section is a bottleneck? Is this really the most intensive part of the code? By optimizing so early, we incur everything in point #1 for what amounts to a hunch if we haven't benchmarked our code.

John Carter
    The OP is just a mock or blocked example of the MMM triple loop, it will not work for matrices other than 4x4 i.e. it is completely impractical – SkyWalker Aug 26 '14 at 20:12
  • Yes, I agree. I stepped outside the MMM problem space and was thinking a bit more generally about nested loops. – John Carter Aug 26 '14 at 20:20
  • Thanks. I will keep this in mind and measure my performance if I need optimization. – Hans Aug 26 '14 at 20:21

Your compiler might already do this; take a look at loop unrolling. Let the compiler do the guessing and the heavy work, stick to the clean code, and, as always, measure your performance.

erenon
  • Nope, it won't, because of aliasing... you have to tell the compiler that there is no aliasing, AFAIK – SkyWalker Aug 26 '14 at 20:00
  • @GiovanniAzua You can *unroll* these loops regardless of whether they alias. You just can't *reorder* the iterations without telling the compiler that they don't alias (assuming it's unable to deduce it on its own). – Mysticial Aug 26 '14 at 20:34
  • @Mysticial I didn't say you can't unroll the loops; I said that the compiler won't do it for you automatically without the appropriate `restrict` modifier. – SkyWalker Aug 27 '14 at 07:48

I don't think the loop will be slower. You are accessing the memory of the M1 and M2 arrays in the same way in both versions. If you want to make the "manual" version faster, then use scalar replacement and do the computation in registers, e.g.

 double M1_0 = M1[0];
 double M2_0 = M2[0];
 result[0] = M1_0*M2_0 + ...

but you can use scalar replacement within the loop as well. You can do it if you apply blocking and loop unrolling (in fact, your triple loop looks like a blocked version of the MMM).

What you are trying to do is speed up the program by improving locality, i.e. making better use of the memory hierarchy.

SkyWalker

Assuming that you are running the code on Intel processors or compatible (AMD), you may actually want to switch to assembly language for heavy matrix computations. Luckily, you have the Intel IPP library, which does the actual work for you using advanced processor technology and selects what is expected to be the fastest algorithm for your processor.

The IPP includes all the necessary matrix computations that you'd possibly need. The only problem you may encounter is the order in which you created your matrices. You may have to re-organize the order to make it easier to use the IPP functions you'd like to use.

Note that, in regard to your two code examples, the second one will be faster because it avoids the += operator, which is a read/modify/write cycle and generally slow (not only that, it requires the result matrix to be all zeroes to start with, whereas the second example does not require clearing the output first), although your matrices are likely to fit in the cache. Processors are optimized to read input data in sequence (a[0], a[1], a[2], a[3], ...) and also to write that data back in sequence. If you can write your algorithm to be as close as possible to such a sequence, all the better. Don't get me wrong, I know that matrix multiplication cannot be done entirely in sequence. But if you keep that in mind when optimizing, you'll achieve better results (e.g. changing the order in which your matrices are stored in memory could be one of them).

Alexis Wilke