3

I am reading a book on 2D game programming and am being walked through a 3x3 matrix class for linear transformations. The author has written a method for multiplying two 3x3 matrices as follows.

public Matrix3x3f mul(Matrix3x3f m1)
{
    return new Matrix3x3f(new float[][]
    {
        {
              this.m[0][0] * m1.m[0][0]     // M[0,0]
            + this.m[0][1] * m1.m[1][0]
            + this.m[0][2] * m1.m[2][0],
              this.m[0][0] * m1.m[0][1]     // M[0,1]
            + this.m[0][1] * m1.m[1][1]
            + this.m[0][2] * m1.m[2][1],
              this.m[0][0] * m1.m[0][2]     // M[0,2]
            + this.m[0][1] * m1.m[1][2]
            + this.m[0][2] * m1.m[2][2],
        },
        {
              this.m[1][0] * m1.m[0][0]     // M[1,0]
            + this.m[1][1] * m1.m[1][0]
            + this.m[1][2] * m1.m[2][0],
              this.m[1][0] * m1.m[0][1]     // M[1,1]
            + this.m[1][1] * m1.m[1][1]
            + this.m[1][2] * m1.m[2][1],
              this.m[1][0] * m1.m[0][2]     // M[1,2]
            + this.m[1][1] * m1.m[1][2]
            + this.m[1][2] * m1.m[2][2],
        },
        {
              this.m[2][0] * m1.m[0][0]     // M[2,0]
            + this.m[2][1] * m1.m[1][0]
            + this.m[2][2] * m1.m[2][0],
              this.m[2][0] * m1.m[0][1]     // M[2,1]
            + this.m[2][1] * m1.m[1][1]
            + this.m[2][2] * m1.m[2][1],
              this.m[2][0] * m1.m[0][2]     // M[2,2]
            + this.m[2][1] * m1.m[1][2]
            + this.m[2][2] * m1.m[2][2],
        },
    });
}

If I personally needed to write a method that does the same thing, I would have come up with a nested loop that performs all of these calculations automatically. I am assuming the author wrote it out longhand so that readers with little math background can follow along more easily.

Does this sound like a fair assumption, or could a nested-loop version of this method cause performance issues when it is called heavily in a loop where performance is vital?
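
For reference, this is the kind of nested-loop version I have in mind (just a sketch, assuming the same m field and float[][] constructor as the author's class):

// Sketch of a nested-loop equivalent; assumes the same m field and
// Matrix3x3f(float[][]) constructor used in the book's class.
public Matrix3x3f mul(Matrix3x3f m1)
{
    float[][] result = new float[3][3];
    for (int row = 0; row < 3; row++)
    {
        for (int col = 0; col < 3; col++)
        {
            float sum = 0.0f;
            for (int k = 0; k < 3; k++)
            {
                // Dot product of this matrix's row with m1's column.
                sum += this.m[row][k] * m1.m[k][col];
            }
            result[row][col] = sum;
        }
    }
    return new Matrix3x3f(result);
}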

Unl1ght
  • 107
  • 1
  • 6
  • I don't think you will get much performance benefit. Aside from that, if it's 3x3 then fine, but what if tomorrow you need to multiply 4x4 matrices? Will you extend the existing hand-written multiplication or look for an alternative? – SMA Dec 06 '14 at 09:32
  • 1
    That depends on the target platform and compiler/interpreter. On non-parallel systems the hard-coded version is usually faster if it is written the right way. On modern multi-pipeline, multi-core CPUs it does not matter so much (as long as the code is not too big for the cache). – Spektre Dec 06 '14 at 09:32
  • Also, for best performance in Java, transformation matrices should be stored in a 1D array, because a 2D array in Java is an array of references to its row arrays. See for example https://github.com/libgdx/libgdx/blob/master/gdx/src/com/badlogic/gdx/math/Matrix3.java (a sketch of that layout follows these comments). – xio4 Dec 06 '14 at 09:39
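
A minimal sketch of the flat 1-D layout mentioned in the last comment (hypothetical row-major indexing, not the actual libgdx code):

// Hypothetical flat-array variant: a 3x3 matrix stored row-major in a
// float[9], so element (row, col) lives at index row * 3 + col. This avoids
// the extra indirection of float[][] (an array of references to row arrays).
public class FlatMatrix3
{
    public final float[] m = new float[9];

    public FlatMatrix3 mul(FlatMatrix3 other)
    {
        FlatMatrix3 result = new FlatMatrix3();
        for (int row = 0; row < 3; row++)
        {
            for (int col = 0; col < 3; col++)
            {
                float sum = 0.0f;
                for (int k = 0; k < 3; k++)
                {
                    sum += this.m[row * 3 + k] * other.m[k * 3 + col];
                }
                result.m[row * 3 + col] = sum;
            }
        }
        return result;
    }
}

The index arithmetic replaces one level of reference indirection per row access and keeps the nine floats contiguous in memory.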

4 Answers

2

I think this is a performance decision. If you use a loop, the generated code contains a lot of jump instructions, since every iteration has to check a condition of the form "if cond goto ___". You should read this post on branch prediction and also learn a bit about computer architecture to understand how instructions affect performance; in this case I think you will also find caching interesting.

ZivS
  • 2,094
  • 2
  • 27
  • 48
1

From the looks of it, I think it's for clarity's sake, not for performance's sake. Consider the fact that it's Java code. There's object allocation in the return statement. If it were so performance critical that the conditional jump of a for-loop can't be afforded, the result would be written into a mutable instance.
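
For illustration, a hypothetical allocation-free variant could write the product into a caller-supplied destination instead of returning a new object (the mulInto name and dest parameter are my own sketch, not the book's API):

// Hypothetical allocation-free variant: writes the product into a
// caller-supplied destination matrix rather than allocating a new one.
// Assumes the same float[][] m field; dest must not be this or m1.
public void mulInto(Matrix3x3f m1, Matrix3x3f dest)
{
    for (int row = 0; row < 3; row++)
    {
        for (int col = 0; col < 3; col++)
        {
            float sum = 0.0f;
            for (int k = 0; k < 3; k++)
            {
                sum += this.m[row][k] * m1.m[k][col];
            }
            dest.m[row][col] = sum;
        }
    }
}

That keeps a tight render loop free of garbage, at the cost of a slightly clumsier call site.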

Barend
  • 17,296
  • 2
  • 61
  • 80
0

If the hard-coded operations are exactly the same as the operations performed by a loop, I can see no reason why the loop would be noticeably less efficient. In fact, for large loops (which is not the case here) a loop is far better than hard-coding, because:

  • some optimizations (such as loop unrolling) can be applied by the compiler and by the JVM's JIT at runtime
  • it makes for clearer code and less bytecode

I have heard that it can sometimes be better to hard-code the operations if the loop iterates over a very small range, but I don't think it is really worth doing.

Finally, for multiplying matrices, using a loop or not won't change much; what could speed up your calculations is a smarter algorithm such as dynamic programming. I don't know whether it's worth it for small matrices, but if I were you I would give it a try.
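
If in doubt, you can simply measure both variants. A minimal JMH microbenchmark sketch, assuming Matrix3x3f exposes both the book's hand-unrolled mul and a hypothetical nested-loop mulLoop:

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;

// Minimal JMH sketch: assumes Matrix3x3f has both the book's hand-unrolled
// mul() and a hypothetical nested-loop mulLoop(). Results depend heavily on
// the JIT, so measure on the target platform.
@State(Scope.Thread)
public class Matrix3x3fBenchmark
{
    private final Matrix3x3f a = new Matrix3x3f(new float[][]
    {
        { 1f, 2f, 3f }, { 4f, 5f, 6f }, { 7f, 8f, 9f }
    });
    private final Matrix3x3f b = new Matrix3x3f(new float[][]
    {
        { 9f, 8f, 7f }, { 6f, 5f, 4f }, { 3f, 2f, 1f }
    });

    @Benchmark
    public Matrix3x3f unrolled()
    {
        return a.mul(b);      // the book's hand-written version
    }

    @Benchmark
    public Matrix3x3f looped()
    {
        return a.mulLoop(b);  // hypothetical nested-loop version
    }
}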

Dici
  • 25,226
  • 7
  • 41
  • 82
  • Misleading answer; using a loop can cause serious performance overhead. If the memory blocks aren't aligned, looping might cause a cache miss, which halts the CPU for a few cycles until the correct block arrives. – ZivS Dec 06 '14 at 09:41
  • Loops are the most basic structure in any imperative language. If they were that harmful, they would not even exist – Dici Dec 06 '14 at 09:58
  • They are indeed basic structures, but their implementation in machine code (assembly) demands more CPU instructions than just ADD or MUL: CMP and JMP (and other variations of jump) are required. Say a CPU has 16 instructions in its pipeline; if a JNE instruction is one of them and the branch predictor made a wrong decision, 15 ADD instructions will be flushed and 15 cycles are lost... If the loop is very small this will not be that serious, but without a loop you still gain ~3 CPI. So loops aren't that harmful, but they can play a major role in performance. – ZivS Dec 06 '14 at 13:50
0

This is definitely done for performance reasons. Nested loops that have to increment the loop index and check whether the loop has ended make for a slower implementation. In computer graphics and CAD/CAM software, a 3x3 or 4x4 matrix multiplication is done for every rendering action, so the multiplication can easily run millions of times. Therefore, implementing 3x3 or 4x4 matrix multiplication without nested loops is common practice, especially in the older days when there was no such thing as a GPU. For matrices with more than 4 rows/columns, the nested-loop approach is still used.

fang
  • 3,473
  • 1
  • 13
  • 19