How to optimize 4x4 matrix multiplication?

Question

I'm currently developing a CrossPlatform Graphic Engine, and the performance analysis says that I should optimize the matrixmultiplication.

Y check the matrices for modifications so I don't update the matrices if there is no change, but anyway, the world matrix multiplications are using a lot of processing percent.

Is there a way to do it faster using only c++ language tricks?

GRPMATRIX* GRPMATRIX::GetMulplicationMatrix(GRPMATRIX* a, GRPMATRIX* b)
{           
matrix[0][0] = a->matrix[0][0]*b->matrix[0][0]+a->matrix[1][0]*b->matrix[0][1]+a->matrix[2][0]*b->matrix[0][2]+a->matrix[3][0]*b->matrix[0][3];

matrix[0][1] = a->matrix[0][1]*b->matrix[0][0]+a->matrix[1][1]*b->matrix[0][1]+a->matrix[2][1]*b->matrix[0][2]+a->matrix[3][1]*b->matrix[0][3];
matrix[0][2] = a->matrix[0][2]*b->matrix[0][0]+a->matrix[1][2]*b->matrix[0][1]+a->matrix[2][2]*b->matrix[0][2]+a->matrix[3][2]*b->matrix[0][3];
matrix[0][3] = a->matrix[0][3]*b->matrix[0][0]+a->matrix[1][3]*b->matrix[0][1]+a->matrix[2][3]*b->matrix[0][2]+a->matrix[3][3]*b->matrix[0][3];

matrix[1][0] = a->matrix[0][0]*b->matrix[1][0]+a->matrix[1][0]*b->matrix[1][1]+a->matrix[2][0]*b->matrix[1][2]+a->matrix[3][0]*b->matrix[1][3];
matrix[1][1] = a->matrix[0][1]*b->matrix[1][0]+a->matrix[1][1]*b->matrix[1][1]+a->matrix[2][1]*b->matrix[1][2]+a->matrix[3][1]*b->matrix[1][3];
matrix[1][2] = a->matrix[0][2]*b->matrix[1][0]+a->matrix[1][2]*b->matrix[1][1]+a->matrix[2][2]*b->matrix[1][2]+a->matrix[3][2]*b->matrix[1][3];
matrix[1][3] = a->matrix[0][3]*b->matrix[1][0]+a->matrix[1][3]*b->matrix[1][1]+a->matrix[2][3]*b->matrix[1][2]+a->matrix[3][3]*b->matrix[1][3];

matrix[2][0] = a->matrix[0][0]*b->matrix[2][0]+a->matrix[1][0]*b->matrix[2][1]+a->matrix[2][0]*b->matrix[2][2]+a->matrix[3][0]*b->matrix[2][3];
matrix[2][1] = a->matrix[0][1]*b->matrix[2][0]+a->matrix[1][1]*b->matrix[2][1]+a->matrix[2][1]*b->matrix[2][2]+a->matrix[3][1]*b->matrix[2][3];
matrix[2][2] = a->matrix[0][2]*b->matrix[2][0]+a->matrix[1][2]*b->matrix[2][1]+a->matrix[2][2]*b->matrix[2][2]+a->matrix[3][2]*b->matrix[2][3];
matrix[2][3] = a->matrix[0][3]*b->matrix[2][0]+a->matrix[1][3]*b->matrix[2][1]+a->matrix[2][3]*b->matrix[2][2]+a->matrix[3][3]*b->matrix[2][3];

matrix[3][0] = a->matrix[0][0]*b->matrix[3][0]+a->matrix[1][0]*b->matrix[3][1]+a->matrix[2][0]*b->matrix[3][2]+a->matrix[3][0]*b->matrix[3][3];
matrix[3][1] = a->matrix[0][1]*b->matrix[3][0]+a->matrix[1][1]*b->matrix[3][1]+a->matrix[2][1]*b->matrix[3][2]+a->matrix[3][1]*b->matrix[3][3];
matrix[3][2] = a->matrix[0][2]*b->matrix[3][0]+a->matrix[1][2]*b->matrix[3][1]+a->matrix[2][2]*b->matrix[3][2]+a->matrix[3][2]*b->matrix[3][3];
matrix[3][3] = a->matrix[0][3]*b->matrix[3][0]+a->matrix[1][3]*b->matrix[3][1]+a->matrix[2][3]*b->matrix[3][2]+a->matrix[3][3]*b->matrix[3][3];

return this;
}

I don't do any FOR checks, no IF either, but I don't know if there might be a way to improve the performance or there is a dead end.

For anyone who is looking for something like this, after using gnasher answer, the code is like :

    float a00=a->matrix[0][0];
float a01=a->matrix[0][1];
float a02=a->matrix[0][2];
float a03=a->matrix[0][3];

float a10=a->matrix[1][0];
float a11=a->matrix[1][1];
float a12=a->matrix[1][2];
float a13=a->matrix[1][3];

float a20=a->matrix[2][0];
float a21=a->matrix[2][1];
float a22=a->matrix[2][2];
float a23=a->matrix[2][3];

float a30=a->matrix[3][0];
float a31=a->matrix[3][1];
float a32=a->matrix[3][2];
float a33=a->matrix[3][3];

float b00=b->matrix[0][0];
float b01=b->matrix[0][1];
float b02=b->matrix[0][2];
float b03=b->matrix[0][3];

float b10=b->matrix[1][0];
float b11=b->matrix[1][1];
float b12=b->matrix[1][2];
float b13=b->matrix[1][3];

float b20=b->matrix[2][0];
float b21=b->matrix[2][1];
float b22=b->matrix[2][2];
float b23=b->matrix[2][3];

float b30=b->matrix[3][0];
float b31=b->matrix[3][1];
float b32=b->matrix[3][2];
float b33=b->matrix[3][3];

matrix[0][0] = a00*b00+a10*b01+a20*b02+a30*b03;
matrix[0][1] = a01*b00+a11*b01+a21*b02+a31*b03;
matrix[0][2] = a02*b00+a12*b01+a22*b02+a32*b03;
matrix[0][3] = a03*b00+a13*b01+a23*b02+a33*b03;

matrix[1][0] = a00*b10+a10*b11+a20*b12+a30*b13;
matrix[1][1] = a01*b10+a11*b11+a21*b12+a31*b13;
matrix[1][2] = a02*b10+a12*b11+a22*b12+a32*b13;
matrix[1][3] = a03*b10+a13*b11+a23*b12+a33*b13;

matrix[2][0] = a00*b20+a10*b21+a20*b22+a30*b23;
matrix[2][1] = a01*b20+a11*b21+a21*b22+a31*b23;
matrix[2][2] = a02*b20+a12*b21+a22*b22+a32*b23;
matrix[2][3] = a03*b20+a13*b21+a23*b22+a33*b23;

matrix[3][0] = a00*b30+a10*b31+a20*b32+a30*b33;
matrix[3][1] = a01*b30+a11*b31+a21*b32+a31*b33;
matrix[3][2] = a02*b30+a12*b31+a22*b32+a32*b33;
matrix[3][3] = a03*b30+a13*b31+a23*b32+a33*b33;

to make it faster, use SSE2/AVX or other SIMD solutions http://stackoverflow.com/q/14967969/995714 http://stackoverflow.com/q/6617688/995714 http://stackoverflow.com/q/19806222/995714 ... if you need to do a lot of multiplications then multithreading also helps — phuclv, May 14 '15 at 12:12
I need to do it crossplatform, (pc, android, rpi) but thanks anyway — diego.martinez, May 14 '15 at 12:26
@AdamStelmaszczyk do you gain anything for such small 4 x 4 matrices? — vsoftco, May 14 '15 at 12:46
What was your performance analysis? and what percent of time was spent multiplying 4x4 matrices? I'm only asking because sometimes people are told their "hot spot" uses 10% of time, when there can easily be a sleeping giant elsewhere that they are not told about. — Mike Dunlavey, May 14 '15 at 12:47
Can you use an existing library for your matrix types? Something like Eigen would be a good choice. — Peter, May 14 '15 at 12:55
I assume these matrices are `double matrix[4][4]` and not a "jagged matrix", like `double **matrix` with separate allocations for each row? — Peter, May 14 '15 at 12:56
not jagged. And I need to compile in raspberry pi and I can't use externals libs due to company requirements. I tested gnasher729 improvement and worked like a charm. Thanks anyway to these comments :D — diego.martinez, May 14 '15 at 14:50
the graphic data is all in the gpu, so the cpu almost only works matrix mults. I'm still optimizating the gpu calls, but I get a nice result in nvidia perfmon. — diego.martinez, May 14 '15 at 14:51
@diego.martinez Hard to say, but it will be interesting to give it a shot. — Adam Stelmaszczyk, May 14 '15 at 16:51

score 3 · Accepted Answer · answered May 14 '15 at 12:56

3

One problem that you have is that at any assignment matrix [i][j] = ..., the compiler doesn't know that a and b are not pointing to this->matrix, so it must assume that elements of a and b are overwritten and needs to read them again.

You should get some improvement if you just write

b0 = b->matrix [0][0]; b1 = b->matrix [0][1]; ... matrix [0][0] = ...

b0 = b->matrix [1][0]; b1 = b->matrix [1][1]; ... matrix [1][0] = ...

etc.

Reading Peter's comments: If these matrices are actually arrays of pointers to arrays of doubles, that is an absolute performance killer. Just don't do it.

answered May 14 '15 at 12:56

gnasher729

51,477
5
75
98

You're right. In fact I would just have a local 4x4 matrix inside the routine, and copy it out before exit. – Mike Dunlavey May 14 '15 at 13:00
1

great! doing this it improved a 40%! I was doing 4000000 mults in 348 ms and now are 209 ms :D – diego.martinez May 14 '15 at 14:47
1

and no, they aren't array arrays, is plain float m[4][4]. – diego.martinez May 14 '15 at 14:48
would it be improvement if I use double instead of float? – diego.martinez May 14 '15 at 14:48
1

@diego.martinez Test it. – Adam Stelmaszczyk May 14 '15 at 16:19

How to optimize 4x4 matrix multiplication?

1 Answers1