
I am working on a small math library for 3D graphics.

I'm not sure which approach costs more CPU/GPU time.

Right now I am using this to multiply two 4x4 matrices:

  tmpM.p[0][0] = matA.p[0][0] * matB.p[0][0] + matA.p[0][1] * matB.p[1][0] + matA.p[0][2] * matB.p[2][0] + matA.p[0][3] * matB.p[3][0];
  tmpM.p[0][1] = matA.p[0][0] * matB.p[0][1] + matA.p[0][1] * matB.p[1][1] + matA.p[0][2] * matB.p[2][1] + matA.p[0][3] * matB.p[3][1];
  tmpM.p[0][2] = matA.p[0][0] * matB.p[0][2] + matA.p[0][1] * matB.p[1][2] + matA.p[0][2] * matB.p[2][2] + matA.p[0][3] * matB.p[3][2];
  tmpM.p[0][3] = matA.p[0][0] * matB.p[0][3] + matA.p[0][1] * matB.p[1][3] + matA.p[0][2] * matB.p[2][3] + matA.p[0][3] * matB.p[3][3];

  tmpM.p[1][0] = matA.p[1][0] * matB.p[0][0] + matA.p[1][1] * matB.p[1][0] + matA.p[1][2] * matB.p[2][0] + matA.p[1][3] * matB.p[3][0];
  tmpM.p[1][1] = matA.p[1][0] * matB.p[0][1] + matA.p[1][1] * matB.p[1][1] + matA.p[1][2] * matB.p[2][1] + matA.p[1][3] * matB.p[3][1];
  tmpM.p[1][2] = matA.p[1][0] * matB.p[0][2] + matA.p[1][1] * matB.p[1][2] + matA.p[1][2] * matB.p[2][2] + matA.p[1][3] * matB.p[3][2];
  tmpM.p[1][3] = matA.p[1][0] * matB.p[0][3] + matA.p[1][1] * matB.p[1][3] + matA.p[1][2] * matB.p[2][3] + matA.p[1][3] * matB.p[3][3];

  tmpM.p[2][0] = matA.p[2][0] * matB.p[0][0] + matA.p[2][1] * matB.p[1][0] + matA.p[2][2] * matB.p[2][0] + matA.p[2][3] * matB.p[3][0];
  tmpM.p[2][1] = matA.p[2][0] * matB.p[0][1] + matA.p[2][1] * matB.p[1][1] + matA.p[2][2] * matB.p[2][1] + matA.p[2][3] * matB.p[3][1];
  tmpM.p[2][2] = matA.p[2][0] * matB.p[0][2] + matA.p[2][1] * matB.p[1][2] + matA.p[2][2] * matB.p[2][2] + matA.p[2][3] * matB.p[3][2];
  tmpM.p[2][3] = matA.p[2][0] * matB.p[0][3] + matA.p[2][1] * matB.p[1][3] + matA.p[2][2] * matB.p[2][3] + matA.p[2][3] * matB.p[3][3];

  tmpM.p[3][0] = matA.p[3][0] * matB.p[0][0] + matA.p[3][1] * matB.p[1][0] + matA.p[3][2] * matB.p[2][0] + matA.p[3][3] * matB.p[3][0];
  tmpM.p[3][1] = matA.p[3][0] * matB.p[0][1] + matA.p[3][1] * matB.p[1][1] + matA.p[3][2] * matB.p[2][1] + matA.p[3][3] * matB.p[3][1];
  tmpM.p[3][2] = matA.p[3][0] * matB.p[0][2] + matA.p[3][1] * matB.p[1][2] + matA.p[3][2] * matB.p[2][2] + matA.p[3][3] * matB.p[3][2];
  tmpM.p[3][3] = matA.p[3][0] * matB.p[0][3] + matA.p[3][1] * matB.p[1][3] + matA.p[3][2] * matB.p[2][3] + matA.p[3][3] * matB.p[3][3];

Is this a bad/slow idea?

Will it be more efficient to use a loop?
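
For comparison, the loop version I have in mind is below (just a sketch; it assumes the matrix type has a `float p[4][4]` member, matching the accesses above):

  // Loop version (sketch): the same row-by-column sums as the
  // written-out code above, expressed as a triple loop.
  for (int row = 0; row < 4; ++row) {
      for (int col = 0; col < 4; ++col) {
          float sum = 0.0f;
          for (int k = 0; k < 4; ++k)
              sum += matA.p[row][k] * matB.p[k][col];
          tmpM.p[row][col] = sum;
      }
  }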

  • I know this isn't the answer you're hoping for, but for optimization you really need to just **try it**. Many factors come into play beyond just the code you've written: your compiler, the build flags, and the target CPU, to name a few. – Drew Dormann Feb 05 '15 at 16:27
  • Take a look at the assembler produced. You may find that the generated code spends a lot of cycles on array dereferencing. A loop may be better because it helps the compiler optimize the address computations. – ChuckCottrill Feb 05 '15 at 16:28
  • I think I will test both the basic and the loop way and check the time each takes :D – Nacho Feb 05 '15 at 16:35
  • You should use SIMD if you want more efficient operations; see the sketch after these comments, and also:
    http://stackoverflow.com/questions/18499971/efficient-4x4-matrix-multiplication-c-vs-assembly
    http://stackoverflow.com/questions/21503882/efficient-sse-nxn-matrix-multiplication
    http://stackoverflow.com/questions/14967969/efficient-4x4-matrix-vector-multiplication-with-sse-horizontal-add-and-dot-prod
    http://stackoverflow.com/questions/8285852/how-do-i-perform-8-x-8-matrix-operation-using-sse
    http://stackoverflow.com/questions/6617688/speed-up-matrix-multiplication-by-sse-c
    – phuclv Feb 05 '15 at 16:48
  • Well, now I did a loop and... 1000 lines of assembly code with the basic version, 75 with the loop. All is said, I think? – Nacho Feb 05 '15 at 16:55
  • At the very least, you can bring down the number of multiplications; I remember you can do quite some tricks with 4x4 matrices. Unfortunately, I don't have my algorithm book at hand, so I can't look it up :-( Apart from that, a loop will likely show worse performance than this written-out code, simply because it adds the loop overhead. – cmaster - reinstate monica Feb 05 '15 at 17:06
  • @Nacho "lines of assembly" will not measure which code is more *costly*, unless the cost you're concerned about is executable size. – Drew Dormann Feb 05 '15 at 17:16
  • What kind of a datatype is the matrix? What type is `matA.p`? If you use the wrong type here, that's worse than a badly written matrix-matrix multiplication. – cmaster - reinstate monica Feb 05 '15 at 17:20
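
A minimal sketch of the SSE approach suggested in the comments above (it assumes row-major `float[4][4]` matrices that are 16-byte aligned; the function name is made up):

  #include <xmmintrin.h>

  // Each result row is a linear combination of the rows of B, weighted
  // by the entries of the corresponding row of A. This formulation avoids
  // the column gathers that make the naive pattern SIMD-unfriendly.
  void mat4MulSSE(float out[4][4], const float a[4][4], const float b[4][4])
  {
      __m128 b0 = _mm_load_ps(b[0]);
      __m128 b1 = _mm_load_ps(b[1]);
      __m128 b2 = _mm_load_ps(b[2]);
      __m128 b3 = _mm_load_ps(b[3]);
      for (int row = 0; row < 4; ++row) {
          __m128 r = _mm_mul_ps(_mm_set1_ps(a[row][0]), b0);
          r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row][1]), b1));
          r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row][2]), b2));
          r = _mm_add_ps(r, _mm_mul_ps(_mm_set1_ps(a[row][3]), b3));
          _mm_store_ps(out[row], r);
      }
  }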

1 Answer


It will mostly depend on what the compiler manages to figure out from it. If you can't time the operation in a context similar or identical to its actual use (which remains the best way to find out), then I would guess a loop with a functor (a functor object or a lambda) is the best bet for letting the compiler work out a cache-friendly unroll and cheap access to the operands. A half-decent modern compiler will also most likely vectorize it on the CPU.
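
A sketch of what I mean (the helper name is made up; it assumes the same `float p[4][4]` layout as in the question):

  // A tiny higher-order helper that the compiler can fully inline,
  // leaving it a plain loop nest to unroll and vectorize.
  template <typename Op>
  inline void forEachCell(Op op)
  {
      for (int row = 0; row < 4; ++row)
          for (int col = 0; col < 4; ++col)
              op(row, col);
  }

  // Usage: after inlining the lambda, the optimizer sees the whole
  // computation and can unroll and vectorize it as it sees fit.
  forEachCell([&](int row, int col) {
      float sum = 0.0f;
      for (int k = 0; k < 4; ++k)
          sum += matA.p[row][k] * matB.p[k][col];
      tmpM.p[row][col] = sum;
  });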

  • Point 1: The access pattern of a matrix-matrix multiplication is poison for SIMD. You'll likely spend more time shuffling values between registers than doing useful work. Point 2: A 4x4 matrix multiplication is not big enough to trigger caching issues; the operation will take place in level-1 cache only. Point 3: With a functor object / lambda, you rely on the compiler's ability to remove all the cruft that it adds. If you are lucky, you get the same performance as the OP's handwritten code; if you are unlucky, your compiler fails at that task and you get worse performance. – cmaster - reinstate monica Feb 05 '15 at 17:14
  • I know the row/column alternating pattern usually means jumping across cache lines a lot and shuffling registers, and depending on the type possibly straddling cache lines, but I have tested exactly what I described and got better performance out of it, precisely in a SIMD case on the CPU side, than with the operation rolled out manually. I don't doubt what you're saying, but then I'd like to figure out why I got better results when I should have got worse. Any suggestions on where to look? –  Feb 05 '15 at 17:23
  • You are right, SIMD may be faster than no SIMD, even though SIMD does not like the pattern. With my point 1, I just wanted to say that this is not exactly the use case that screams for SIMD. – cmaster - reinstate monica Feb 05 '15 at 17:59
  • Fair enough, I can agree with that. It's not an ideal SIMD case, sure, and the OP does mention GPU ops (and matrix ops on a CUDA-enabled device have books written about them), but as you say, if you can get some free vectorization instead of none at all, why not? Regarding point 2, how important would you say cache line boundaries are on the CPU side? –  Feb 05 '15 at 18:04
  • It may make sense to implement a modified matrix multiply, A·Bᵀ, i.e. with the second matrix implicitly transposed; see the sketch after these comments. This will be more SIMD-friendly. Then you either have the option of keeping some of the matrices in transposed order, or, if the coefficients are stored contiguously, a 4x4 matrix transpose costs only 6 swaps. –  Feb 05 '15 at 22:17
  • A matrix has 16 values, likely of type double, which gives it a size of 128 bytes. AFAIK that's a multiple of the cache line size in current processors. So if the matrices are aligned to cache lines, there shouldn't be any overhead; however, if they are misaligned, there could be significant overhead. Of course, it would help if the matrices were in a consecutive array in memory and used in a loop; that would guarantee no overhead due to cache line size. – cmaster - reinstate monica Feb 05 '15 at 22:31
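
To illustrate the transposed variant suggested a few comments up, a minimal sketch (type and function names hypothetical; assumes contiguous row-major storage of `float` coefficients):

  #include <utility>  // std::swap

  struct Mat4 { float p[4][4]; };

  // out = A * B, with the second operand supplied already transposed.
  // Every output element is then a dot product of two contiguous rows,
  // which maps onto SIMD loads much more cleanly than column gathers.
  void mulByTransposed(Mat4& out, const Mat4& a, const Mat4& bT)
  {
      for (int row = 0; row < 4; ++row)
          for (int col = 0; col < 4; ++col) {
              float sum = 0.0f;
              for (int k = 0; k < 4; ++k)
                  sum += a.p[row][k] * bT.p[col][k];  // both walk along rows
              out.p[row][col] = sum;
          }
  }

  // In-place 4x4 transpose: exactly 6 swaps of the off-diagonal pairs.
  void transpose(Mat4& m)
  {
      for (int i = 0; i < 4; ++i)
          for (int j = i + 1; j < 4; ++j)
              std::swap(m.p[i][j], m.p[j][i]);
  }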