Although the question mentioned C++, I implemented the 3x3 matrix multiplication C = A*B in C# (.NET 4.5) and ran some basic timing tests on my 64-bit Windows 7 machine with optimizations enabled. 10,000,000 multiplications took about
- 0.556 seconds with a naive implementation and
- 0.874 seconds with the Laderman code from the other answer.
Interestingly, the Laderman code was slower than the naive way. I didn't investigate with a profiler, but my guess is that the extra allocations cost more than the few multiplications it saves.
It seems current compilers are smart enough to optimize the naive version for us, which is good. Here's the naive code I used, in case you're interested:
    public static Matrix3D operator *(Matrix3D a, Matrix3D b)
    {
        double c11 = a.M11 * b.M11 + a.M12 * b.M21 + a.M13 * b.M31;
        double c12 = a.M11 * b.M12 + a.M12 * b.M22 + a.M13 * b.M32;
        double c13 = a.M11 * b.M13 + a.M12 * b.M23 + a.M13 * b.M33;
        double c21 = a.M21 * b.M11 + a.M22 * b.M21 + a.M23 * b.M31;
        double c22 = a.M21 * b.M12 + a.M22 * b.M22 + a.M23 * b.M32;
        double c23 = a.M21 * b.M13 + a.M22 * b.M23 + a.M23 * b.M33;
        double c31 = a.M31 * b.M11 + a.M32 * b.M21 + a.M33 * b.M31;
        double c32 = a.M31 * b.M12 + a.M32 * b.M22 + a.M33 * b.M32;
        double c33 = a.M31 * b.M13 + a.M32 * b.M23 + a.M33 * b.M33;
        return new Matrix3D(
            c11, c12, c13,
            c21, c22, c23,
            c31, c32, c33);
    }
where Matrix3D is an immutable struct (readonly double fields).
The tricky thing is to come up with a valid benchmark, where you measure your code and not what the compiler did with your code (a debug build with tons of extra stuff, or an optimized build that dropped your actual code because the result was never used). I usually try to "touch" the result so that the compiler cannot remove the code under test (e.g. check the matrix elements for equality with 89038.8989384 and throw if equal). However, in the end I'm not even sure whether the compiler hacks this comparison out of the way :)
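For illustration, here is a minimal sketch of such a benchmark. It is not the exact code I timed: the `Matrix3D` struct is a hypothetical stand-in matching the description above, the `Benchmark` class and its `Run` method are made-up names, and instead of the compare-and-throw trick it "touches" each result by adding a few elements to a checksum that is printed at the end. It also varies one input per iteration so the product cannot simply be hoisted out of the loop.

```csharp
using System;
using System.Diagnostics;

// Hypothetical stand-in for the immutable Matrix3D struct (readonly double fields).
public struct Matrix3D
{
    public readonly double M11, M12, M13, M21, M22, M23, M31, M32, M33;

    public Matrix3D(double m11, double m12, double m13,
                    double m21, double m22, double m23,
                    double m31, double m32, double m33)
    {
        M11 = m11; M12 = m12; M13 = m13;
        M21 = m21; M22 = m22; M23 = m23;
        M31 = m31; M32 = m32; M33 = m33;
    }

    // Naive row-by-column product, same arithmetic as the operator above.
    public static Matrix3D operator *(Matrix3D a, Matrix3D b)
    {
        return new Matrix3D(
            a.M11 * b.M11 + a.M12 * b.M21 + a.M13 * b.M31,
            a.M11 * b.M12 + a.M12 * b.M22 + a.M13 * b.M32,
            a.M11 * b.M13 + a.M12 * b.M23 + a.M13 * b.M33,
            a.M21 * b.M11 + a.M22 * b.M21 + a.M23 * b.M31,
            a.M21 * b.M12 + a.M22 * b.M22 + a.M23 * b.M32,
            a.M21 * b.M13 + a.M22 * b.M23 + a.M23 * b.M33,
            a.M31 * b.M11 + a.M32 * b.M21 + a.M33 * b.M31,
            a.M31 * b.M12 + a.M32 * b.M22 + a.M33 * b.M32,
            a.M31 * b.M13 + a.M32 * b.M23 + a.M33 * b.M33);
    }
}

public static class Benchmark
{
    // Multiplies `iterations` matrix pairs and returns a checksum of the results,
    // so every product feeds into a value the caller actually uses.
    public static double Run(int iterations)
    {
        var b = new Matrix3D(9, 8, 7, 6, 5, 4, 3, 2, 1);
        double checksum = 0;
        for (int i = 0; i < iterations; i++)
        {
            // The first element depends on i, so each product is different.
            var a = new Matrix3D(i, 2, 3, 4, 5, 6, 7, 8, 9);
            Matrix3D c = a * b;
            checksum += c.M11 + c.M22 + c.M33;
        }
        return checksum;
    }

    public static void Main()
    {
        const int N = 10000000;

        Run(1000); // warm-up, so JIT compilation is not part of the timing

        Stopwatch sw = Stopwatch.StartNew();
        double checksum = Run(N);
        sw.Stop();

        // Printing the checksum keeps the results observable.
        Console.WriteLine("{0} multiplications: {1} ms (checksum {2})",
            N, sw.ElapsedMilliseconds, checksum);
    }
}
```

Build it in Release mode and run it without a debugger attached; otherwise you are back to measuring the debug overhead. The extra additions for the checksum are included in the measured time, so this is good for comparing implementations against each other rather than for absolute numbers.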