Fast rotation/transformation matrix multiplications

Question

I am looking for the most efficient way to do matrix * matrix and matrix * vector operations for 3x3 rotation and 4x4 transformation matrices in C#.

I currently store my matrices in multidimensional arrays (new double[3,3], new double[4,4]). I am not totally adverse to changing that but if possible I would like to keep the syntax. My current multiplication using the 3 standard nested for loops works fine but can be a bottleneck.

My thoughts so far:

Optimized algorithms like Strassen are not practical for these sizes
Parallelisation does not make much sense either at the level of a single 4x4 multiplication; better done at a higher level.
multidimensional arrays are (were?) slower in c# due to less efficient boundary checks, however this can be overcome with unsafe pointer arithmetic. (I am not sure how current this information is)
rotation matrices are symmetric, there might be a way to exploit that?
the biggest gains can probably be achieved by using cache-locality, making sure that values that are close together in memory are accessed together; but I am unsure how to do this.

So before I hack together my own solution using unsafe, fixed and 3 for loops, is there already a tested and optimized solution for this standard problem out there?

Or are there other optimizations that I have overlooked?

Why don't you use the existing Microsoft.Xna.Framework.Matrix struct? (http://msdn.microsoft.com/en-us/library/microsoft.xna.framework.matrix.aspx) It also has a Multiply method: http://msdn.microsoft.com/en-us/library/bb198134.aspx — pascalhein, Apr 04 '13 at 17:35
Good point, hadn't noticed that before! Checking it out, it would involve some rather large syntax changes, and switching from double to float, as well as adding a dependency to the xna framework, but it is definitely something to consider. — HugoRune, Apr 04 '13 at 17:40
The corresponding vector would be `Vector2`/`Vector3`/`Vector4`, and you can use `Vector3.Transform()` as `matrix * vector` (http://msdn.microsoft.com/en-us/library/microsoft.xna.framework.vector3.transform.aspx) — pascalhein, Apr 04 '13 at 17:41
Ugh, I hate the CLR multidimensional arrays - they tend to not only be slow as balls, but also don't play nice with many other concepts, from linq to reflection; only advice I could give (other than also mentioned Matrix types in both xna and wpf) would be to go to a linear array with explicit width and offset calcs. — JerKimball, Apr 04 '13 at 18:11
The answers to [this post](http://stackoverflow.com/q/3229442/21727) seem to suggest using jagged arrays, rather than multidimensional. — mbeckish, Apr 04 '13 at 18:53

John Alexiou · Accepted Answer · 2013-04-08T13:23:15.440

This is what I use and it works surprisingly fast.

public struct Matrix3 
{
    public readonly double a11, a12, a13;
    public readonly double a21, a22, a23;
    public readonly double a31, a32, a33;
    ...
    public vec3 Multiply(vec3 rhs)
    {
        // y= A*x
        // fill vector by element
        return new vec3(
            (a11*rhs.X+a12*rhs.Y+a13*rhs.Z),
            (a21*rhs.X+a22*rhs.Y+a23*rhs.Z),
            (a31*rhs.X+a32*rhs.Y+a33*rhs.Z));
    }

    public mat3 Multiply(mat3 rhs)
    {
        // Y = A*X
        // fill matrix by row
        return new mat3(
            (a11*rhs.a11+a12*rhs.a21+a13*rhs.a31),
            (a11*rhs.a12+a12*rhs.a22+a13*rhs.a32),
            (a11*rhs.a13+a12*rhs.a23+a13*rhs.a33),

            (a21*rhs.a11+a22*rhs.a21+a23*rhs.a31),
            (a21*rhs.a12+a22*rhs.a22+a23*rhs.a32),
            (a21*rhs.a13+a22*rhs.a23+a23*rhs.a33),

            (a31*rhs.a11+a32*rhs.a21+a33*rhs.a31),
            (a31*rhs.a12+a32*rhs.a22+a33*rhs.a32),
            (a31*rhs.a13+a32*rhs.a23+a33*rhs.a33));
    }
}

where vec3 and mat3 are aliases to my own Vector3 and Matrix3 structures that store the elements are fields. Similarly for 4 element structures. Also I have coded it the inverses like this:

    public double Determinant()
    {
        return a11*(a22*a33-a23*a32)
            +a12*(a23*a31-a21*a33)
            +a13*(a21*a32-a22*a31);
    }
    /// <summary>
    /// Solves the system of equations this*x=rhs for x
    /// </summary>
    public vec3 Solve(vec3 rhs)
    {
        double D=Determinant();
        double ID=1/D;
        return new vec3(
            (((a22*a33-a23*a32)*rhs.X+(a13*a32-a12*a33)*rhs.Y+(a12*a23-a13*a22)*rhs.Z)*ID),
            -(((a21*a33-a23*a31)*rhs.X+(a13*a31-a11*a33)*rhs.Y+(a11*a23-a13*a21)*rhs.Z)*ID),
            (((a21*a32-a22*a31)*rhs.X+(a12*a31-a11*a32)*rhs.Y+(a11*a22-a12*a21)*rhs.Z)*ID));
    }
    /// <summary>
    /// Solves the system of equations this*X = rhs for X
    /// </summary>
    public mat3 Solve(mat3 rhs)
    {
        double D=Determinant();
        double ID=1/D;
        return new mat3(
            (((a22*a33-a23*a32)*rhs.a11+(a13*a32-a12*a33)*rhs.a21+(a12*a23-a13*a22)*rhs.a31)*ID),
            (((a22*a33-a23*a32)*rhs.a12+(a13*a32-a12*a33)*rhs.a22+(a12*a23-a13*a22)*rhs.a32)*ID),
            (((a22*a33-a23*a32)*rhs.a13+(a13*a32-a12*a33)*rhs.a23+(a12*a23-a13*a22)*rhs.a33)*ID),

            -(((a21*a33-a23*a31)*rhs.a11+(a13*a31-a11*a33)*rhs.a21+(a11*a23-a13*a21)*rhs.a31)*ID),
            -(((a21*a33-a23*a31)*rhs.a12+(a13*a31-a11*a33)*rhs.a22+(a11*a23-a13*a21)*rhs.a32)*ID),
            -(((a21*a33-a23*a31)*rhs.a13+(a13*a31-a11*a33)*rhs.a23+(a11*a23-a13*a21)*rhs.a33)*ID),

            (((a21*a32-a22*a31)*rhs.a11+(a12*a31-a11*a32)*rhs.a21+(a11*a22-a12*a21)*rhs.a31)*ID),
            (((a21*a32-a22*a31)*rhs.a12+(a12*a31-a11*a32)*rhs.a22+(a11*a22-a12*a21)*rhs.a32)*ID),
            (((a21*a32-a22*a31)*rhs.a13+(a12*a31-a11*a32)*rhs.a23+(a11*a22-a12*a21)*rhs.a33)*ID));
    }

score 1 · Answer 2 · answered Apr 08 '13 at 11:00

1

If you want it to be in Microsoft C# for performace I would;

Roll out the loops. Do not use loops but write it all out This is feasible for these smaller fixed size multiplies.
After applying the first suggestion, try an Unsafe Fixed version (this still matters for many fast array accesses)

For Mono, the Mono.SIMD library might be worth a look.

For parallelism using a GPU would be a good fit if you can offload many of these at the same time. For C# I'd look into http://www.hybriddsp.com/Products/CUDAfyNET.aspx but there might be others. I have not yet done any GPU stuff from C#, but this one would be my starting point.

answered Apr 08 '13 at 11:00

IvoTops

3,463
17
18

1

It is not worth parallelizing 3x3 and 4x4 matrix operations. Only with 1000 or more elements there might be some benefit (from my experience). Also unsafe fixed struct in practice are not worth it for this application. – John Alexiou Apr 08 '13 at 13:21
1

@ja72 Therefore I said 'if you can offload many of these'. In that case you can do hundreds of 4x4 in parallel. I did not say that you should parallelize the individual operations for once 4x4. – IvoTops Apr 08 '13 at 17:45

Fast rotation/transformation matrix multiplications

2 Answers2