Clang built-in matrix and vector extension: efficient matrix-vector multiplication

Question

I am writing a small 3D graphics app to learn about Clang's vector and matrix extensions (the matrix extension still seems to be under development, if I am reading the right version of the docs).

I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. Using:

typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));

The doc says (regarding the indices to access the elements of a matrix):

The first specifies the number of rows, and the second specifies the number of columns.

     Column
        |
        v
Row->| M00 M01 M02 M03 |
     | M10 M11 M12 M13 |
     | M20 M21 M22 X23 |
     | M30 M31 M32 M33 |

So I understand that m[2][3] (where m is an m4x4) gives me the element that I marked X in the matrix above.

Then (regarding the way the elements are laid out in memory):

The elements of a value of a matrix type are laid out in column-major order without padding.

So I get from this note that if I could look at the way the elements are stored in memory I would get:

M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33
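
One way to sanity-check this reading without relying on the extension itself is a plain flat array. This is only a sketch: it assumes a 4x4 column-major layout without padding, as the doc describes, and the helper name cm_index is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { ROWS = 4, COLS = 4 };

/* Offset of element (row, col) in a column-major ROWS x COLS array:
   each column is stored contiguously, so moving one column to the
   right skips ROWS elements. */
static size_t cm_index(size_t row, size_t col)
{
    assert(row < ROWS && col < COLS);
    return col * ROWS + row;
}
```

With this mapping, the element at row 2, column 3 (the X above) lands at offset 14, which matches its position in the sequence written out above.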

Do I get it right so far?

Does the order in which we access the elements of the matrix matter? (and am I doing it right?)

Then I assume that, to be efficient in my mat-float4 multiplication, I need to access the elements in the order they are laid out in memory, so:

m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
    v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
    v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
    v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
    1 // ignore w element for now
};

Of course it's up to me to load the right values in m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.

Am I over-complicating things, or does the order matter here? Is the expression above actually better than:

float4 res = {
    v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
    v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
    v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
    1 // ignore w element for now
};

(assuming I have transposed the elements before calling __builtin_matrix_column_major_load).
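
To make the comparison concrete, here is a sketch of how the second form (matrix times column vector) could be written as a weighted sum of whole columns, so that memory is read in column-major order and no element is accessed individually. The names mat4_cols and mul_cols are hypothetical, and the vector_size fallback is an assumption added only so the sketch also builds with a compiler other than Clang.

```c
#include <assert.h>

#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
/* Assumption: GCC-style fallback so the sketch builds without Clang. */
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical container: the four columns of a 4x4 matrix. */
typedef struct { float4 col[4]; } mat4_cols;

/* res = M * v, computed as col0*v.x + col1*v.y + col2*v.z + col3*v.w.
   Each scalar is broadcast across the lanes of one column. */
static float4 mul_cols(const mat4_cols *m, float4 v)
{
    assert(m != 0);
    float4 res = m->col[0] * v[0];
    res += m->col[1] * v[1];
    res += m->col[2] * v[2];
    res += m->col[3] * v[3];
    return res;
}
```

Whether this actually compiles down to fewer instructions than the per-element version is something I would still need to check in the generated assembly.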

Is there a better way of doing it?

Now I understand these types are still being developed. Yet I understand that the whole point of these types is to take advantage of SIMD instructions. If I do:

float4 a = {...};
float4 b = {...};
float4 c = a + b;

Then adding the 4 floats of a to the corresponding 4 floats of b happens in a single cycle? So, concerning the mat-float4 multiplication: because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any such optimization in this particular case?

So my second question: is there a better way of doing this?

Should I keep the matrix vectors in 4 float4 and do float4 * float4 multiplications instead?
I saw the post "Matrix-Vector and Matrix-Matrix multiplication using SSE", which gives an example of how to achieve mat-vector multiplication using SIMD instructions. It seems to pack the elements of the matrix into __m128 values and multiply them by the vector's elements using SIMD intrinsics such as _mm_add_ps and _mm_mul_ps.
Should I just wait for this development to be more mature?
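
For reference, my understanding of the SSE approach is roughly the sketch below. It is not taken from that post: the function name mul_sse and the choice to store the matrix as four column registers are my own assumptions, and it only builds on x86 targets with SSE.

```c
#include <assert.h>
#include <xmmintrin.h> /* SSE: __m128, _mm_mul_ps, _mm_add_ps, _mm_shuffle_ps */

/* res = M * v, where cols[j] holds column j of M. */
static __m128 mul_sse(const __m128 cols[4], __m128 v)
{
    assert(cols != 0);

    /* Broadcast each component of v across all four lanes. */
    __m128 x = _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
    __m128 y = _mm_shuffle_ps(v, v, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 z = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 2, 2, 2));
    __m128 w = _mm_shuffle_ps(v, v, _MM_SHUFFLE(3, 3, 3, 3));

    /* Accumulate the weighted columns: four multiplies, three adds. */
    __m128 r = _mm_mul_ps(cols[0], x);
    r = _mm_add_ps(r, _mm_mul_ps(cols[1], y));
    r = _mm_add_ps(r, _mm_mul_ps(cols[2], z));
    r = _mm_add_ps(r, _mm_mul_ps(cols[3], w));
    return r;
}
```

This is the same weighted-sum-of-columns idea as keeping the matrix in 4 float4 values, just spelled with intrinsics instead of the vector extension's operators.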

Any feedback or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.

I don’t think people usually rely on these compiler-specific shenanigans. Check there for vector * row major matrix multiplication: https://github.com/microsoft/DirectXMath/blob/may2022/Extensions/DirectXMathAVX2.h#L329-L344 — Soonts, Jul 05 '22 at 01:06
Have you tried looking at the code-gen on https://godbolt.org/? As far as efficiency, that's a good way to see if it uses a minimal number of shuffles and `mulps` instructions, and stuff like that. — Peter Cordes, Jul 05 '22 at 02:59
@Soonts the github/ms is super useful. Thx. It seems like I should stick to the SSE/AVX/... extensions indeed for now (at least). — user18490, Jul 06 '22 at 13:48
@user18490 Yeah, AFAIK that’s what people usually do for both graphics (DirectXMath, GLM) and HPC (Eigen) applications. — Soonts, Jul 06 '22 at 17:32
