I am writing a small 3D graphics app to learn about the Clang vector and matrix extensions (the matrix types still seem to be under development, if I am reading the right version of the docs).
I am unsure how to write the most efficient code for a matrix-vector multiplication using these types. I am using:
typedef float float4 __attribute__((ext_vector_type(4)));
typedef float m4x4 __attribute__((matrix_type(4, 4)));
The doc says (regarding the indices to access the elements of a matrix):
The first specifies the number of rows, and the second specifies the number of columns.
        Column
          |
          v
Row -> | M00 M01 M02 M03 |
       | M10 M11 M12 M13 |
       | M20 M21 M22 X23 |
       | M30 M31 M32 M33 |
So I take it that m[2][3] (where m is an m4x4) gives me the element marked X23 in the matrix above.
Then (regarding the way the elements are laid out in memory):
The elements of a value of a matrix type are laid out in column-major order without padding.
So I get from this note that if I could look at the way the elements are stored in memory I would get:
M00 M10 M20 M30 - M01 M11 M21 M31 - M02 M12 M22 M32 - M03 M13 X23 M33
Do I get it right so far?
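To make my reading of the layout concrete, this is the offset calculation I am assuming. The helper name `column_major_offset` and the `ROWS` constant are mine, not something from the Clang docs, and the function only describes where an element would sit in a flat column-major buffer of 16 floats:

```c
#include <stddef.h>

enum { ROWS = 4 };  /* number of rows in the 4x4 matrix (my own constant) */

/* Offset, in elements, of (row, col) in a column-major buffer without
   padding: columns are stored one after another, ROWS elements each. */
static size_t column_major_offset(size_t row, size_t col) {
    return col * ROWS + row;
}
```

If that is right, m[2][3] (the X23 element) sits at offset 3 * 4 + 2 = 14, which matches its position in the sequence above.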
Does the order in which we access the elements of the matrix matter? (and am I doing it right?)
Then I assume that, to be efficient in my mat-float4 multiplication, I need to access the elements in the order they are laid out in memory, so:
m4x4 m;
float4 v = {0.2, 0.3, 0.4, 1};
float4 res = {
v.x * m[0][0] + v.y * m[1][0] + v.z * m[2][0] + v.w * m[3][0],
v.x * m[0][1] + v.y * m[1][1] + v.z * m[2][1] + v.w * m[3][1],
v.x * m[0][2] + v.y * m[1][2] + v.z * m[2][2] + v.w * m[3][2],
1 // ignore w element for now
};
Of course it's up to me to load the right values into m[0][0], m[0][1], ... using something like __builtin_matrix_column_major_load.
Am I over-complicating things, or does the order matter here? Is the expression above actually better than:
float4 res = {
v.x * m[0][0] + v.y * m[0][1] + v.z * m[0][2] + v.w * m[0][3],
v.x * m[1][0] + v.y * m[1][1] + v.z * m[1][2] + v.w * m[1][3],
v.x * m[2][0] + v.y * m[2][1] + v.z * m[2][2] + v.w * m[2][3],
1 // ignore w element for now
};
(assuming I have transposed the elements before calling __builtin_matrix_column_major_load)?
Is there a better way of doing it?
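One alternative I am considering, as far as I understand the docs: matrix types support the `*` operator between matrices of compatible shape, so a 4x4 times a 4x1 could be written directly instead of element by element. Below is a minimal sketch under that assumption. The function name `mat4_mul_vec4` and the macro `USE_CLANG_MATRIX` are my own; the matrix path needs Clang with `-fenable-matrix`, and the scalar loop is only a fallback so the same file builds elsewhere.

```c
#include <stddef.h>

/* Use the matrix extension only when the compiler reports it
   (Clang with -fenable-matrix). USE_CLANG_MATRIX is my own macro. */
#if defined(__has_extension)
#  if __has_extension(matrix_types)
#    define USE_CLANG_MATRIX 1
#  endif
#endif

/* out = M * v, where m holds the 16 floats of M in column-major order
   and v, out hold 4 floats each. */
static void mat4_mul_vec4(float *m, float *v, float *out) {
#ifdef USE_CLANG_MATRIX
    typedef float m4x4 __attribute__((matrix_type(4, 4)));
    typedef float m4x1 __attribute__((matrix_type(4, 1)));
    /* Arguments: pointer, rows, columns, column stride in elements. */
    m4x4 M = __builtin_matrix_column_major_load(m, 4, 4, 4);
    m4x1 V = __builtin_matrix_column_major_load(v, 4, 1, 4);
    m4x1 R = M * V;  /* matrix product; the code generator picks the instructions */
    __builtin_matrix_column_major_store(R, out, 4);
#else
    /* Portable fallback: the same product written as scalar loops. */
    for (size_t r = 0; r < 4; ++r) {
        out[r] = 0.0f;
        for (size_t c = 0; c < 4; ++c)
            out[r] += m[c * 4 + r] * v[c];
    }
#endif
}
```

Written this way, I would not be choosing an element access order at all, which is partly why I wonder whether the order in my hand-written versions matters.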
Now I understand these types are being developed at the moment. Yet I understand that the whole point of these types is to take advantage of SIMD instructions. If I do:
float4 a = {...};
float4 b = {...};
float4 c = a + b;
Then does adding the 4 floats of a to the respective 4 floats of b happen in a single cycle? So, concerning the mat-float4 multiplication: because I access the elements of the float4 and m4x4 individually in my code, it seems that I wouldn't be taking advantage of any optimization in this particular case?
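For reference, this is the kind of function I mean. My understanding (an assumption on my part, not something I have measured) is that on a target with 128-bit SIMD registers an optimized build turns the `+` into one packed add instruction, and how many cycles that takes depends on the CPU. The name `add4` is mine, and the `vector_size` branch is only a fallback so the sketch also builds with GCC:

```c
/* float4 as in my typedef above under Clang; GCC-style vector otherwise. */
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Element-wise sum: result[i] = a[i] + b[i] for i = 0..3. */
static float4 add4(float4 a, float4 b) {
    return a + b;
}
```

Looking at the generated assembly (for example with `-O2 -S`) would show whether a single packed add is emitted on a given target.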
So my second question: is there a better way of doing this?
- Should I keep the matrix vectors in 4 float4 and do float4 * float4 multiplications instead?
- I saw the post "Matrix-Vector and Matrix-Matrix multiplication using SSE", which gives an example of how to achieve mat-vector multiplication using SIMD instructions. It seems to pack the elements of the matrix into __m128 values and multiply them by the vector's elements using SIMD intrinsics such as _mm_add_ps and _mm_mul_ps.
- Should I just wait for this development to be more mature?
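To illustrate the first option, here is a minimal sketch of what I have in mind: the matrix is kept as four float4 columns and the product M * v is formed as a sum of columns, each scaled by one component of v. The `mat4` struct and the `splat` and `mat4_mul_vec` helpers are hypothetical names of mine, and the `vector_size` branch is only a fallback for GCC:

```c
#if defined(__clang__)
typedef float float4 __attribute__((ext_vector_type(4)));
#else
typedef float float4 __attribute__((vector_size(16)));
#endif

/* Hypothetical container: one float4 per matrix column. */
typedef struct {
    float4 col[4];
} mat4;

/* Broadcast one scalar into all four lanes. */
static float4 splat(float s) {
    return (float4){s, s, s, s};
}

/* M * v as a linear combination of the columns of M:
   v[0]*col0 + v[1]*col1 + v[2]*col2 + v[3]*col3.
   Each term is a whole-vector multiply, each + a whole-vector add. */
static float4 mat4_mul_vec(const mat4 *m, float4 v) {
    return m->col[0] * splat(v[0])
         + m->col[1] * splat(v[1])
         + m->col[2] * splat(v[2])
         + m->col[3] * splat(v[3]);
}
```

This has the same shape as the SSE version in the linked post, but without naming any x86 intrinsic, so in principle the compiler is free to pick instructions for whatever target I build for.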
Any feedback, or advice would be greatly appreciated. I am doing this as an exercise to learn about these new built-in types.