Just for learning purposes: I have a function that takes three arguments, as follows.
- a pointer to a 4x4 floating point matrix
- a pointer to a 4x1 floating point vector (input vector)
- a pointer to a 4x1 floating point vector (output vector).
The function simply multiplies the matrix by the input vector and stores the result in the output vector.
The function declaration in C/C++ looks something like this:
void mult(MATRIX4x4* m, VECTOR4x1* in, VECTOR4x1* out);
Note that the MATRIX4x4 and VECTOR4x1 structs are all 16-byte aligned.
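To make the intended result concrete, here is a minimal plain-C sketch of the same operation. The struct layouts are assumptions, since the real definitions of MATRIX4x4 and VECTOR4x1 are not shown here: a row-major float m[4][4] and a float v[4], both aligned with C11 _Alignas(16). The name mult_ref is hypothetical and only serves as a scalar reference to check the SSE version against.

```c
/* Assumed layouts: the real struct definitions are not shown in this post. */
typedef struct { _Alignas(16) float m[4][4]; } MATRIX4x4; /* row-major, 64 bytes */
typedef struct { _Alignas(16) float v[4];    } VECTOR4x1; /* 16 bytes */

/* Hypothetical scalar reference: out = m * in, one dot product per matrix row. */
void mult_ref(const MATRIX4x4 *m, const VECTOR4x1 *in, VECTOR4x1 *out)
{
    VECTOR4x1 tmp; /* temporary, so that in and out may point to the same vector */
    for (int r = 0; r < 4; ++r) {
        float sum = 0.0f;
        for (int c = 0; c < 4; ++c)
            sum += m->m[r][c] * in->v[c];
        tmp.v[r] = sum;
    }
    *out = tmp;
}
```

Each output element is the dot product of one matrix row with the input vector, which is the quantity each dpps in the assembly is meant to produce.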
I would like to use the x86 SSE registers and instructions for the task. The assembly code that I have written looks like this:
.486
.xmm
.model flat
; .........................................................
public ___asm_vec_mat_product
; .........................................................
CODE segment dword public 'CODE' use32
ASSUME cs:CODE
; .........................................................
___asm_vec_mat_product proc near
push ebp
mov ebp, esp
mov ecx, [ebp + 8 ] ;get matrix_4x4 pointer
mov esi, [ebp + 12] ;get vector_4x1 input pointer
mov edi, [ebp + 16] ;get vector_4x1 output pointer
movaps xmm4, xmmword ptr [esi] ; load the input vector (16-byte aligned)
movaps xmm5, xmmword ptr [edi] ; load the current contents of the output vector
movaps xmm0, xmmword ptr [ecx] ; matrix row 0
movaps xmm1, xmmword ptr [ecx + 16] ; matrix row 1
movaps xmm2, xmmword ptr [ecx + 32] ; matrix row 2
movaps xmm3, xmmword ptr [ecx + 48] ; matrix row 3
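; do the math -----------------------------------
movaps xmm7, xmm4 ; copy of the input vector
movaps xmm6, xmm4 ; second copy of the input vector
dpps xmm7, xmm0, 11110001b ; xmm7[0] = row 0 . input
dpps xmm6, xmm1, 11110001b ; xmm6[0] = row 1 . input
insertps xmm5, xmm7, 00000000b ; result[0] = xmm7[0]
insertps xmm5, xmm6, 00010000b ; result[1] = xmm6[0]
movaps xmm7, xmm4 ; fresh copy, since xmm4 is overwritten next
dpps xmm4, xmm2, 11110001b ; xmm4[0] = row 2 . input
dpps xmm7, xmm3, 11110001b ; xmm7[0] = row 3 . input
insertps xmm5, xmm4, 00100000b ; result[2] = xmm4[0]
insertps xmm5, xmm7, 00110000b ; result[3] = xmm7[0]
movaps xmmword ptr [edi], xmm5 ; store the result as one 128-bit value
; end of the math -------------------------------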
pop ebp
ret
___asm_vec_mat_product endp
; .........................................................
CODE ends
end
; .........................................................
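For readers unfamiliar with the two SSE4.1 instructions used above, the following is a portable C model of what dpps and insertps do with these immediates, as I understand them from the instruction reference. It is only an illustration, not the code under review: xmm_t, model_dpps, model_insertps and model_mult are hypothetical names, and the model covers only the register-source form of insertps.

```c
typedef struct { float f[4]; } xmm_t; /* stand-in for one XMM register */

/* Model of DPPS dst, src, imm8: bits 7..4 select which lane products are
   summed, bits 3..0 select which destination lanes receive the sum
   (unselected lanes become 0). */
xmm_t model_dpps(xmm_t dst, xmm_t src, unsigned imm8)
{
    float p[4];
    for (int i = 0; i < 4; ++i)
        p[i] = (imm8 & (0x10u << i)) ? dst.f[i] * src.f[i] : 0.0f;
    float sum = (p[0] + p[1]) + (p[2] + p[3]); /* pairwise, as in the reference pseudocode */
    xmm_t r;
    for (int i = 0; i < 4; ++i)
        r.f[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    return r;
}

/* Model of INSERTPS dst, src, imm8 with a register source: bits 7..6 select
   the source lane, bits 5..4 the destination lane, bits 3..0 lanes to zero. */
xmm_t model_insertps(xmm_t dst, xmm_t src, unsigned imm8)
{
    dst.f[(imm8 >> 4) & 3] = src.f[(imm8 >> 6) & 3];
    for (int i = 0; i < 4; ++i)
        if (imm8 & (1u << i))
            dst.f[i] = 0.0f;
    return dst;
}

/* Same sequence as the assembly, with rows[0..3] playing the role of xmm0..xmm3. */
xmm_t model_mult(const xmm_t rows[4], xmm_t in)
{
    xmm_t out = {{0}};
    for (int r = 0; r < 4; ++r) {
        xmm_t dot = model_dpps(in, rows[r], 0xF1);        /* 11110001b: sum all, write lane 0 */
        out = model_insertps(out, dot, (unsigned)r << 4); /* lane 0 of dot -> lane r of out */
    }
    return out;
}
```

With the mask 11110001b, every dpps leaves its dot product in lane 0 only, so each one needs a following insertps to move that lane into place. The four insertps immediates differ only in the destination-lane bits.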
The routine works correctly, but I would like to know whether it can be made faster or optimized further using x86 SSE instructions.
P.S.: I have already read the C++ and assembly optimization manuals by Agner Fog.
My CPU doesn't support AVX, so I'm limited to SSE4.1.