
Just for learning purposes: I have a function that takes three arguments, as follows:

- a pointer to a 4x4 floating point matrix
- a pointer to a 4x1 floating point vector (input vector)
- a pointer to a 4x1 floating point vector (output vector)

The function's task is simply to multiply the matrix by the input vector and store the result in the output vector.

The function declaration in C/C++ looks something like this:

void mult(MATRIX4x4* m, VECTOR4x1* in, VECTOR4x1* out);

Note that the MATRIX4x4 and VECTOR4x1 structs are both 16-byte aligned.
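
(The struct definitions aren't shown; a minimal sketch of what they could look like, where the member names are my assumption and only the 4x4/4x1 shapes and the 16-byte alignment are stated above:

struct alignas(16) MATRIX4x4 { float m[4][4]; };  // row-major, as the asm below assumes
struct alignas(16) VECTOR4x1 { float v[4]; };

)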

I would like to use x86 SSE registers and instructions for the task. The assembly code I have written looks like this:

.686
.xmm
.model flat
; .........................................................
public ___asm_vec_mat_product
; .........................................................
CODE segment dword public 'CODE' use32
ASSUME cs:CODE
; .........................................................
___asm_vec_mat_product proc near
    push ebp
    mov  ebp, esp
    push esi                                ; esi/edi are callee-saved in cdecl, so preserve them
    push edi

    mov ecx, [ebp + 8 ]                     ;get matrix_4x4 pointer
    mov esi, [ebp + 12]                     ;get vector_4x1 input  pointer
    mov edi, [ebp + 16]                     ;get vector_4x1 output pointer

    movaps xmm4, xmmword ptr [esi]          ; load the input vector into a 128-bit register
    movaps xmm5, xmmword ptr [edi]          ; load the output vector (all four lanes are overwritten below)

    movaps xmm0, xmmword ptr [ecx]          ; get matrix elements
    movaps xmm1, xmmword ptr [ecx + 16]     ;
    movaps xmm2, xmmword ptr [ecx + 32]     ;
    movaps xmm3, xmmword ptr [ecx + 48]     ;       
    ; do the math -----------------------------------
        movaps   xmm7, xmm4                 ; copy the input vector (dpps overwrites its destination)
        movaps   xmm6, xmm4

        dpps     xmm7, xmm0, 11110001b      ; dot(row0, in) -> element 0 of xmm7
        dpps     xmm6, xmm1, 11110001b      ; dot(row1, in) -> element 0 of xmm6
        insertps xmm5, xmm7, 00000000b      ; xmm5[0] = xmm7[0]
        insertps xmm5, xmm6, 00010000b      ; xmm5[1] = xmm6[0]

        movaps   xmm7, xmm4

        dpps     xmm4, xmm2, 11110001b      ; dot(row2, in) -> element 0 of xmm4
        dpps     xmm7, xmm3, 11110001b      ; dot(row3, in) -> element 0 of xmm7
        insertps xmm5, xmm4, 00100000b      ; xmm5[2] = xmm4[0]
        insertps xmm5, xmm7, 00110000b      ; xmm5[3] = xmm7[0]

        movaps xmmword ptr [edi], xmm5      ; store the result as one 128-bit value
    ; do the math -----------------------------------
    pop  edi
    pop  esi
    pop  ebp
    ret
___asm_vec_mat_product endp
; .........................................................
CODE ends
end
; .........................................................

It works correctly, but my question is whether there is room to make it execute faster and optimize it further using x86 SSE instructions.

p.s.: I have already read the C++ and assembly optimization manuals provided by Agner Fog.

My CPU doesn't support AVX, so I'm only using SSE4.1.

  • Is there a particular reason why you don't go for AVX instead? – fuz Sep 19 '18 at 14:03
  • @fuz: yes, my CPU doesn't support AVX. – Iman Abdollahzadeh Sep 19 '18 at 14:15
  • Fair enough. I think you have to wait for the usual suspects to write an answer to your question. – fuz Sep 19 '18 at 16:28
  • With `dpps` it is possible to put the result in the 0th, 1st, 2nd, or 3rd position of the SSE register by selecting the right 8-bit immediate values. Then use `orps` with the results in vector 0 and 1, and `orps` with results in vector 2 and 3. You get the final result with another `orps` (see the first sketch after these comments). Nevertheless, it might be more efficient to do the four multiplies first, followed by three `haddps`, using the same idea as [here](https://stackoverflow.com/a/51275249/2439725), but only for 4 vectors and the lower 128 bits. – wim Sep 19 '18 at 18:38
  • Also, consider storing the matrix as 4 columns! In that case you avoid the relatively inefficient `haddps` and `dpps` instructions (see the column-major sketch below). – wim Sep 19 '18 at 18:39
  • @ImanAbdollahzadeh: What CPU *do* you have / what CPUs do you care about tuning for? `dpps` is slower on AMD than the `mulps`/shuffle/`addps`/shuffle/`addps` equivalent, I think. (See Agner Fog's instruction tables: 8 uops for `dpps` on Ryzen, vs. ~5 for an efficient horizontal sum, or maybe 6 if you need `movaps`+`shufps`. [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764).) – Peter Cordes Sep 19 '18 at 18:54
  • Writing a stand-alone asm function gives you the overhead of a `call`, and of passing args in memory. Also, you clobber xmm6 and 7, violating the Windows calling convention. Only the first few xmm regs are call-clobbered. You might still bottleneck on `dpps`, rather than on front-end throughput or other overhead from this not inlining, but if you care about performance you should be using intrinsics, or writing the whole loop in asm, especially if you want to use a crappy stack-args calling convention. Even `__vectorcall` wouldn't remove all the overhead. – Peter Cordes Sep 19 '18 at 18:58
  • @Peter Cordes: the targeted CPU is Intel® Core™ i7-920. Based on the link that you sent, I will try to test the function again and report the output. – Iman Abdollahzadeh Sep 19 '18 at 20:52
  • @wim: Thanks for the explanation. Even though the example in the given link is for AVX, I will make the function for SSE accordingly. I did not think of making the matrix column-major, as harold also suggested. – Iman Abdollahzadeh Sep 19 '18 at 20:57
  • Ok, then you have a Nehalem, and should read that section in Agner Fog's microarch pdf. And you don't care about performance on AMD, or any CPUs other than what you have? That's totally fine if you're just teaching yourself asm optimization; I did that on my Sandybridge while seriously getting into hand-tuning asm at first. (What people are suggesting with column-major and inlining / intrinsics will help on all CPUs, but some fine details of other tradeoffs you find later might end up being Nehalem specific, especially tuning for the front-end decoders since Nehalem doesn't have a uop cache) – Peter Cordes Sep 19 '18 at 21:34
  • @wim: I got slightly better performance (~10%) from what you suggested: placing each `dpps` result in a different position with the 8-bit immediates and then merging with several `orps`. – Iman Abdollahzadeh Sep 20 '18 at 19:07
  • @ImanAbdollahzadeh On newer CPUs `orps` runs on port 0, 1, or 5. On Nehalem `orps` goes to port 5, which is the same as `insertps`, so there will only be a minor benefit, indeed. I would expect that the other suggestions (4 x `mulps` with 3 x `haddps`, or column-major storage) give more improvement. Nevertheless, the calling overhead is not negligible, so don't expect too much speedup. – wim Sep 21 '18 at 10:41
  • You can also try to use `por` instead of some of the `orps` instructions. `por` has a throughput of 3 per cycle (port 0, 1, or 5) instead of 1 per cycle. This might improve the throughput of the matrix-vector product, but not the latency (which is usually of less interest). Note that `por` incurs extra latency, due to the bypass delay from moving vectors between the integer and floating-point domains. – wim Sep 21 '18 at 13:03
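
To make the suggestions in the comments concrete, here is a minimal intrinsics sketch of the `dpps` + `orps` idea (function and parameter names are placeholders; it assumes the row-major, 16-byte-aligned layout from the question). The low 4 bits of each `dpps` immediate place the dot product in a different lane and zero the rest, so three `orps` merge the four results:

#include <smmintrin.h>  // SSE4.1 intrinsics

void mat_vec_dpps_orps(const float* m, const float* in, float* out)
{
    __m128 v  = _mm_load_ps(in);
    __m128 r0 = _mm_dp_ps(_mm_load_ps(m +  0), v, 0xF1);  // dot(row0, v) -> lane 0
    __m128 r1 = _mm_dp_ps(_mm_load_ps(m +  4), v, 0xF2);  // dot(row1, v) -> lane 1
    __m128 r2 = _mm_dp_ps(_mm_load_ps(m +  8), v, 0xF4);  // dot(row2, v) -> lane 2
    __m128 r3 = _mm_dp_ps(_mm_load_ps(m + 12), v, 0xF8);  // dot(row3, v) -> lane 3
    _mm_store_ps(out, _mm_or_ps(_mm_or_ps(r0, r1), _mm_or_ps(r2, r3)));
}

The `mulps` + three `haddps` alternative from the same comment could look like this (same hypothetical naming):

void mat_vec_hadd(const float* m, const float* in, float* out)
{
    __m128 v   = _mm_load_ps(in);
    __m128 p0  = _mm_mul_ps(_mm_load_ps(m +  0), v);   // row0 * v, elementwise
    __m128 p1  = _mm_mul_ps(_mm_load_ps(m +  4), v);
    __m128 p2  = _mm_mul_ps(_mm_load_ps(m +  8), v);
    __m128 p3  = _mm_mul_ps(_mm_load_ps(m + 12), v);
    __m128 h01 = _mm_hadd_ps(p0, p1);                  // pairwise sums of rows 0 and 1
    __m128 h23 = _mm_hadd_ps(p2, p3);                  // pairwise sums of rows 2 and 3
    _mm_store_ps(out, _mm_hadd_ps(h01, h23));          // [dot0, dot1, dot2, dot3]
}

And the column-major layout wim and harold suggested removes horizontal operations entirely: broadcast each input element and accumulate the scaled columns. This sketch assumes the matrix has already been transposed into four 16-byte-aligned columns:

void mat_vec_columns(const float* cols, const float* in, float* out)
{
    __m128 v = _mm_load_ps(in);
    __m128 r =                _mm_mul_ps(_mm_shuffle_ps(v, v, 0x00), _mm_load_ps(cols +  0));   //   x * col0
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, 0x55), _mm_load_ps(cols +  4)));          // + y * col1
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, 0xAA), _mm_load_ps(cols +  8)));          // + z * col2
    r = _mm_add_ps(r, _mm_mul_ps(_mm_shuffle_ps(v, v, 0xFF), _mm_load_ps(cols + 12)));          // + w * col3
    _mm_store_ps(out, r);
}

The last version uses only `mulps`, `addps`, and `shufps`, and letting the compiler inline these functions also avoids the call and stack-args overhead Peter Cordes mentioned.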

0 Answers