I'm having trouble understanding a certain piece of code in assembly. The task is to find the dot product of 2 vectors using SSE arithmetic and the XMM registers. The approach is to read the vectors 4 floats at a time (meaning one XMM register holds four of them per iteration). The end result of that is an XMM register whose four 32-bit elements each hold a partial sum of products (x1*y1 + x5*y5 + ...) of the given vectors.

What I don't get is the part that comes afterwards. All that is needed is to sum these 'end' elements together, basically adding up the 4 floats that make up the final register. I tried finding something on this, but to no avail. What I'm given is beyond my understanding; I even tried writing every computation out on paper, and not much of it made sense. In the highlighted part, the actual sum is computed and stored in the lowest element of xmm0. Any insight on this is welcome.

.intel_syntax noprefix

.data
two: .int 2

.text
.global dot_product

############################################################################
##
## Function:
##
## void dot_product(float *x, float *y, int n, float *r);
##
## calculates the dot product of x and y (n lengths) and stores the result 
## in r
##
## -- float * x -- rdi -- 
## -- float * y -- rsi -- 
## -- int n -- rdx -- 
## -- float * r -- rcx -- 
##
############################################################################
dot_product:

        enter   0, 0


        mov r8, rcx
        mov r9, rdx


        mov     rax, 1
        cpuid
        test    rdx, 0x2000000
        jz not_supported


        mov     rdx, rsp
        and     rsp, 0xfffffffffffffff0
        sub     rsp, 512
        fxsave  [rsp]


        mov rcx, r9

        xorps   xmm0, xmm0

next_four:

        cmp     rcx, 4
        jb next_one



        movups  xmm1, [rsi]
        movups  xmm2, [rdi]
        mulps   xmm1, xmm2
        addps   xmm0, xmm1



        add     rsi, 16
        add     rdi, 16
        sub     rcx, 4
        jmp next_four

next_one:

        jrcxz finish




        movss   xmm1, [rsi]
        movss   xmm2, [rdi]
        mulss   xmm1, xmm2
        addss   xmm0, xmm1


        add     rsi, 4
        add     rdi, 4
        dec     rcx

        jmp next_one

finish: 

        # **summing the 4 floats, giving the actual dot product**
        movhlps xmm1, xmm0
        addps   xmm0, xmm1
        movaps  xmm1, xmm0
        shufps  xmm1, xmm1, 0b01010101
        addss   xmm0, xmm1


        movss   [r8], xmm0



        fxrstor [rsp]
        mov     rsp, rdx

done:

        leave
        ret

not_supported:


        mov rax, 1
        mov rbx, 1
        int 0x80
monolith937
  • This loop bottlenecks on the latency of `addps` (one iter per 3 or 4 clocks), instead of the throughput (1 or 0.5 clocks) because it only uses one vector accumulator. Unrolling with multiple accumulators could increase performance by a factor of 4 on Skylake, if memory bandwidth isn't the bottleneck. – Peter Cordes May 30 '18 at 00:38
  • And BTW, this hand-written asm is quite poorly optimized. The cmp/jb at the top as well as 3 add/sub and a jmp is a *lot* of loop overhead. (Still not enough to be slower than the `addps` latency bottleneck, but see [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) for more about loop structure.) You'd probably be best rewriting this in C with intrinsics and letting a compiler make code, or just letting a compiler auto-vectorize a scalar loop (with OpenMP, or with `-ffast-math` to allow reordering of FP ops). JRCXZ is also slower than cmp/jz. – Peter Cordes May 30 '18 at 02:38
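
For reference, here is a minimal sketch of the multiple-accumulator unrolling those comments suggest, written in the same style as the question's asm. This is untested; the unroll factor (16 floats per iteration), the label names, and the choice of xmm3-xmm5 as the extra accumulators are illustrative, not part of the original code.

        # four independent accumulators, so consecutive addps ops don't
        # have to wait for each other's results
        xorps   xmm0, xmm0
        xorps   xmm3, xmm3
        xorps   xmm4, xmm4
        xorps   xmm5, xmm5

next_sixteen:
        cmp     rcx, 16
        jb      combine

        movups  xmm1, [rsi]             # elements 0-3
        movups  xmm2, [rdi]
        mulps   xmm1, xmm2
        addps   xmm0, xmm1

        movups  xmm1, [rsi+16]          # elements 4-7
        movups  xmm2, [rdi+16]
        mulps   xmm1, xmm2
        addps   xmm3, xmm1

        movups  xmm1, [rsi+32]          # elements 8-11
        movups  xmm2, [rdi+32]
        mulps   xmm1, xmm2
        addps   xmm4, xmm1

        movups  xmm1, [rsi+48]          # elements 12-15
        movups  xmm2, [rdi+48]
        mulps   xmm1, xmm2
        addps   xmm5, xmm1

        add     rsi, 64
        add     rdi, 64
        sub     rcx, 16
        jmp     next_sixteen

combine:
        addps   xmm0, xmm3              # fold the four accumulators back into
        addps   xmm4, xmm5              # xmm0, then fall through to the original
        addps   xmm0, xmm4              # 4-at-a-time and 1-at-a-time loops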

1 Answer


This final code adds the 4 packed floats in xmm0 using only the normal addps/addss instructions. First it copies the 2 highest packed floats into the low floats of xmm1, so that one addps on xmm0 and xmm1 can do two of the additions at once; the 2 high floats of the result are "do-not-care". Then shufps copies the second-lowest remaining float to the lowest position. Think of each 2-bit field of the shufps immediate 'selector' as an array index into the source for the corresponding destination element. The only field that matters here is the low two bits, which equal index 1 and therefore move element 1 to element 0; the other fields are just placeholders. After that, a single scalar addition is all that is needed.

xmm0: D | C | B | A
 +
xmm1: X | X | D | C      (movhlps xmm0)
-------------------
 =    X | X | B+D | A+C


xmm0: X | X | B+D | A+C
 +
xmm1: X | X | X   | B+D   (shufps xmm0)
-----------------------
 =    X | X | X   | A + B + C + D

Here, X means "do not care". At the end, the sum sits in the lowest position to be extracted by movss.

That's only 2 addition instructions, staying in XMM registers the whole time. Summing the elements one at a time would take 3 additions, plus more explicit moves to bring each element down to the low position.
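
Annotated instruction by instruction, the highlighted sequence from the question reads (lane comments follow the diagram above, lowest element on the right):

        movhlps xmm1, xmm0              # xmm1 = [ X | X | D   | C   ]
        addps   xmm0, xmm1              # xmm0 = [ X | X | B+D | A+C ]
        movaps  xmm1, xmm0              # copy, so the shuffle doesn't clobber xmm0
        shufps  xmm1, xmm1, 0b01010101  # xmm1 = [ B+D | B+D | B+D | B+D ]
        addss   xmm0, xmm1              # low element of xmm0 = (A+C) + (B+D)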

More shufps detail:

Split the 0b01010101 immediate into 4 two-bit indices: 01 | 01 | 01 | 01, which in decimal is 1 | 1 | 1 | 1. Each index chooses which source element goes into the corresponding destination element. This gets more complicated for the higher 2 elements, which select from the second source operand as the documentation describes, but we don't care about those (and here both sources are the same register anyway). The result is copying element 1 to both element 0 and element 1, since the two low selectors are both 1.
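
Spelled out on the actual instruction (using the lane contents from the diagram above):

        # selector fields, high to low: 01 | 01 | 01 | 01
        # dest element 0 <- source element 1   (the B+D lane, the one we need)
        # dest element 1 <- source element 1
        # dest elements 2,3 <- element 1 of the second source (same register here)
        shufps  xmm1, xmm1, 0b01010101  # every lane of xmm1 now holds B+D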

EDIT: HADDPS is another possible implementation; it adds neighboring elements. Two HADDPS in sequence would take care of the final sum. The only way to know which is faster is to benchmark on your target processor, though this final step shouldn't matter much to the overall speed of the function.
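
A sketch of that alternative (requires SSE3; untested, following the same lane layout as above):

        haddps  xmm0, xmm0      # [ D | C | B | A ] -> [ C+D | A+B | C+D | A+B ]
        haddps  xmm0, xmm0      # every element now holds A+B+C+D
        movss   [r8], xmm0      # store the result, as in the original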

Peter
  • Could you elaborate more on the `shufps` line? I couldn't even begin to make sense of what it does, especially its 3rd argument `0b01010101` – monolith937 May 29 '18 at 19:51
  • [The documentation for `shufps` describes it pretty well](https://www.felixcloutier.com/x86/SHUFPS.html). – Jason R May 29 '18 at 20:22
  • `haddps` is not worth using on any CPU that supports it, unless you're optimizing for code-size or something other than raw speed, outside a hot loop. Or some weird decode effect on Nehalem or earlier (no uop cache). See [Fastest way to do horizontal float vector sum on x86](https://stackoverflow.com/q/6996764). – Peter Cordes May 30 '18 at 00:36