I'm testing which way of computing the dot product of two vectors is fastest for me, and whether I can find anything faster than the plain `a.x * b.x + a.y * b.y + a.z * b.z`. I've been looking at a lot of different posts on here, and I decided to try one of the functions from this answer.
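
For reference, a minimal sketch of the scalar version I'm benchmarking against, assuming a simple struct with `x`, `y`, `z` fields (the names `vec3` and `dot3_scalar` are just for illustration):

typedef struct { float x, y, z; } vec3;

/* Plain scalar dot product -- the baseline to beat. */
float dot3_scalar(vec3 a, vec3 b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}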

I have the following function in my C file:

#include <immintrin.h>

float hsum_sse1(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);        // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf        = _mm_movehl_ps(shuf, sums); // high half -> low half
    sums        = _mm_add_ss(sums, shuf);
    return        _mm_cvtss_f32(sums);
}
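
For context, here's roughly how I call it for the dot product itself: a sketch, assuming the vectors are padded to four floats with a zeroed fourth element (so the extra lane contributes nothing to the sum) and 16-byte aligned for `_mm_load_ps`; `dot3_sse` is just an illustrative name:

float dot3_sse(const float *a, const float *b) {
    __m128 va = _mm_load_ps(a);              // aligned 4-float load
    __m128 vb = _mm_load_ps(b);
    return hsum_sse1(_mm_mul_ps(va, vb));    // element-wise multiply, then horizontal sum
}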

and I compiled it with `gcc -std=c11 -march=native main.c`, but when I ran `objdump -d` to look at the generated assembly, I got a function that doesn't seem to use the intrinsics I specified:

00000000004005bd <hsum_sse1>:
  4005bd:   55                      push   %rbp
  4005be:   48 89 e5                mov    %rsp,%rbp
  4005c1:   48 83 ec 3c             sub    $0x3c,%rsp
  4005c5:   c5 f8 29 85 50 ff ff    vmovaps %xmm0,-0xb0(%rbp)
  4005cc:   ff 
  4005cd:   c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005d4:   ff 
  4005d5:   c5 f8 29 45 d0          vmovaps %xmm0,-0x30(%rbp)
  4005da:   c5 fa 16 45 d0          vmovshdup -0x30(%rbp),%xmm0
  4005df:   c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  4005e4:   c5 f8 28 85 50 ff ff    vmovaps -0xb0(%rbp),%xmm0
  4005eb:   ff 
  4005ec:   c5 f8 29 45 c0          vmovaps %xmm0,-0x40(%rbp)
  4005f1:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  4005f6:   c5 f8 29 45 b0          vmovaps %xmm0,-0x50(%rbp)
  4005fb:   c5 f8 28 45 b0          vmovaps -0x50(%rbp),%xmm0
  400600:   c5 f8 28 4d c0          vmovaps -0x40(%rbp),%xmm1
  400605:   c5 f0 58 c0             vaddps %xmm0,%xmm1,%xmm0
  400609:   c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40060e:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400613:   c5 f8 29 45 a0          vmovaps %xmm0,-0x60(%rbp)
  400618:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40061d:   c5 f8 29 45 90          vmovaps %xmm0,-0x70(%rbp)
  400622:   c5 f8 28 45 90          vmovaps -0x70(%rbp),%xmm0
  400627:   c5 f8 28 4d a0          vmovaps -0x60(%rbp),%xmm1
  40062c:   c5 f0 12 c0             vmovhlps %xmm0,%xmm1,%xmm0
  400630:   c5 f8 29 45 f0          vmovaps %xmm0,-0x10(%rbp)
  400635:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  40063a:   c5 f8 29 45 80          vmovaps %xmm0,-0x80(%rbp)
  40063f:   c5 f8 28 45 f0          vmovaps -0x10(%rbp),%xmm0
  400644:   c5 f8 29 85 70 ff ff    vmovaps %xmm0,-0x90(%rbp)
  40064b:   ff 
  40064c:   c5 f8 28 45 80          vmovaps -0x80(%rbp),%xmm0
  400651:   c5 fa 58 85 70 ff ff    vaddss -0x90(%rbp),%xmm0,%xmm0
  400658:   ff 
  400659:   c5 f8 29 45 e0          vmovaps %xmm0,-0x20(%rbp)
  40065e:   c5 f8 28 45 e0          vmovaps -0x20(%rbp),%xmm0
  400663:   c5 f8 29 85 60 ff ff    vmovaps %xmm0,-0xa0(%rbp)
  40066a:   ff 
  40066b:   c5 f8 28 85 60 ff ff    vmovaps -0xa0(%rbp),%xmm0
  400672:   ff 
  400673:   c5 f8 28 c0             vmovaps %xmm0,%xmm0
  400677:   c5 fa 11 85 4c ff ff    vmovss %xmm0,-0xb4(%rbp)
  40067e:   ff 
  40067f:   8b 85 4c ff ff ff       mov    -0xb4(%rbp),%eax
  400685:   89 85 4c ff ff ff       mov    %eax,-0xb4(%rbp)
  40068b:   c5 fa 10 85 4c ff ff    vmovss -0xb4(%rbp),%xmm0
  400692:   ff 
  400693:   c9                      leaveq 
  400694:   c3                      retq   

I don't know if it makes a difference, but I'm compiling this code on a CentOS VM running on Windows. Just to make sure, I downloaded Coreinfo from here and got the following output:

FPU             *       Implements i387 floating point instructions
MMX             *       Supports MMX instruction set
MMXEXT          -       Implements AMD MMX extensions
3DNOW           -       Supports 3DNow! instructions
3DNOWEXT        -       Supports 3DNow! extension instructions
SSE             *       Supports Streaming SIMD Extensions
SSE2            *       Supports Streaming SIMD Extensions 2
SSE3            *       Supports Streaming SIMD Extensions 3
SSSE3           *       Supports Supplemental SIMD Extensions 3
SSE4a           -       Supports Streaming SIMD Extensions 4a
SSE4.1          *       Supports Streaming SIMD Extensions 4.1
SSE4.2          *       Supports Streaming SIMD Extensions 4.2

so it seems like my CPU supports the SSE instructions I wrote in the C file. I also checked my GCC version (`gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)`), which seems compatible too. How can I get a more efficient compiled function?

  • `vmovhlps` and so on are right there, among all the noise from not enabling optimization: you forgot to use `-O3`. (Your CPU supports AVX and you used `-march=native`, so GCC emits the VEX encoding of your instructions. [AVX1 `vmovhlps` is the VEX form of SSE1 `movhlps`](https://www.felixcloutier.com/x86/movhlps).) – Peter Cordes May 27 '20 at 01:22
  • Ohh! Sorry, missed in the original post that they compiled with `-O3`. – Calvin Godfrey May 27 '20 at 01:24
  • And yes, dot product of 4-element vectors can be done with `mulps` + a standard hsum. On some CPUs it can be worth using SSE4.1 `dpps`, but only barely; it decodes to similar multiply / shuffle / add instructions and is mainly useful if you want to take advantage of its masking capability to ignore one or more elements, or just to save code size. For larger arrays, you want to just vertical add `mulps` results and hsum once at the end. – Peter Cordes May 27 '20 at 01:27
  • @PeterCordes I found that function, but one thing I couldn't find an answer to is what the 3rd parameter (mask) does. Do you have any resources about it? – Calvin Godfrey May 27 '20 at 01:29
  • Also, I wouldn't recommend using such an old version of GCC. That build is 5 years old; modern versions are smarter at optimizing, especially with AVX. Most distros package more up-to-date versions of GCC and clang. – Peter Cordes May 27 '20 at 01:31
  • DPPS: asm docs explain in full detail: https://www.felixcloutier.com/x86/dpps. Also the Intrinsics guide should have info: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. It's 4 uops on Intel CPUs like Haswell and Skylake, costing only one shuffle, so it is actually better than the simple way. But it's slower on Ice Lake, 6 uops https://www.uops.info/table.html. And AMD Zen runs it as 8 uops. See https://stackoverflow.com/tags/sse/info and https://stackoverflow.com/tags/x86/info for links to docs if you didn't already know about those. – Peter Cordes May 27 '20 at 01:33
  • So I guess `dpps` is so rarely used that Intel dropped some of the dedicated hardware for it in Ice Lake, using more microcoded uops. – Peter Cordes May 27 '20 at 01:37
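
To make the `_mm_dp_ps` mask question above concrete, here is a sketch based on the Intel documentation linked in the comments: the high four bits of the 8-bit immediate select which element pairs get multiplied and summed, and the low four bits select which lanes of the result receive that sum (the other lanes are zeroed).

#include <immintrin.h>

/* 3-element dot product via SSE4.1 dpps.
   Immediate 0x71: high nibble 0111 -> multiply and sum elements 0, 1, 2
   (element 3 is ignored); low nibble 0001 -> write the sum to lane 0
   and zero lanes 1-3. */
float dot3_dpps(__m128 a, __m128 b) {
    return _mm_cvtss_f32(_mm_dp_ps(a, b, 0x71));
}

And a sketch of the larger-array approach from the comments, vertically adding the `mulps` results and doing the horizontal sum only once at the end (assumes `n` is a multiple of 4, 16-byte-aligned pointers, and `hsum_sse1` from the question above; `dot_array` is an illustrative name):

#include <stddef.h>
#include <immintrin.h>

float dot_array(const float *a, const float *b, size_t n) {
    __m128 acc = _mm_setzero_ps();
    for (size_t i = 0; i < n; i += 4)           // one vector of products per step
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_load_ps(a + i),
                                         _mm_load_ps(b + i)));
    return hsum_sse1(acc);                      // single horizontal sum at the end
}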

0 Answers