I'm doing some testing to find the fastest way of computing the dot product of two vectors, and in particular whether I can beat the straightforward a.x * b.x + a.y * b.y + a.z * b.z. I've been looking at a lot of posts on here, and I decided to try one of the functions from this answer.
I have the following function in my C file:
#include <immintrin.h>

float hsum_sse1(__m128 v) {
    __m128 shuf = _mm_movehdup_ps(v);   // broadcast elements 3,1 to 2,0
    __m128 sums = _mm_add_ps(v, shuf);
    shuf = _mm_movehl_ps(shuf, sums);   // high half -> low half
    sums = _mm_add_ss(sums, shuf);
    return _mm_cvtss_f32(sums);
}
and I compiled it with gcc -std=c11 -march=native main.c. But when I ran objdump on the result to look at the generated assembly, I got a function with far more instructions than the handful of intrinsics I specified:
00000000004005bd <hsum_sse1>:
4005bd: 55 push %rbp
4005be: 48 89 e5 mov %rsp,%rbp
4005c1: 48 83 ec 3c sub $0x3c,%rsp
4005c5: c5 f8 29 85 50 ff ff vmovaps %xmm0,-0xb0(%rbp)
4005cc: ff
4005cd: c5 f8 28 85 50 ff ff vmovaps -0xb0(%rbp),%xmm0
4005d4: ff
4005d5: c5 f8 29 45 d0 vmovaps %xmm0,-0x30(%rbp)
4005da: c5 fa 16 45 d0 vmovshdup -0x30(%rbp),%xmm0
4005df: c5 f8 29 45 f0 vmovaps %xmm0,-0x10(%rbp)
4005e4: c5 f8 28 85 50 ff ff vmovaps -0xb0(%rbp),%xmm0
4005eb: ff
4005ec: c5 f8 29 45 c0 vmovaps %xmm0,-0x40(%rbp)
4005f1: c5 f8 28 45 f0 vmovaps -0x10(%rbp),%xmm0
4005f6: c5 f8 29 45 b0 vmovaps %xmm0,-0x50(%rbp)
4005fb: c5 f8 28 45 b0 vmovaps -0x50(%rbp),%xmm0
400600: c5 f8 28 4d c0 vmovaps -0x40(%rbp),%xmm1
400605: c5 f0 58 c0 vaddps %xmm0,%xmm1,%xmm0
400609: c5 f8 29 45 e0 vmovaps %xmm0,-0x20(%rbp)
40060e: c5 f8 28 45 f0 vmovaps -0x10(%rbp),%xmm0
400613: c5 f8 29 45 a0 vmovaps %xmm0,-0x60(%rbp)
400618: c5 f8 28 45 e0 vmovaps -0x20(%rbp),%xmm0
40061d: c5 f8 29 45 90 vmovaps %xmm0,-0x70(%rbp)
400622: c5 f8 28 45 90 vmovaps -0x70(%rbp),%xmm0
400627: c5 f8 28 4d a0 vmovaps -0x60(%rbp),%xmm1
40062c: c5 f0 12 c0 vmovhlps %xmm0,%xmm1,%xmm0
400630: c5 f8 29 45 f0 vmovaps %xmm0,-0x10(%rbp)
400635: c5 f8 28 45 e0 vmovaps -0x20(%rbp),%xmm0
40063a: c5 f8 29 45 80 vmovaps %xmm0,-0x80(%rbp)
40063f: c5 f8 28 45 f0 vmovaps -0x10(%rbp),%xmm0
400644: c5 f8 29 85 70 ff ff vmovaps %xmm0,-0x90(%rbp)
40064b: ff
40064c: c5 f8 28 45 80 vmovaps -0x80(%rbp),%xmm0
400651: c5 fa 58 85 70 ff ff vaddss -0x90(%rbp),%xmm0,%xmm0
400658: ff
400659: c5 f8 29 45 e0 vmovaps %xmm0,-0x20(%rbp)
40065e: c5 f8 28 45 e0 vmovaps -0x20(%rbp),%xmm0
400663: c5 f8 29 85 60 ff ff vmovaps %xmm0,-0xa0(%rbp)
40066a: ff
40066b: c5 f8 28 85 60 ff ff vmovaps -0xa0(%rbp),%xmm0
400672: ff
400673: c5 f8 28 c0 vmovaps %xmm0,%xmm0
400677: c5 fa 11 85 4c ff ff vmovss %xmm0,-0xb4(%rbp)
40067e: ff
40067f: 8b 85 4c ff ff ff mov -0xb4(%rbp),%eax
400685: 89 85 4c ff ff ff mov %eax,-0xb4(%rbp)
40068b: c5 fa 10 85 4c ff ff vmovss -0xb4(%rbp),%xmm0
400692: ff
400693: c9 leaveq
400694: c3 retq
I don't know if it makes a difference, but I'm compiling this code on a CentOS VM running on Windows. Just to make sure, I downloaded Coreinfo from here and got the following output:
FPU * Implements i387 floating point instructions
MMX * Supports MMX instruction set
MMXEXT - Implements AMD MMX extensions
3DNOW - Supports 3DNow! instructions
3DNOWEXT - Supports 3DNow! extension instructions
SSE * Supports Streaming SIMD Extensions
SSE2 * Supports Streaming SIMD Extensions 2
SSE3 * Supports Streaming SIMD Extensions 3
SSSE3 * Supports Supplemental SIMD Extensions 3
SSE4a - Supports Streaming SIMDR Extensions 4a
SSE4.1 * Supports Streaming SIMD Extensions 4.1
SSE4.2 * Supports Streaming SIMD Extensions 4.2
so it seems like my CPU should support the SSE instructions I used in the C file. I also checked my GCC version (gcc version 4.8.5 20150623 (Red Hat 4.8.5-39) (GCC)), which seems recent enough to support these intrinsics as well. How can I get a more efficient compiled function?