
I am currently getting started with SSE. The answer to my previous question ( Mutiplying vector by constant using SSE ) gave me the idea to test the difference between using intrinsics like `_mm_mul_ps()` and just using 'normal operators' (not sure what the best term is) like `*`.

So I wrote two test cases which only differ in the way the result is calculated:
Method 1:

int main(void){
    float4 a, b, c;

    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("method 1\n");
    c.v = a.v + b.v;      // <---
    print_vector(a);
    print_vector(b);
    printf("1.a) Computed output 1: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}  

Method 2:

int main(void){
    float4 a, b, c;

    a.v = _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f);
    b.v = _mm_set_ps(-1.0f, -2.0f, -3.0f, -4.0f);

    printf("\nmethod 2\n");
    c.v = _mm_add_ps(a.v, b.v);      // <---
    print_vector(a);
    print_vector(b);
    printf("1.b) Computed output 2: ");
    print_vector(c);

    exit(EXIT_SUCCESS);
}

Both test cases share the following:

#include <stdio.h>
#include <stdlib.h>
#include <xmmintrin.h>

typedef union float4{
    __m128  v;
    struct { float x, y, z, w; };  /* anonymous struct: without it, x, y, z and w
                                      would each be a separate union member and
                                      all alias the first float of v */
} float4;

void print_vector (float4 v){
    printf("%f,%f,%f,%f\n", v.x, v.y, v.z, v.w);
}

So to compare the code generated for both cases, I compiled using:
gcc -ggdb -msse -c t_vectorExtensions_method1.c

which resulted in (showing only the part where the two vectors are added, which differs):
Method 1:

    c.v = a.v + b.v;
  a1:   0f 57 c9                xorps  %xmm1,%xmm1
  a4:   0f 12 4d d0             movlps -0x30(%rbp),%xmm1
  a8:   0f 16 4d d8             movhps -0x28(%rbp),%xmm1
  ac:   0f 57 c0                xorps  %xmm0,%xmm0
  af:   0f 12 45 c0             movlps -0x40(%rbp),%xmm0
  b3:   0f 16 45 c8             movhps -0x38(%rbp),%xmm0
  b7:   0f 58 c1                addps  %xmm1,%xmm0
  ba:   0f 13 45 b0             movlps %xmm0,-0x50(%rbp)
  be:   0f 17 45 b8             movhps %xmm0,-0x48(%rbp)

Method 2:

    c.v = _mm_add_ps(a.v, b.v);
  a1:   0f 57 c0                xorps  %xmm0,%xmm0
  a4:   0f 12 45 a0             movlps -0x60(%rbp),%xmm0
  a8:   0f 16 45 a8             movhps -0x58(%rbp),%xmm0
  ac:   0f 57 c9                xorps  %xmm1,%xmm1
  af:   0f 12 4d b0             movlps -0x50(%rbp),%xmm1
  b3:   0f 16 4d b8             movhps -0x48(%rbp),%xmm1
  b7:   0f 13 4d f0             movlps %xmm1,-0x10(%rbp)
  bb:   0f 17 4d f8             movhps %xmm1,-0x8(%rbp)
  bf:   0f 13 45 e0             movlps %xmm0,-0x20(%rbp)
  c3:   0f 17 45 e8             movhps %xmm0,-0x18(%rbp)
/* Perform the respective operation on the four SPFP values in A and B.  */

extern __inline __m128 __attribute__((__gnu_inline__, __always_inline__, __artificial__))
_mm_add_ps (__m128 __A, __m128 __B)
{
  return (__m128) __builtin_ia32_addps ((__v4sf)__A, (__v4sf)__B);
  c7:   0f 57 c0                xorps  %xmm0,%xmm0
  ca:   0f 12 45 e0             movlps -0x20(%rbp),%xmm0
  ce:   0f 16 45 e8             movhps -0x18(%rbp),%xmm0
  d2:   0f 57 c9                xorps  %xmm1,%xmm1
  d5:   0f 12 4d f0             movlps -0x10(%rbp),%xmm1
  d9:   0f 16 4d f8             movhps -0x8(%rbp),%xmm1
  dd:   0f 58 c1                addps  %xmm1,%xmm0
  e0:   0f 13 45 90             movlps %xmm0,-0x70(%rbp)
  e4:   0f 17 45 98             movhps %xmm0,-0x68(%rbp)

Obviously, the code generated when using the intrinsic `_mm_add_ps()` is much larger. Why is this? Shouldn't using the intrinsic result in better code?

Emanuel Ey
  • After compiling with `gcc -ggdb -msse -O3 -c t_vectorExtensions_method1.c`, both cases generate the exact same output. So there is no benefit in using intrinsics. Is this always the case? – Emanuel Ey Mar 11 '11 at 18:45

1 Answer


All that really matters is the addps. In a more realistic use case, where you might be, say, adding two large vectors of floats in a loop, the body of the loop will just contain addps, two loads and a store, and some scalar integer instructions for address arithmetic. On a modern superscalar CPU many of these instructions will execute in parallel.

Note also that you're compiling with optimisation disabled, so you won't get particularly efficient code. Try `gcc -O3 -msse3 ...`.

Paul R
  • Compiled both cases with `-O3`, for sse, sse2 and sse3. All six cases generate the exact same machine code. (Since this is a simple addition, that doesn't really surprise me.) But: does it make sense to use intrinsic functions when there doesn't seem to be a difference? Using them makes code rather unreadable. – Emanuel Ey Mar 11 '11 at 19:02
  • @emanuel: you can't extrapolate one very simple test case to apply to the general case. Most modern compilers can do pretty well at auto-vectorizing code (esp. ICC), which is what you're seeing, but most fall apart on complex code or edge cases. IMO, stick to intrinsics; it keeps your code clear and doesn't leave you reliant on the compiler to do the right thing – Necrolis Mar 11 '11 at 19:22
  • @Emanuel: the ability to e.g. add vector types with `+`, etc, is gcc-specific, and is only a useful shorthand for a small subset of possible SSE operations. You really need to get familiar with the intrinsics if you're going to be doing any substantial SIMD programming. – Paul R Mar 11 '11 at 19:25