SSE: not seeing a speedup by using _mm_add_epi32

Question

I would expect SSE to be faster than not using SSE. Do I need to add some additional compiler flags? Could it be that I am not seeing a speedup because this is integer code and not floating point?

invocation/output

$ make sum2
clang -O3 -msse -msse2 -msse3 -msse4.1 sum2.c ; ./a.out 123
n: 123
  SSE Time taken: 0 seconds 124 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68
NOSSE Time taken: 0 seconds 115 milliseconds
vector+vector:begin int: 1 5 127 0
vector+vector:end int: 0 64 66 68

compiler

$ clang --version
Apple LLVM version 9.0.0 (clang-900.0.37)
Target: x86_64-apple-darwin16.7.0
Thread model: posix

sum2.c

#include <stdlib.h>
#include <stdio.h>
#include <x86intrin.h>
#include <time.h>
#ifndef __cplusplus
#include <stdalign.h>   // C11 defines _Alignas().  This header defines alignas()
#endif
#define CYCLE_COUNT  10000

// add vector and return resulting value on stack
__attribute__((noinline)) __m128i add_iv(__m128i *a, __m128i *b) {
    return _mm_add_epi32(*a,*b);
}

// add int vectors via sse
__attribute__((noinline)) void add_iv_sse(__m128i *a, __m128i *b, __m128i *out, int N) {
    for(int i=0; i<N/sizeof(int); i++) { 
        //out[i]= _mm_add_epi32(a[i], b[i]); // this also works
        _mm_storeu_si128(&out[i], _mm_add_epi32(a[i], b[i]));
    } 
}

// add int vectors without sse
__attribute__((noinline)) void add_iv_nosse(int *a, int *b, int *out, int N) {
    for(int i=0; i<N; i++) { 
        out[i] = a[i] + b[i];
    } 
}

__attribute__((noinline)) void p128_as_int(__m128i in) {
    alignas(16) uint32_t v[4];
    _mm_store_si128((__m128i*)v, in);
    printf("int: %i %i %i %i\n", v[0], v[1], v[2], v[3]);
}

// print first 4 and last 4 elements of int array
__attribute__((noinline)) void debug_print(int *h) {
    printf("vector+vector:begin ");
    p128_as_int(* (__m128i*) &h[0] );
    printf("vector+vector:end ");
    p128_as_int(* (__m128i*) &h[32764] );
}

int main(int argc, char *argv[]) {
    int n = atoi (argv[1]);
    printf("n: %d\n", n);
    // sum: vector + vector, of equal length
    int f[32768] __attribute__((aligned(16))) = {0,2,4};
    int g[32768] __attribute__((aligned(16))) = {1,3,n};
    int h[32768] __attribute__((aligned(16))); 
    f[32765] = 33; f[32766] = 34; f[32767] = 35;
    g[32765] = 31; g[32766] = 32; g[32767] = 33;

    // https://stackoverflow.com/questions/459691/best-timing-method-in-c
    clock_t start = clock();
        for(int i=0; i<CYCLE_COUNT; ++i) {
            add_iv_sse((__m128i*)f, (__m128i*)g, (__m128i*)h, 32768);
        }
    int msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
    printf("  SSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
    debug_print(h);

    // process intense function again
    start = clock();
        for(int i=0; i<CYCLE_COUNT; ++i) {
            add_iv_nosse(f, g, h, 32768);
        }
    msec = (clock()-start) * 1000 / CLOCKS_PER_SEC;
    printf("NOSSE Time taken: %d seconds %d milliseconds\n", msec/1000, msec%1000);
    debug_print(h);

    return EXIT_SUCCESS;
}

Also, your comment on `add_iv` (which you fortunately never use) is wrong: a `__m128i` return value is returned in XMM0 in the x86-64 System V calling convention, not on the stack. — Peter Cordes, Oct 16 '17 at 01:18
Thanks Peter! Is there a way to prevent the compiler from using SSE instructions in certain blocks? — AG1, Oct 16 '17 at 01:19
I updated my answer with some perf analysis of the auto-vectorized code vs. your manual-vectorized loop. They both have a lot of overhead, but I think manual should have been faster unless 4k-aliasing hurt its bandwidth. So maybe turbo effects are making the 2nd loop take less wall-clock time even if it takes more CPU cycles, or maybe there's a different effect. — Peter Cordes, Oct 16 '17 at 01:55

Peter Cordes · Accepted Answer · 2017-10-16T08:30:52.867

Look at the asm: clang -O2 or -O3 probably auto-vectorizes add_iv_nosse (with a check for overlap, since you didn't use int * restrict a and so on).
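To illustrate the restrict point, here is a minimal sketch assuming C99 or later. The name add_iv_nosse_restrict is hypothetical (it is not in the question); the qualifiers promise the compiler that out doesn't overlap the inputs, so an auto-vectorizer has no need for the runtime overlap check.

```c
#include <stddef.h>

// Hypothetical variant of add_iv_nosse. restrict promises that out does not
// overlap a or b, so the compiler may vectorize without a runtime overlap
// check. Breaking that promise (out overlapping an input) is undefined behavior.
void add_iv_nosse_restrict(const int *restrict a, const int *restrict b,
                           int *restrict out, size_t n) {
    for (size_t i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}
```

Comparing the asm for this against the original add_iv_nosse should show whether the overlap check went away with your compiler version.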

Use -fno-tree-vectorize to disable auto-vectorization without stopping you from using intrinsics. I'd recommend clang -march=native -mno-avx -O3 -fno-tree-vectorize to test what I think you want to test: scalar integer vs. legacy-SSE paddd. (-fno-tree-vectorize works in both gcc and clang. In clang, AFAIK it's a synonym for the clang-specific -fno-vectorize.)

BTW, timing both in the same executable hurts the first one, because the CPU doesn't ramp to full turbo right away. You're probably into the timed section of the code before your CPU hits full speed. (So run this a couple of times back-to-back, with for i in {1..10}; do time ./a.out 123; done.)

On Linux I'd use perf stat -r5 ./a.out 123 to run it 5 times with performance counters (and I'd split it up so one run tested one or the other, so I could look at perf counters for the whole run).
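If you do keep both measurements in one executable, a rough sketch of a warm-up pass before the timed section could look like the following. The names add_ints and time_kernel_ms are hypothetical, and the kernel signature matches add_iv_nosse only; add_iv_sse would need a small wrapper. This doesn't replace separate runs with perf counters, it only reduces the bias against whichever loop runs first.

```c
#include <time.h>

typedef void (*kernel_fn)(int *a, int *b, int *out, int n);

// Stand-in scalar kernel with the same signature as add_iv_nosse.
// noinline (GCC/clang attribute) keeps the call inside the timed loop.
__attribute__((noinline)) void add_ints(int *a, int *b, int *out, int n) {
    for (int i = 0; i < n; i++) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical helper: run `warmup` untimed repetitions so the CPU clock has
// a chance to ramp up, then return the CPU time in milliseconds used by
// `reps` timed repetitions.
double time_kernel_ms(kernel_fn fn, int *a, int *b, int *out, int n,
                      int warmup, int reps) {
    for (int i = 0; i < warmup; i++) {
        fn(a, b, out, n);
    }
    clock_t start = clock();
    for (int i = 0; i < reps; i++) {
        fn(a, b, out, n);
    }
    return (double)(clock() - start) * 1000.0 / CLOCKS_PER_SEC;
}
```

How many warm-up repetitions are enough depends on the CPU and OS power management, so treat any fixed count as a guess and check that the timings stop changing when you increase it.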

Code review:

You forgot stdint.h for uint32_t. I had to add that to get it to compile on Godbolt to see the asm. (Assuming clang-5.0 is something like the Apple clang version you're using. IDK if Apple's clang implies a default -mtune= option, but that would make sense because it's only targeting Mac. Also a baseline SSSE3 would make sense for 64-bit on x86-64 OS X.)

You don't need noinline on debug_print. Also, I'd recommend a different name for CYCLE_COUNT. Cycles in this context makes me think of clock cycles, so call it REP_COUNT or REPEATS or whatever.

Putting your arrays on the stack in main is probably fine. You do initialize both input arrays (to mostly zero, but add performance isn't data-dependent).

This is good, because leaving them uninitialized might mean that multiple 4k pages of each array were copy-on-write mapped to the same physical zero page, so you'd get more than the expected number of L1D cache hits.

The SSE2 loop should bottleneck on L2 / L3 cache bandwidth, since the working set is 3 arrays * 32768 ints * 4 bytes = 384 kiB, so it's about 1.5x the 256kiB L2 cache in Intel CPUs.
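The working-set arithmetic can be spelled out as a tiny helper. This is only a sketch; working_set_bytes is a hypothetical name, and the 256 kiB L2 size is the figure assumed in this answer, not something queried from the hardware.

```c
#include <stddef.h>

// Bytes touched by one pass over `arrays` arrays of `elems` elements each.
size_t working_set_bytes(size_t arrays, size_t elems, size_t elem_size) {
    return arrays * elems * elem_size;
}
```

For the code in the question, working_set_bytes(3, 32768, sizeof(int)) is 393216 bytes with a 4-byte int, which is 384 kiB, or 1.5 times a 256 kiB L2.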

clang might unroll its auto-vectorized loop more than it does your manual intrinsics loop. That might explain better performance, since only 16B vectors (not 32B AVX2) might not saturate cache bandwidth if you're not getting 2 loads + 1 store per clock.

Update: actually the loop overhead is pretty extreme, with 3 pointer increments + a loop counter, and only unrolling by 2 to amortize that.

The auto-vectorized loop:

.LBB2_12:                               # =>This Inner Loop Header: Depth=1
    movdqu  xmm0, xmmword ptr [r9 - 16]
    movdqu  xmm1, xmmword ptr [r9]         # hoisted load for 2nd unrolled iter
    movdqu  xmm2, xmmword ptr [r10 - 16]
    paddd   xmm2, xmm0
    movdqu  xmm0, xmmword ptr [r10]
    paddd   xmm0, xmm1
    movdqu  xmmword ptr [r11 - 16], xmm2
    movdqu  xmmword ptr [r11], xmm0
    add     r9, 32
    add     r10, 32
    add     r11, 32
    add     rbx, -8               # add / jne  macro-fused on SnB-family CPUs
    jne     .LBB2_12

So it's 12 fused-domain uops, and can run at best 2 vectors per 3 clocks, bottlenecked on the front-end issue bandwidth of 4 uops per clock.

It's not using aligned loads because the compiler doesn't have that info without inlining into main where the alignment is known, and you didn't guarantee alignment with p = __builtin_assume_aligned(p, 16) or anything in the stand-alone function. Aligned loads (or AVX) would let paddd use a memory operand instead of a separate movdqu load.
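A minimal sketch of the __builtin_assume_aligned approach, assuming gcc or clang (the builtin is a compiler extension, not standard C). The function name add_iv_aligned is hypothetical. Whether the compiler then actually emits aligned loads or memory-operand paddd is up to the compiler, so check the asm.

```c
#include <stddef.h>

// Sketch: promise the compiler that all three arrays are 16-byte aligned.
// Passing a pointer that is not really 16-byte aligned is undefined behavior.
// Without restrict, the compiler may still emit a runtime overlap check.
void add_iv_aligned(const int *a, const int *b, int *out, size_t n) {
    const int *pa = __builtin_assume_aligned(a, 16);
    const int *pb = __builtin_assume_aligned(b, 16);
    int *po = __builtin_assume_aligned(out, 16);
    for (size_t i = 0; i < n; i++) {
        po[i] = pa[i] + pb[i];
    }
}
```

The arrays in main are already declared with __attribute__((aligned(16))), so they satisfy the promise this function makes.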

The manually-vectorized loop uses aligned loads to save front-end uops, but has more loop overhead from the loop counter.

.LBB1_7:                                # =>This Inner Loop Header: Depth=1
    movdqa  xmm0, xmmword ptr [rcx - 16]
    paddd   xmm0, xmmword ptr [rax - 16]
    movdqu  xmmword ptr [r11 - 16], xmm0

    movdqa  xmm0, xmmword ptr [rcx]
    paddd   xmm0, xmmword ptr [rax]
    movdqu  xmmword ptr [r11], xmm0

    add     r10, 2               # separate loop counter
    add     r11, 32              # 3 pointer increments
    add     rax, 32
    add     rcx, 32
    cmp     r9, r10              # compare the loop counter
    jne     .LBB1_7

So it's 11 fused-domain uops. It should be running faster than the auto-vectorized loop. Your timing method probably caused the problem.

(Unless mixing loads and stores is actually making it less optimal. The auto-vectorized loop did 4 loads and then 2 stores. Actually that might explain it. Your arrays are a multiple of 4kiB, and might all have the same relative alignment. So you might be getting 4k aliasing here, which means the CPU isn't sure that a store doesn't overlap a load. I think there's a performance counter you can check for that.)
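One cheap way to check the precondition for that guess (arrays with the same relative alignment) is to compare the low 12 address bits of the three arrays. This is only a sketch with hypothetical helper names; it tells you whether the arrays sit at the same offset within a 4 kiB page, not whether 4k aliasing is actually costing you anything. For that you still need the performance counter.

```c
#include <stddef.h>
#include <stdint.h>

// Offset of an address within its 4 kiB page (the low 12 address bits).
static size_t page_offset(const void *p) {
    return (size_t)((uintptr_t)p & 4095u);
}

// Nonzero if both pointers have the same low 12 address bits.
static int same_page_offset(const void *a, const void *b) {
    return page_offset(a) == page_offset(b);
}
```

In the question's main, printing same_page_offset(f, h) and same_page_offset(g, h) would show whether the output array lines up with the inputs. If it does, padding one of the arrays by a cache line or two and re-timing is an easy experiment.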

See also Agner Fog's microarch guide (and instruction tables + optimization guide), and other links in the x86 tag wiki, especially Intel's optimization guide.

There's also some good SSE/SIMD beginner stuff in the sse tag wiki.

With Clang I usually use `-fno-vectorize`. Why use `-fno-tree-vectorize` (except to be consistent with GCC)? — Z boson, Oct 16 '17 at 08:17
@Zboson: I didn't know clang had a different name for that option, thanks. — Peter Cordes, Oct 16 '17 at 08:29
