4

I would like to vectorize the following operation:

V[i+1] = max(V[i] - c, V[i+1]) for i=1 to n-1 (V[0] = 0)

The corresponding naive pseudo-code is:

for (i=0; i < n; i++) {
  if (V[i]-c > V[i+1]) V[i+1] = V[i]-c
}

Which SIMD instructions could be useful?

nalzok
Karl Forner
  • 3
    I think that serial dependency is probably going to kill any chance of using SIMD. The only possibility I can see is if the test condition is known to be true for only a relatively small fraction of the array elements - might that be the case? – Paul R Mar 18 '16 at 13:12
  • None. You need to break the dependency first. – user3528438 Mar 18 '16 at 13:12
  • 2
    You could SIMD this for 4 arrays at once, if they were stored interleaved together in a single array of structs, but that's unlikely to be useful. The only reason I'm not upvoting this question is that there's no useful answer, so it fails the "is useful" criterion. It's clear and well-asked, with no useless code cluttering up the part you want to vectorize. – Peter Cordes Mar 18 '16 at 14:24
  • For the serial algorithm you have, it would be great to ensure that no branch is inserted for calculating the maximal value. Either [cmov](http://stackoverflow.com/questions/2039730/fastest-way-to-find-out-minimum-of-3-numbers) instruction or [bit magic](http://stackoverflow.com/a/15288303/556899) can be used to ensure that. – stgatilov Mar 19 '16 at 17:46
  • 1
    @PeterCordes, you should have upvoted the question. – Z boson Mar 19 '16 at 19:56
  • @PaulR, people give up too easily when seeing serial dependencies. This operation can be done with SIMD. – Z boson Mar 19 '16 at 19:59
  • 1
    @Zboson: ah - I did wonder whether the SIMD approach to prefix sum could be used here but I didn't immediately see how it would be applied. – Paul R Mar 19 '16 at 20:02
  • @PaulR, to be fair it was not easy to get. I spent too much time on this but on the other hand serial dependencies are challenging and therefore interesting. – Z boson Mar 19 '16 at 20:05
  • 1
    @PaulR, it would be interesting to benchmark this now (I am not going to do that). I expect that in the special case `c=0` the subtraction is not necessary and I think even SSE will be better (I saw that with the prefix sum). But maybe even with the subtraction SSE is better due to breaking the dependency chain. – Z boson Mar 19 '16 at 20:07
  • Yes, it would certainly be worth benchmarking this to see how significant any potential speed-up is. It would also be helpful if the OP had answered my earlier question about the nature of the data - if the cases where propagation is required are fairly sparse then a simple and efficient solution is possible of course – Paul R Mar 19 '16 at 20:12
  • 1
    @PaulR, I benchmarked the code. The SSE version is about 2.5 times faster than the serial code http://coliru.stacked-crooked.com/a/a854126814eed839. I updated my answer with the code. – Z boson Mar 20 '16 at 07:04
  • @PaulR, the reason that I used unsigned ints is because I shift in zeros and then do `max`. This does not work with negative numbers. Do you know a good way to fix this? One way, instead of shifting in zeros, is to leave the original values unchanged. For example, if I have the vector `(1,2,3,4)`, then instead of shifting right by two with zeros to get `(0,0,1,2)`, if I had `(1,2,1,2)` it would work. This would require some `and` and `blend` operations but that seems inefficient. – Z boson Mar 20 '16 at 08:53
  • @Zboson: not sure I fully understand but couldn't you use `_mm_shuffle_epi32` to get the 1,2,3,4 -> 1,2,1,2 operation ? Nice work on the 2.5x benchmarking result, BTW. – Paul R Mar 20 '16 at 09:17
  • 1
    @PaulR, I think you're right! I can use `_mm_shuffle_epi32` instead of `_mm_slli_si128` and then my method would work with signed int and float (with a few more changes) but still the same efficiency! I will have to look into that later today. – Z boson Mar 20 '16 at 09:22
  • @PaulR: it is difficult to say how often the condition will be true. It really depends on the input data, and I would have to do simulations to answer this. – Karl Forner Mar 20 '16 at 15:10
  • @KarlForner: thanks for the clarification - we'd better make no assumptions about the data then. – Paul R Mar 20 '16 at 16:22
  • 2
    @PaulR, your idea worked! I used `_mm_shuffle_epi32(x,0x90)` and `_mm_shuffle_epi32(x,0x44)` now. The reason I used `_mm_slli_si128` is because I needed to shift in zeros for the prefix sum but in this case I don't want that. Now I can use signed integers but my method requires `c>=0`. I am sure it could be fixed for `c<0` though. – Z boson Mar 20 '16 at 19:53
  • 1
    @PaulR, I updated my answer based on your suggestion (see the end of my answer). – Z boson Mar 20 '16 at 20:03
  • @PeterCordes: I think you are right, the best way would be to interleave the data array. – Karl Forner Mar 22 '16 at 14:30
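
A minimal sketch of the interleaving approach mentioned in the comments above (not from the original thread): it assumes four independent arrays stored interleaved so that a[4*i + k] holds element i of array k, so each SSE lane carries its own serial dependency and the recurrence vectorizes directly. The function name fill_interleaved4 is made up; SSE4.1 is assumed for _mm_max_epi32.

#include <x86intrin.h>

// hypothetical layout: a[4*i + k] = element i of independent sequence k (k = 0..3)
void fill_interleaved4(int *a, int n, int c) {
    __m128i vc   = _mm_set1_epi32(c);
    __m128i prev = _mm_loadu_si128((__m128i*)&a[0]);   // element 0 of all four sequences
    for (int i = 1; i < n; i++) {
        __m128i cur  = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i cand = _mm_sub_epi32(prev, vc);        // V[i-1] - c, per sequence
        cur = _mm_max_epi32(cur, cand);                // V[i] = max(V[i], V[i-1] - c)
        _mm_storeu_si128((__m128i*)&a[4*i], cur);
        prev = cur;
    }
}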

1 Answer

7

This can be done with SIMD. The solution is similar to the SIMD solution for the prefix sum.

Within a SIMD register the number of iterations goes as O(log2(simd_width)). Each iteration requires one shift, one subtraction, and one max. For example, with SSE it takes log2(4) = 2 iterations. You can apply your function to four elements like this:

__m128i foo_SSE(__m128i x, int c) {
    __m128i t, c1, c2;
    c1 = _mm_set1_epi32(c);
    c2 = _mm_set1_epi32(2*c);

    t = _mm_slli_si128(x, 4);  // shift elements up one lane: (0, x0, x1, x2)
    t = _mm_sub_epi32(t, c1);  // subtract c
    x = _mm_max_epi32(x, t);   // x[i] = max(x[i], x[i-1] - c), requires SSE4.1

    t = _mm_slli_si128(x, 8);  // shift elements up two lanes: (0, 0, y0, y1)
    t = _mm_sub_epi32(t, c2);  // subtract 2*c
    x = _mm_max_epi32(x, t);   // x[i] = max(x[i], x[i-2] - 2*c)
    return x;
}

Once you have the result for one SIMD register, you need to apply the "carry" to the next register. For example, let's say you have an array a of eight elements. You load SSE registers x1 and x2 like this:

__m128i x1 = _mm_loadu_si128((__m128i*)&a[0]);
__m128i x2 = _mm_loadu_si128((__m128i*)&a[4]);

Then, to apply your function to all eight elements, you would do:

__m128i t, s;
s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);

x1 = foo_SSE(x1,c);
x2 = foo_SSE(x2,c);
t = _mm_shuffle_epi32(x1, 0xff);  // broadcast the last element of x1
t = _mm_sub_epi32(t,s);           // subtract c, 2*c, 3*c, 4*c per lane
x2 = _mm_max_epi32(x2,t);         // apply the carry to x2
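
For example, with the data used in main() below, a = (5, 6, 7, 8, 1, 2, 3, 4) and c = 1: foo_SSE leaves both x1 = (5, 6, 7, 8) and x2 = (1, 2, 3, 4) unchanged, broadcasting the last element of x1 gives (8, 8, 8, 8), subtracting s = (1, 2, 3, 4) gives the carry (7, 6, 5, 4), and the final max yields x2 = (7, 6, 5, 4), which matches the serial result (5, 6, 7, 8, 7, 6, 5, 4).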

Note that c1, c2, and s are all constant within the loop, so they only need to be calculated once.

In general you could apply your function to an unsigned int array a like this with SSE (with n a multiple of 4):

void fill_SSE(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);
    for(int i=0; i<n/4; i++) {
        __m128i x = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i out = foo_SSE(x, c);               // propagate within the register
        out = _mm_max_epi32(out,offset);           // apply the carry from the previous register
        _mm_storeu_si128((__m128i*)&a[4*i], out);
        offset = _mm_shuffle_epi32(out, 0xff);     // broadcast the last element
        offset = _mm_sub_epi32(offset,s);          // pre-subtract c, 2*c, 3*c, 4*c for the next register
    }
}

I went ahead and profiled this SSE code. It's about 2.5 times faster than the serial version.

Another major advantage of this method, besides scaling as log2(simd_width), is that it breaks the dependency chain, so multiple SIMD operations can execute at the same time (on multiple ports) instead of waiting for the previous result. The serial code is latency bound.

The current code works for unsigned integers but you could generalize it to signed integers as well as floats.

Here is the general code I used to test this. I created a bunch of abstract SIMD functions to emulate SIMD hardware before I implemented the SSE version.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>
#include <omp.h>

__m128i foo_SSE(__m128i x, int c) {
    __m128i t, c1, c2;
    c1 = _mm_set1_epi32(c);
    c2 = _mm_set1_epi32(2*c);

    t = _mm_slli_si128(x, 4);
    t = _mm_sub_epi32(t, c1);
    x = _mm_max_epi32(x, t);

    t = _mm_slli_si128(x, 8);
    t = _mm_sub_epi32(t, c2);
    x = _mm_max_epi32(x, t);
    return x;
}

// scalar reference implementation
void foo(int *a, int n, int c) {
    for(int i=0; i<n-1; i++) {
        if(a[i]-c > a[i+1]) a[i+1] = a[i]-c;
    }
}

// the following are abstract "SIMD" building blocks used to emulate SIMD hardware
void broad(int *a, int n, int k) {
    for(int i=0; i<n; i++) a[i] = k;
}

void shiftr(int *a, int *b, int n, int m) {
    int i;
    for(i=0; i<m; i++) b[i] = a[i];
    for(; i<n; i++) b[i] = a[i-m];
}

/*
void shiftr(int *a, int *b, int n, int m) {
    int i;
    for(i=0; i<m; i++) b[i] = 0;
    for(; i<n; i++) b[i] = a[i-m];
}
*/

void sub(int *a, int n, int c) {
    for(int i=0; i<n; i++) a[i] -= c;
}


void max(int *a, int *b, int n) {
    for(int i=0; i<n; i++) if(b[i]>a[i]) a[i] = b[i];
}

void step(int *a, int n, int c) {
    for(int i=0; i<n; i++) {
        a[i] -= (i+1)*c;
    }
}

// scalar emulation of the in-register log2(width) algorithm
void foo2(int *a, int n, int c) {
    int b[n];
    for(int m=1; m<n; m*=2) {
        shiftr(a,b,n,m);
        sub(b, n, m*c);
        max(a,b,n);
        //printf("n %d, m %d; ", n,m ); for(int i=0; i<n; i++) printf("%2d ", b[i]); puts("");
    }
}

// scalar emulation of the full algorithm with a generic SIMD width w
void fill(int *a, int n, int w, int c) {
    int b[w], offset[w];
    broad(offset, w, -1000);
    for(int i=0; i<n/w; i++) {
        for(int m=1; m<w; m*=2) {
            shiftr(&a[w*i],b,w,m);
            sub(b, w, m*c);
            max(&a[w*i],b,w);
        }
        max(&a[w*i],offset,w);
        broad(offset,w,a[w*i+w-1]);
        step(offset, w, c);
    }
}


void fill_SSE(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);
    for(int i=0; i<n/4; i++) {
        __m128i x = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i out = foo_SSE(x, c);
        out = _mm_max_epi32(out,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], out);
        offset = _mm_shuffle_epi32(out, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

// like fill_SSE but with foo_SSE inlined and the loop constants hoisted
void fill_SSEv2(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/4; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i t1;

        t1 = _mm_slli_si128(x1, 4);
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_slli_si128(x1, 8);
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

// fill_SSEv2 unrolled by two SSE registers per iteration
void fill_SSEv3(int *a, int n, int c) {
    __m128i offset = _mm_setzero_si128();
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/8; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[8*i]);
        __m128i x2 = _mm_loadu_si128((__m128i*)&a[8*i+4]);
        __m128i t1, t2;

        t1 = _mm_slli_si128(x1, 4);
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_slli_si128(x1, 8);
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        t2 = _mm_slli_si128(x2, 4);
        t2 = _mm_sub_epi32 (t2, c1);
        x2 = _mm_max_epi32 (x2, t2);

        t2 = _mm_slli_si128(x2, 8);
        t2 = _mm_sub_epi32 (t2, c2);
        x2 = _mm_max_epi32 (x2, t2);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[8*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);
        offset = _mm_sub_epi32(offset,s);

        x2 = _mm_max_epi32(x2,offset);
        _mm_storeu_si128((__m128i*)&a[8*i+4], x2);
        offset = _mm_shuffle_epi32(x2, 0xff);
        offset = _mm_sub_epi32(offset,s);
    }
}

int main(void) {
    int n = 8, a[n], a1[n], a2[n];
    for(int i=0; i<n; i++) a[i] = i;

    /*
    a[0] = 1, a[1] = 0;
    a[2] = 2, a[3] = 0;
    a[4] = 3, a[5] = 13;
    a[6] = 4, a[7] = 0;
    */


    a[0] = 5, a[1] = 6;
    a[2] = 7, a[3] = 8;
    a[4] = 1, a[5] = 2;
    a[6] = 3, a[7] = 4;

    for(int i=0; i<n; i++) printf("%2d ", a[i]); puts("");
    for(int i=0; i<n; i++) a1[i] = a[i], a2[i] = a[i];

    int c = 1;
    foo(a1,n,c);
    foo2(a2,n,c);
    for(int i=0; i<n; i++) printf("%2d ", a1[i]); puts("");
    for(int i=0; i<n; i++) printf("%2d ", a2[i]); puts("");


    __m128i x1 = _mm_loadu_si128((__m128i*)&a[0]);
    __m128i x2 = _mm_loadu_si128((__m128i*)&a[4]);
    __m128i t, s;
    s = _mm_setr_epi32(c, 2*c, 3*c, 4*c);

    x1 = foo_SSE(x1,c);
    x2 = foo_SSE(x2,c);
    t = _mm_shuffle_epi32(x1, 0xff);
    t = _mm_sub_epi32(t,s);
    x2 = _mm_max_epi32(x2,t);

    int a3[8];
    _mm_storeu_si128((__m128i*)&a3[0], x1);
    _mm_storeu_si128((__m128i*)&a3[4], x2);
    for(int i=0; i<8; i++) printf("%2d ", a3[i]); puts("");

    int w = 8;
    n = w*1000;
    int f1[n], f2[n];
    for(int i=0; i<n; i++) f1[i] = rand()%1000;

    for(int i=0; i<n; i++) f2[i] = f1[i];
    //for(int i=0; i<n; i++) printf("%2d ", f1[i]); puts("");
    foo(f1, n, c);
    //fill(f2, n, 8, c);
    fill_SSEv3(f2, n, c);
    printf("%d\n", memcmp(f1,f2,sizeof(int)*n));
    for(int i=0; i<n; i++) {
        //    if(f1[i] != f2[i]) printf("%d\n", i);
    }
    //for(int i=0; i<n; i++) printf("%2d ", f1[i]); puts("");
    //for(int i=0; i<n; i++) printf("%2d ", f2[i]); puts("");

    int r = 200000;
    double dtime;
    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) fill_SSEv2(f2, n, c);
    //for(int i=0; i<r; i++) foo(f1, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) fill_SSEv3(f2, n, c);
    //for(int i=0; i<r; i++) foo(f1, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);

    dtime = -omp_get_wtime();
    for(int i=0; i<r; i++) foo(f1, n, c);
    //for(int i=0; i<r; i++) fill_SSEv2(f2, n, c);
    dtime += omp_get_wtime();
    printf("time %f\n", dtime);
}

Based on a comment by Paul R I was able to fix my function to work with signed integers. However, it requires c>=0. I am sure it could be fixed to work for c<0.

void fill_SSEv2(int *a, int n, int c) {
    __m128i offset = _mm_set1_epi32(0xf0000000);   // very negative sentinel so the first register is unaffected
    __m128i s = _mm_setr_epi32(1*c, 2*c, 3*c, 4*c);
    __m128i c1 = _mm_set1_epi32(1*c);
    __m128i c2 = _mm_set1_epi32(2*c);
    for(int i=0; i<n/4; i++) {
        __m128i x1 = _mm_loadu_si128((__m128i*)&a[4*i]);
        __m128i t1;

        t1 = _mm_shuffle_epi32(x1, 0x90);   // (x0, x0, x1, x2): duplicate lane 0 instead of shifting in zeros (safe since c >= 0)
        t1 = _mm_sub_epi32 (t1, c1);
        x1 = _mm_max_epi32 (x1, t1);

        t1 = _mm_shuffle_epi32(x1, 0x44);   // (y0, y1, y0, y1)
        t1 = _mm_sub_epi32 (t1, c2);
        x1 = _mm_max_epi32 (x1, t1);

        x1 = _mm_max_epi32(x1,offset);
        _mm_storeu_si128((__m128i*)&a[4*i], x1);
        offset = _mm_shuffle_epi32(x1, 0xff);   // broadcast the last element as the carry
        offset = _mm_sub_epi32(offset,s);
    }
}

This method should be easily extended to floats now.
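
For example, a float version along the same lines might look like the sketch below. This is my own untested adaptation rather than part of the original answer: the name fill_SSE_ps is made up, it assumes c >= 0 as above, and it uses -FLT_MAX as the very negative sentinel for the initial carry.

#include <float.h>
#include <x86intrin.h>

void fill_SSE_ps(float *a, int n, float c) {
    __m128 offset = _mm_set1_ps(-FLT_MAX);     // very negative sentinel for the first register
    __m128 s  = _mm_setr_ps(1*c, 2*c, 3*c, 4*c);
    __m128 c1 = _mm_set1_ps(1*c);
    __m128 c2 = _mm_set1_ps(2*c);
    for (int i = 0; i < n/4; i++) {
        __m128 x = _mm_loadu_ps(&a[4*i]);
        __m128 t;

        t = _mm_shuffle_ps(x, x, 0x90);        // (x0, x0, x1, x2)
        t = _mm_sub_ps(t, c1);
        x = _mm_max_ps(x, t);

        t = _mm_shuffle_ps(x, x, 0x44);        // (y0, y1, y0, y1)
        t = _mm_sub_ps(t, c2);
        x = _mm_max_ps(x, t);

        x = _mm_max_ps(x, offset);
        _mm_storeu_ps(&a[4*i], x);
        offset = _mm_shuffle_ps(x, x, 0xff);   // broadcast the last element
        offset = _mm_sub_ps(offset, s);
    }
}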

Z boson
  • 1
    `foo_SSE` is 9 insn instead of 11 if you do [`c2 = _mm_add_epi32(c1,c1);`](http://goo.gl/T58dsL). Using a separate `set1` results in a separate movd/broadcast with gcc and clang. (With only one set1, gcc makes the odd choice of storing and using AVX1 `vbroadcastss xmm, mem` even with `-mavx2` where it could do what clang does: `vmovd xmm, edi` / `vpbroadcastd xmm,xmm`. It does that when it needs to do two broadcasts [(but with `vpshufd` instead of `vpbroadcastd`)](http://goo.gl/a4XQcI)) – Peter Cordes Mar 19 '16 at 20:51
  • 2
    @PeterCordes, `c1` and `c2` are constants within the loop so they don't have to be recalculated every iteration `foo_SSE` is called in `fill_SSE`. I only put them in `foo_SSE` because it was the first pass that got the code working. If I benchmarked the code I would pull them out (or make them const or whatever it took to convince the compiler to only do them once). Also if I benchmarked the code I would unroll the loop in `fill_SSE` as well. – Z boson Mar 19 '16 at 21:21
  • 1
    @PeterCordes, I updated my code the SSE version is about 2.5 times faster than the serial code http://coliru.stacked-crooked.com/a/a854126814eed839 . However, unrolling the loop did not help. I think the dependency is only on `offset` and since there are so it's not a big deal. – Z boson Mar 20 '16 at 07:06
  • 2
    Cool. I'll have to remember that there's a useful technique for serial dependencies like this. Very nice answer. In my last comment, I'd forgotten the details of the original problem, and was just looking at your building-block function, obviously forgetting that the c2 broadcast could and would be hoisted out of the loop. – Peter Cordes Mar 20 '16 at 08:28
  • 1
    thanks a lot for this answer. I'll take some time tomorrow to study and test it. – Karl Forner Mar 20 '16 at 15:11
  • 1
    @KarlForner, what type are you interested in? Unsigned int, signed int, float, double,...? I just fixed my function to work with signed integers however it requires `c>0`. – Z boson Mar 20 '16 at 19:51
  • @Zboson: nice work! I guess this general approach might work for a general class of algorithms such as prefix sum etc, where there is serial dependency. – Paul R Mar 20 '16 at 22:16
  • @PaulR, you have read my thoughts. I have been thinking the same thing for a while. I would like to ask a question about it but it's hard to make a question that's not too broad or too much about math to get past the SO filter. But I think it's certainly an interesting question. It's sorta a question about decomposing dependencies into parts that are not dependent. This is possible with some operations such as the prefix sum but not with others such as the Fibonacci sequence (at least not that I am aware of). So it would be interesting to know if there is a way to determine this. – Z boson Mar 21 '16 at 12:25
  • 1
    I have a feeling that this may come under the general subject category of "affine loop transformations", since this deals with restructuring loops and dealing with dependencies. Might be worth asking on cstheory.SE rather than here, but as you say, hard to put together a suitably specific question. – Paul R Mar 21 '16 at 12:33
  • 2
    @PaulR, on affine loop transforms I found this http://www.sciencedirect.com/science/article/pii/S0010465501002326 What is Computer Physics? That sounds like what I should be doing as I am a physicist who now thinks more about computers. – Z boson Mar 21 '16 at 13:26
  • @Zboson: thanks a lot. On my platform, the gain is only by 25% (Xeon X5570@2.93GHz, gcc 4.8.4) – Karl Forner Mar 21 '16 at 14:26
  • @Zboson: the most interesting types for me would be unsigned ints on one byte (i.e. 16 uints per vector) – Karl Forner Mar 21 '16 at 14:27
  • 2
    @KarlForner, you should have stated you are interested in bytes. That makes a big difference, especially as it goes as log2(simd_width), which for SSE with bytes is log2(16)=4. See [this answer](http://stackoverflow.com/a/10589306/2542702). The fact that you only see a 25% gain could be for many other reasons. I got a 2.5x gain by only looking at arrays that fit in the L1 cache. Your function becomes memory bandwidth bound as the size increases. – Z boson Mar 21 '16 at 14:33
  • @zboson: I got 25% running your version from the post or from http://coliru.stacked-crooked.com/a/a854126814eed839 – Karl Forner Mar 21 '16 at 14:53
  • 1
    @KarlForner, you mean you ran my exact code on your computer? What compile options did you use? Did you compile with `-O3 -msse4.1 -fopenmp`? Your CPU is based on Nehalem. The same computer as Coliru I think. It makes no sense you would only get 25% for similar hardware. – Z boson Mar 21 '16 at 15:20
  • 1
    @KarlForner, btw, there was a small (but critical) bug in my code on coliru. I was reading one value past the array in `foo`. I should have read to `n-1` but I read to `n`. Here is the correct version http://coliru.stacked-crooked.com/a/f73e24b92a200f65 – Z boson Mar 21 '16 at 15:22
  • 1
    @Zboson: I'd never heard of "computer physics" either, but apparently there is a journal: https://en.wikipedia.org/wiki/Computer_Physics_Communications – Paul R Mar 21 '16 at 15:25
  • @Zboson: using your latest code from http://coliru.stacked-crooked.com/a/f73e24b92a200f65, compiling: gcc -O3 -msse4.1 -fopenmp coliru_latest.cpp -o coliru_latest: time 0.906254 time 0.785428 time 0.990433 – Karl Forner Mar 21 '16 at 16:37
  • @KarlForner, makes no sense to me. This is strange. I even see a significant speed up on my haswell system as well. BTW, this is C code not C++ (I am using VLA which C++ does not support). But this should not matter (BTW you can compile with -x c to force a .cpp to be treated as C). – Z boson Mar 21 '16 at 17:56
  • @KarlForner, what OS are you running? Your GCC is a bit old but I highly doubt that matters either. Are you testing on a virtual machine? Can you find another system to test on? – Z boson Mar 21 '16 at 17:58
  • @KarlForner, you could try using aligned loads and stores (e.g. change `_mm_loadu_si128` to `_mm_load_si128`), but since your Xeon processor is based on Nehalem this should not matter. – Z boson Mar 21 '16 at 18:07
  • @Zboson: I tried on a macbook pro running ubuntu and gcc 4.8.4: time 0.702583 time 0.677673 time 1.064852 – Karl Forner Mar 22 '16 at 12:18
  • @Zboson: Os=Ubuntu 14.04.3 bare metal. I've tried aligning f2[] (int f2[n] __attribute__((aligned(16)));) and replacing the loadu by load and storeu with store, but no effects on the timings. – Karl Forner Mar 22 '16 at 12:46
  • @KarlForner on my haswell system (Linux GCC 5.3 running in Virtual Box under Windows 10) I get time 0.651753 time 0.613658 time 1.037680. That's more consistent with what you're seeing. I am starting to suspect coliru. On the other hand at least I see that unrolling the loop (on my system and from the numbers you have reported) does help some like I expected. – Z boson Mar 22 '16 at 12:55
  • @KarlForner, strange, I unrolled four times and now the sequential version goes much slower: time 0.652624, time 0.623082, time 0.614802, time 1.458644. Something is fishy. – Z boson Mar 22 '16 at 13:09
  • @KarlForner, okay, when I timed `foo` it writes to `f1` so maybe it was not in the cache. When I tell `foo` to write to `f2` like the other tests then the times are correct again: time 0.613808 time 0.631658 time 0.612398 time 1.014265. However, this did not help coliru. It's still more than 2.5x faster on coliru. Benchmarking can be a pain... – Z boson Mar 22 '16 at 13:13
  • @PeterCordes, we are seeing some big differences between coliru and other systems we test on. I have seen coliru up to 3x faster but on other systems it can be less than 50%. Do you have any clue why? http://coliru.stacked-crooked.com/a/fbe547bfd7c4f3aa – Z boson Mar 22 '16 at 14:28
  • @Zboson: I'll have a look soon. Haven't had a chance to dig into it. I did find out that the OP's Xeon [is a Nehalem](http://www.cpu-world.com/CPUs/Xeon/Intel-Xeon%20X5570%20-%20AT80602000765AA%20(BX80602X5570).html). – Peter Cordes Mar 23 '16 at 11:31
  • @Zboson: That benchmark on coliru runs 200k iterations on `int f2[4000]`, with no alignment requested. That does fit in L1 cache. It's probably slow enough that unaligned inputs don't bottleneck on loads/stores (few memory accesses per clock). It looks like we're getting a ~2.5x speedup from SSE relative to the scalar version on Coliru, but IDK if you should trust wall-clock times on a server like Coliru. My SnB is broken, so I could only try this myself on a Core2, which probably isn't useful. Can you summarize exactly what still needs explaining? Specific A and B results to compare? – Peter Cordes Mar 29 '16 at 06:03
  • @PeterCordes, thank you for looking into this!!! Since Coliru is the outlier I am inclined to exclude it. That's too bad because then the speed up is not as significant but nevertheless it's still outside the margin of error so clearly an improvement. I don't think unaligned loads are an issue. The arrays are stack allocated so they should be 16-byte aligned and since Nehalem unaligned instructions are the same speed as the aligned instructions on aligned memory (the unaligned instructions can't fold and fuse though). – Z boson Mar 29 '16 at 07:17
  • I notice that gcc5.3's code for the scalar version has a loop-carried dep chain that includes a store-reload when the `if()` condition is true. Unlike the SIMD version, there's no store at all when the condition is false, breaking the dep chain. (The SIMD version keeps the old value in a register, of course). Your array is probably too big for the branch-predictor to train itself on that pattern and get good prediction rates, so IDK why it isn't bad. I would have expected branch mispredicts to hurt the scalar version. What kind of uops per clock are you seeing with each on Haswell? – Peter Cordes Mar 29 '16 at 07:34
  • 1
    Actually, what happens on **repeated runs over the same array**? Eventually the data settles down so the branch is always not-taken, right? I think this is why the scalar version looks good in benchmarks: you're repeatedly applying the in-place modification to the same array for many repeats after the task has become trivial. I bet with a `memcpy(f2,f1, size)` inside the timed loop, you'd see an advantage for the SSE version. With your current code, running just the scalar loop, I get a branch mispredict rate of 0.01% for the whole 1.5s benchmark, and 3.15 insn per cycle (core2) – Peter Cordes Mar 29 '16 at 07:42
  • @PeterCordes, if I put `memcpy(f2,f1, size)` in my timing loop then I will be timing the memcpy as well. That would be fine to show the absolute time difference but what about the relative time difference? I mean if `time1 = a + b` and `time2 = a + c` where `a` is the `memcpy` time and `b` and `c` are the SSE and scalar times. Then `time1 - time2 = b - c ` so the `a` cancels out to give me the absolute time difference between the SSE and scalar versions. But what if I want e.g. `(b-c)/b`? – Z boson Mar 29 '16 at 07:54
  • @PeterCordes, I guess I could measure the `a` separately. – Z boson Mar 29 '16 at 08:01
  • @Zboson: Yes, I know you'd be timing `memcpy`. Subtract out the time for *just* the memcpy to get an approximation. Or rewrite your functions to not operate in-place. The scalar version uses a [`cmov` and keeps `a[i]` in a register across iterations if the store is unconditional](https://godbolt.org/g/1qg2XX), which is better anyway unless you're expecting no stores. The always-store version is one-load one-store, instead of two loads and one optional store per clock, which may help on pre-SnB hardware with only one load port. (but it won't sustain one iteration per clock anyway). – Peter Cordes Mar 29 '16 at 08:07
  • @PeterCordes, your Core2 system would fine if it has SSE4.1. Even if it does `_mm_shuffle_epi32` is only used one per iteration which you could emulate with a few more instructions. – Z boson Mar 29 '16 at 08:24
  • It's a Conroe, so it's SSSE3 with slow shuffles. If the SSE loop doesn't fit in 64B, then decode will probably be the bottleneck, since the loop buffer is before the decoders (after pre-decode insn-length marking). Too different to be worth spending time on, and worse perf counters. Anyway, I made some [not-in-place scalar versions that all compile essentially the same](https://godbolt.org/g/w9517t). They'll be slower than the numbers you were getting for the original scalar version, because those original numbers are for the case where there's no work and the branch predicts perfectly. – Peter Cordes Mar 29 '16 at 08:32
  • @PeterCordes, I tried your `bar` function on Coliru and the SSE version is only twice as fast now. On my Haswell system it's about 66% faster so it's more in the ballpark now. I'll try your `memcpy` idea soon. – Z boson Mar 29 '16 at 08:34
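
A minimal sketch of the benchmarking fix Peter Cordes suggests above: restore the working array from a pristine copy inside the timed loop, so repeated in-place runs do not degenerate into the trivially predicted "branch never taken" case. The bench helper below is hypothetical (not from the thread); the memcpy cost would have to be timed separately and subtracted to isolate the kernel cost.

#include <string.h>
#include <omp.h>

// time one kernel (e.g. foo or fill_SSEv2 from above) over reps repetitions,
// resetting the working buffer from src before every run
double bench(void (*kernel)(int*, int, int), const int *src, int *work,
             int n, int c, int reps) {
    double t = -omp_get_wtime();
    for (int i = 0; i < reps; i++) {
        memcpy(work, src, sizeof(int) * n);   // restore the original input (also timed)
        kernel(work, n, c);
    }
    t += omp_get_wtime();
    return t;
}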