
I'm looking for some advice on how to do a parallel prefix sum with SSE. I'm interested in doing this on an array of ints, floats, or doubles.

I have come up with two solutions: a special case and a general case. In both cases the solution runs over the array in two passes in parallel with OpenMP. For the special case I use SSE on both passes. For the general case I use it only on the second pass.

My main question is: how can I use SSE on the first pass in the general case? The linked question simd-prefix-sum-on-intel-cpu shows an improvement for bytes, but not for 32-bit data types.

The reason the special case is called special is that it requires the array to be in a special format. For example, let's assume there are only 16 elements in an array `a` of floats. Then if the array is rearranged like this (array of structs to struct of arrays):

a[0] a[1] ...a[15] -> a[0] a[4] a[8] a[12] a[1] a[5] a[9] a[13]...a[3] a[7] a[11] a[15]

SSE vertical sums could be used on both passes. However, this would only be efficient if the array were already in the special format and the output could be used in the special format. Otherwise an expensive rearrangement would have to be done on both the input and the output, which would make it much slower than the general case.
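
To make the special case concrete, here is a minimal single-threaded sketch for floats (my own illustration, not benchmarked code), assuming the transposed array `t` satisfies `t[4*j + k] == a[k*(n/4) + j]`. Pass 1 is a running vertical sum that builds the prefix sum of each quarter in its own SSE lane, and pass 2 adds a per-lane offset formed from the quarter totals:

#include <xmmintrin.h>

// Prefix sum in the "special format": t[4*j + k] holds a[k*(n/4) + j].
// n is assumed to be a multiple of 4; the result stays in the special format.
void prefix_sum_special(float *t, int n) {
    const int m = n / 4;                  // elements per quarter
    // Pass 1: running vertical sum; lane k accumulates quarter k.
    __m128 acc = _mm_setzero_ps();
    for (int j = 0; j < m; j++) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(&t[4 * j]));
        _mm_storeu_ps(&t[4 * j], acc);
    }
    // acc now holds the four quarter totals.
    float totals[4];
    _mm_storeu_ps(totals, acc);
    // Lane k needs the sum of the totals of quarters 0..k-1 added to it.
    __m128 offset = _mm_setr_ps(0.0f,
                                totals[0],
                                totals[0] + totals[1],
                                totals[0] + totals[1] + totals[2]);
    // Pass 2: add the per-quarter offsets, again purely vertical.
    for (int j = 0; j < m; j++)
        _mm_storeu_ps(&t[4 * j], _mm_add_ps(_mm_loadu_ps(&t[4 * j]), offset));
}

Both passes are nothing but vertical adds, which is why this variant only pays off when the data already lives in (and can stay in) this layout.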

Maybe I should consider a different algorithm for the prefix sum (e.g. a binary tree)?

Code for the general case:

#include <omp.h>
#include <emmintrin.h>

// Two-pass parallel prefix sum. Assumes n is a multiple of 4*nthreads (so the
// static schedules of the two passes partition the elements identically and
// there is no scalar remainder) and that s is 16-byte aligned for
// _mm_load_pd/_mm_store_pd.
void prefix_sum_omp_sse(double a[], double s[], int n) {
    double *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new double[nthreads + 1];
            suma[0] = 0;
        }
        double sum = 0;
        #pragma omp for schedule(static) nowait // first parallel pass: per-thread partial scans
        for (int i = 0; i < n; i++) {
            sum += a[i];
            s[i] = sum;
        }
        suma[ithread + 1] = sum;
        #pragma omp barrier
        #pragma omp single
        {
            // exclusive scan of the per-thread totals
            double tmp = 0;
            for (int i = 0; i < (nthreads + 1); i++) {
                tmp += suma[i];
                suma[i] = tmp;
            }
        }
        __m128d offset = _mm_set1_pd(suma[ithread]);
        #pragma omp for schedule(static) // second parallel pass, with SSE
        for (int i = 0; i < n/4; i++) {
            __m128d tmp1 = _mm_load_pd(&s[4*i]);
            tmp1 = _mm_add_pd(tmp1, offset);
            __m128d tmp2 = _mm_load_pd(&s[4*i+2]);
            tmp2 = _mm_add_pd(tmp2, offset);
            _mm_store_pd(&s[4*i], tmp1);
            _mm_store_pd(&s[4*i+2], tmp2);
        }
    }
    delete[] suma;
}
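
For completeness, here is a minimal driver (my sketch, not part of the original question). Because pass 2 uses `_mm_load_pd`/`_mm_store_pd`, the arrays must be 16-byte aligned, which a plain `new double[n]` does not guarantee; the sketch uses `_mm_malloc` (available with the intrinsics headers on most compilers, otherwise via `<malloc.h>` on MSVC or `<mm_malloc.h>` with GCC):

#include <cstdio>

int main() {
    const int n = 1 << 20;                                    // a multiple of 4*nthreads
    double *a = (double*)_mm_malloc(n * sizeof(double), 16);  // 16-byte aligned for SSE
    double *s = (double*)_mm_malloc(n * sizeof(double), 16);
    for (int i = 0; i < n; i++) a[i] = 1.0;
    prefix_sum_omp_sse(a, s, n);
    printf("%f %f\n", s[0], s[n - 1]);                        // expect 1 and n
    _mm_free(a);
    _mm_free(s);
    return 0;
}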
  • I thought a compiler like gcc/icc could auto-vectorize the second part, so that you don't need to use SIMD intrinsics. Do you get a performance improvement compared to plain C code with compiler options like `-msse2`? – kangshiyin Oct 21 '13 at 12:25
  • They might. I ran this on MSVC2013. It does not auto-vectorize the second pass. My experience with MSVC is that when you use OpenMP you have to do the vectorization yourself. I don't think any of them will unroll the loop with the SSE code for you, but it does not help in this case anyway. – Z boson Oct 21 '13 at 12:28
  • In response to the question on performance, the general code I posted is over 3 times faster than the sequential code in release mode with AVX enabled on my 4 core ivy bridge system. The time cost should be `n/ncores*(1+1/SIMD_width)`. So for 4 cores and SIMD_width=2 (double) that should be 3n/8. That's about a 2.7 times speed up. Hyper-threading helps a bit so I guess that's pushing it over 3 (I'm using 8 threads. When I try 4 threads the performance drops a bit). – Z boson Oct 21 '13 at 13:39
  • You might want to mention that the input and output arrays need to be 16-byte aligned due to the use of `_mm_load_ps`, but a `float *` will in the general case only be 4-byte aligned. – BeeOnRope Aug 13 '19 at 19:49
  • Related: https://www.intel.com/content/www/us/en/developer/articles/technical/optimize-scan-operations-explicit-vectorization.html is an Intel article on doing it with AVX-512, manual vectorization or `#pragma omp scan inclusive(varname)` which ICC supports. Unrolling over multiple vectors allows you more room to hide latency (especially FP latency) and create instruction-level parallelism with your shuffling. – Peter Cordes Jul 18 '22 at 18:41

1 Answer


This is the first time I'm answering my own question, but it seems appropriate. Based on hirschhornsalz's answer for a prefix sum on 16 bytes (simd-prefix-sum-on-intel-cpu), I have come up with a solution for using SIMD on the first pass for 4, 8, and 16 32-bit words.

The general theory goes as follows. A sequential scan of n words takes n additions (n-1 to scan the n words and one more to carry in the total from the previous set of words scanned). However, using SIMD, n words can be scanned in log2(n) additions and an equal number of shifts, plus one more addition and a broadcast to carry in the total from the previous SIMD scan. So for some value of n the SIMD method will win.

Let's look at 32-bit words with SSE, AVX, and AVX-512:

4 32-bit words (SSE):      2 shifts, 3 adds, 1 broadcast       sequential: 4 adds
8 32-bit words (AVX):      3 shifts, 4 adds, 1 broadcast       sequential: 8 adds
16 32-bit words (AVX-512): 4 shifts, 5 adds, 1 broadcast       sequential: 16 adds

Based on that it appears SIMD won't be useful for a scan for 32-bit words until AVX-512. This also assumes that the shifts and broadcast can be done in only 1 instruction. This is true for SSE but not for AVX and maybe not even for AVX2.

In any case I put together some working and tested code which does a prefix sum using SSE.

inline __m128 scan_SSE(__m128 x) {
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 4))); 
    x = _mm_add_ps(x, _mm_castsi128_ps(_mm_slli_si128(_mm_castps_si128(x), 8)));
    return x;
}

// n assumed to be a multiple of 4; a and s 16-byte aligned for _mm_load_ps/_mm_store_ps
void prefix_sum_SSE(float *a, float *s, const int n) {
    __m128 offset = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 x = _mm_load_ps(&a[i]);
        __m128 out = scan_SSE(x);
        out = _mm_add_ps(out, offset);
        _mm_store_ps(&s[i], out);
        // carry: broadcast the last element of this vector to every lane
        offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
    }
}

Notice that the scan_SSE function has two additions (_mm_add_ps) and two shifts (_mm_slli_si128). The casts are only used to make the compiler happy and don't get converted to instructions. Then inside the main loop over the array in prefix_sum_SSE another addition and one shuffle is used. That's 6 operations total compared to only 4 additions with the sequential sum.

Here is a working solution for AVX:

inline __m256 scan_AVX(__m256 x) {
    __m256 t0, t1;
    //shift1_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(2, 1, 0, 3));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x11));
    //shift2_AVX + add
    t0 = _mm256_permute_ps(x, _MM_SHUFFLE(1, 0, 3, 2));
    t1 = _mm256_permute2f128_ps(t0, t0, 41);
    x = _mm256_add_ps(x, _mm256_blend_ps(t0, t1, 0x33));
    //shift3_AVX + add
    x = _mm256_add_ps(x,_mm256_permute2f128_ps(x, x, 41));
    return x;
}

void prefix_sum_AVX(float *a, float *s, const int n) {
    __m256 offset = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 x = _mm256_loadu_ps(&a[i]);
        __m256 out = scan_AVX(x);
        out = _mm256_add_ps(out, offset);
        _mm256_storeu_ps(&s[i], out);
        //broadcast last element
        __m256 t0 = _mm256_permute2f128_ps(out, out, 0x11);
        offset = _mm256_permute_ps(t0, 0xff);
    }   
}

The three shifts need 7 intrinsics. The broadcast needs 2 intrinsics. So with the 4 additions that's 13 intrinsics. For AVX2 only 5 intrinsics are needed for the shifts, so 11 intrinsics in total. The sequential sum only needs 8 additions. Therefore it is likely that neither AVX nor AVX2 will be useful for the first pass.
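
For reference, here is one possible AVX2 formulation of the scan (my sketch, not benchmarked, and not the exact formulation counted above). `_mm256_slli_si256` only shifts within 128-bit lanes, so after the two in-lane shift+add steps one cross-lane permute plus an in-lane broadcast carries the low-lane total into the high lane. The main loop of `prefix_sum_AVX` can be reused unchanged, just calling `scan_AVX2` instead:

inline __m256 scan_AVX2(__m256 x) {
    // in-lane Hillis-Steele scan (each 128-bit lane is scanned independently)
    x = _mm256_add_ps(x, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 4)));
    x = _mm256_add_ps(x, _mm256_castsi256_ps(_mm256_slli_si256(_mm256_castps_si256(x), 8)));
    // carry the low-lane total into the high lane:
    // t = {0, 0, 0, 0 | scan of x0..x3}, then broadcast element 3 of each lane
    __m256 t = _mm256_permute2f128_ps(x, x, 0x08);
    t = _mm256_shuffle_ps(t, t, _MM_SHUFFLE(3, 3, 3, 3));
    return _mm256_add_ps(x, t);
}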

Edit:

So I finally benchmarked this and the results are unexpected. The SSE and AVX code are both about twice as fast as the following sequential code:

void scan(float a[], float s[], int n) {
    float sum = 0;
    for (int i = 0; i<n; i++) {
        sum += a[i];
        s[i] = sum;
    }
}

I guess this is due to instruction-level parallelism.

So that answers my own question. I succeeded in using SIMD for pass 1 in the general case. When I combine this with OpenMP on my 4-core Ivy Bridge system the total speed-up is about 7x for 512k floats.
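
To spell out the combination, here is a sketch (my illustration of the idea, not the exact code I benchmarked) of the two-pass OpenMP scheme from the question with `scan_SSE` used for pass 1, for floats. It assumes n is a multiple of 4*nthreads and that `a` and `s` are 16-byte aligned; each thread carries its own running offset and publishes its block total:

void prefix_sum_omp_sse_pass1(float *a, float *s, int n) {
    float *suma;
    #pragma omp parallel
    {
        const int ithread = omp_get_thread_num();
        const int nthreads = omp_get_num_threads();
        #pragma omp single
        {
            suma = new float[nthreads + 1];
            suma[0] = 0;
        }
        // Pass 1: SIMD partial scan of this thread's contiguous block.
        __m128 offset = _mm_setzero_ps();
        #pragma omp for schedule(static) nowait
        for (int i = 0; i < n/4; i++) {
            __m128 x = _mm_load_ps(&a[4*i]);
            __m128 out = _mm_add_ps(scan_SSE(x), offset);
            _mm_store_ps(&s[4*i], out);
            offset = _mm_shuffle_ps(out, out, _MM_SHUFFLE(3, 3, 3, 3));
        }
        suma[ithread + 1] = _mm_cvtss_f32(offset);  // this thread's block total
        #pragma omp barrier
        #pragma omp single
        {
            // exclusive scan of the per-thread totals
            float tmp = 0;
            for (int i = 0; i < (nthreads + 1); i++) {
                tmp += suma[i];
                suma[i] = tmp;
            }
        }
        // Pass 2: add the carried-in offset, as in the question's code.
        __m128 carry = _mm_set1_ps(suma[ithread]);
        #pragma omp for schedule(static)
        for (int i = 0; i < n/4; i++) {
            _mm_store_ps(&s[4*i], _mm_add_ps(_mm_load_ps(&s[4*i]), carry));
        }
    }
    delete[] suma;
}

The structure (and the schedule(static) assumption) is exactly that of prefix_sum_omp_sse in the question; only the scalar inner loop of pass 1 has been replaced by the SIMD scan.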

  • I bet you'd get less speedup with integers. FP add has 3 cycle latency (4 on Skylake), which is the limiting factor for the simple sequential loop. The sequential integer loop should sustain one store per clock, because that's the bottleneck. There's also a parallel algorithm which doesn't lend itself to SIMD very well (linked on the other question already, I see). http://http.developer.nvidia.com/GPUGems3/gpugems3_ch39.html. I was thinking about starting to apply their first step with SIMD vectors, using PHADD. (One of the rare uses for PHADD with two different args!) – Peter Cordes Sep 10 '15 at 04:08
  • @PeterCordes - I measured the speedup with integers: about 0.75 cycles/uint32_t versus 1.00 theoretical best for scalar (unless you try some SWAR stuff in scalar to get down to 1 store per 2 elems). So yeah, the speedup is a lot less, but still beats scalar. – BeeOnRope Aug 15 '19 at 05:06
  • @BeeOnRope: Was randomly thinking about this: would it help to consider a chunk of 2 or 4 vectors worth of data together? You can sum them vertically and do one hsum to get a starting point for the next chunk, creating ILP across chunks for the code that sorts out the results within a chunk. – Peter Cordes Apr 04 '22 at 22:27
  • @PeterCordes - ha I have a 50% done blog post sitting in purgatory since 2019 (when I wrote that) which looks at stuff like this. IIRC the best was exactly something like this: doing 1 vector has a lot of p5 pressure on Skylake-ish (though you can move some work to p01), so you only need to overlap something like 2 adjacent vectors to get to port limited. One way is to as you suggest, another way is 2 passes (e.g., do the prefix for 1 vector then a second pass using the highest element in each vector to calculate the full sum and update each vector-sized chunk. – BeeOnRope Apr 04 '22 at 23:05
  • @BeeOnRope: Z boson's other answer on the subject ([SIMD prefix sum on Intel cpu](https://stackoverflow.com/a/19496697)) shows a 2-pass way; helpful for threading. Within a single thread, if using multiple passes you'd want to cache-block it into maybe 4k or 8k chunks to get L1d hits. Maybe even smaller to allow more overlap between phases, if that doesn't cost too much in branch misses... – Peter Cordes Apr 05 '22 at 02:55