Efficient stable sum of a sorted array in AVX2

Question

Consider a sorted (ascending) array of double numbers. For numerical stability the array should be summed up as if iterating it from the beginning till the end, accumulating the sum in some variable.

How to vectorize this efficiently with AVX2?

I've looked into this method Fastest way to do horizontal vector sum with AVX instructions , but it seems quite tricky to scale it to an array (some divide&conquer approach may be needed), while keeping the floating-point precision by ensuring that small numbers are summed up before adding them to a larger number.

Clarification 1: I think it should be ok to e.g. sum the first 4 items, then add them to the sum of the next 4 items, etc. I'm willing to trade some stability for performance. But I would prefer a method that doesn't ruin the stability completely.

Clarification 2: memory shouldn't be a bottleneck because the array is in L3 cache (but not in L1/L2 cache, because pieces of the array were populated from different threads). I wouldn't like to resort to Kahan summation because I think it's really the number of operations that matters, and Kahan summation would increase it about 4 times.

Doesn't summing in ascending order kill every room for parallelization? Since FP arithmetic is not associative and you are talking explicitly about numeric stability, I believe you want a serial sum from the first to the last item. Not my DV btw. — Margaret Bloom, Sep 08 '17 at 15:35
@MargaretBloom, I've updated the question with a clarification for this. — Serge Rogatch, Sep 08 '17 at 15:45
Hmm, if you're willing to trade some stability, then a simple parallel sum works, doesn't it? You sum d[0]+d[4]+d[8]+.., d[1]+d[5]+d[9]+..., etc. — geza, Sep 08 '17 at 16:27
@geza, that seems too much loss of precision: it's too bad to add distant numbers. — Serge Rogatch, Sep 08 '17 at 16:48
@SergeRogatch: Supposedly you have a lot of numbers (or else there's no need to use SIMD). In this case, the numbers added should not be too distant. I think that you should actually check out the result of my recommendation, to see that what amout of precision you lose. — geza, Sep 08 '17 at 16:56
@geza, they are `double` numbers for the whole range of exponents (2046 or so). Each item in the array is a bucket by exponent. So the numbers with distance 4 array items differ in exponent by 4 (therefore 16 times). What you offer is a systematic loss of 2 bits of precision. — Serge Rogatch, Sep 08 '17 at 17:00
@SergeRogatch: I see. I thought that you have much more numbers than several hundreds. But in this case you can just sum the last 50-60 numbers, and the result will be the same. — geza, Sep 08 '17 at 17:02

Peter Cordes · Answer 1 · 2022-10-30T03:39:35.503

If you need precision and parallelism, use Kahan summation or another error-compensation technique to let you reorder your sum (into SIMD vector element strides with multiple accumulators).

As Twofold fast summation - Evgeny Latkin points out, if you bottleneck on memory bandwidth, an error-compensated sum isn't much slower than a max-performance sum, since the CPU has lots of computation throughput that goes unused in a simply-parallelized sum that bottlenecks on memory bandwidth

See also (google results for kahan summation avx)

https://github.com/rreusser/summation-algorithms
https://scicomp.stackexchange.com/questions/10869/which-algorithm-is-more-accurate-for-computing-the-sum-of-a-sorted-array-of-numb
Is this way of processing the tail of an array with SSE overkill? has a sample implementation of SSE Kahan, not unrolled, and a comparison of actual error with it (no error) vs. sequential sum (bad) vs. simple SIMD sum (much less total error), showing that just vectorizing (and/or unrolling) with multiple accumulators tends to help accuracy.

Re: your idea: Summing groups of 4 numbers in-order will let you hide the FP-add latency, and bottleneck on scalar add throughput.

Doing horizontal sums within vectors takes a lot of shuffling, so it's a potential bottleneck. You might consider loading a0 a1 a2 a3, then shuffling to get a0+a1 x a2+a3 x, then (a0+a1) + (a2+a3). You have a Ryzen, right? The last step is just a vextractf128 down to 128b. That's still 3 total ADD uops, and 3 shuffle uops, but with fewer instructions than scalar or 128b vectors.

Your idea is very similar to Pairwise Summation

You're always going to get some rounding error, but adding numbers of similar magnitude minimizes it.

See also Simd matmul program gives different numerical results for some comments about Pairwise Summation and simple efficient SIMD.

The difference between adding 4 contiguous numbers vs. vertically adding 4 SIMD vectors is probably negligible. SIMD vectors give you small strides (of SIMD vector width) in the array. Unless the array grows extremely quickly, they're still going to have basically similar magnitudes.

You don't need to horizontal sum until the very end to still get most of the benefit. You can maintain 1 or 2 SIMD vector accumulators while you use more SIMD registers to sum short runs (of maybe 4 or 8 SIMD vectors) before adding into the main accumulators.

In fact having your total split more ways (across the SIMD vector elements) means it doesn't grow as large. So it helps with exactly the problem you're trying to avoid, and horizontal summing down to a single scalar accumulator actually makes things worse, especially for a strictly sorted array.

With out-of-order execution, you don't need very many tmp accumulators to make this work and hide the FP-add latency of accumulating into the main accumulators. You can do a couple groups of accumulating into a fresh tmp = _mm_load_ps() vector and adding that to the total, and OoO exec will overlap those executions. So you don't need a huge unroll factor for your main loop.

But it shouldn't be too small, you don't want to bottleneck on the add latency into the main accumulator, waiting for the previous add to produce a result before the next one can start. You want to bottleneck on FP-add throughput. (Or if you care about Broadwell/Haswell and you don't totally bottleneck on memory bandwidth, mix in some FMA with a 1.0 multiplier to take advantage of that throughput.)

e.g. Skylake SIMD FP add has 4 cycle latency, 0.5 cycle throughput, so you need to be doing at least 7 adds that are part of a short dep chain for every add into a single accumulator. Preferably more, and/or preferably with 2 long-term accumulators to better absorb bubbles in scheduling from resource conflicts.

See _mm256_fmadd_ps is slower than _mm256_mul_ps + _mm256_add_ps? for more about multiple accumulators.

@SergeRogatch: Have a look at that paper I linked first. There may be options faster than Kahan but numerically better than plain multi-accumulator SIMD. Ryzen only has about half the FP throughput of an Intel CPU, so yeah, you could easily be limited by FP throughput with data coming from L3. Some kind of pairwise summation is probably good. With well-chosen shuffles, you can probably get good results. — Peter Cordes, Sep 08 '17 at 18:32

score 0 · Answer 2 · answered Sep 09 '17 at 10:01

Here's my solution so far:

double SumVects(const __m256d* pv, size_t n) {
  if(n == 0) return 0.0;
  __m256d sum = pv[0];
  if(n == 1) {
    sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
  } else {
    for(size_t i=1; i+1 < n; i++) {
      sum = _mm256_hadd_pd(sum, pv[i]);
      sum = _mm256_permute4x64_pd(sum, _MM_SHUFFLE(3, 1, 2, 0));
    }
    sum = _mm256_hadd_pd(sum, pv[n-1]);
  }
  const __m128d laneSums = _mm_hadd_pd(_mm256_extractf128_pd(sum, 1),
    _mm256_castpd256_pd128(sum));
  return laneSums.m128d_f64[0] + laneSums.m128d_f64[1];
}

Some explanation: it adds adjacent double array items first, such as a[0]+a[1], a[2]+a[3], etc. Then it adds the sums of adjacent items.

Some of the stuff I googled mentioned "pairwise summation". That may be the same thing as what you're doing here. Since you need another shuffle after the `hadd`, can you get it done with manual shuffling to set up for a vertical `add` in a different order? Maybe make a Ryzen version that does more 128b-vector stuff with still a bit of 256b for more front-end uop throughput? — Peter Cordes, Sep 10 '17 at 21:28

score 0 · Answer 3 · answered Sep 09 '17 at 22:06

0

The games you want to play are likely counterproductive. Try experimenting by generating a bunch of iid samples from your favourite distribution, sorting them, and comparing "sum in increasing order" with "sum each lane in increasing order, then sum the lane sums."

For 4 lanes and 16 data, summing lanewise gives me smaller error about 28% of the time while summing in increasing order gives me smaller error about 17% of the time; for 4 lanes and 256 data, summing lanewise gives me smaller error about 68% of the time while summing in increasing order gives me smaller error about 12% of the time. Summing lanewise also beats the algorithm you gave in your self-answer. I used a uniform distribution on [0,1] for this.

answered Sep 09 '17 at 22:06

tmyklebu

13,915
3
28
57

1

That's an interesting observation, but could you post source code for this. It's not clear how you calculated the error. – Serge Rogatch Sep 10 '17 at 06:36
Did you compare with Kahan summation or something to find the error? Or did you do this for `float` and get the no-error sum with `double` or extended-precision? Or something else? – Peter Cordes Sep 10 '17 at 21:24
1

@SergeRogatch: The error in summing a bunch of `double`s is the difference between the computed result and the exact result. It is straightforward to compute the exact result using a library like mpfr. – tmyklebu Sep 10 '17 at 21:35

Efficient stable sum of a sorted array in AVX2

3 Answers3

Your idea is very similar to Pairwise Summation

Linked