
I have a fairly long list of positive floating-point numbers (std::vector<float>, size ~1000). The numbers are sorted in decreasing order. If I sum them in that order:

for (auto v : vec) { sum += v; }

I guess I can have some numerical stability problems, since close to the end of the vector sum will be much larger than v. The easiest solution would be to traverse the vector in reverse order. My question is: is that as efficient as the forward case? Will I have more cache misses?

Is there any other smart solution?

Ruggero Turra

5 Answers


I benchmarked your use case and the results (see the plot below) suggest that it makes no measurable performance difference whether you loop forward or backward.

You might want to measure on your hardware + compiler as well.


Using the STL to perform the sum is as fast as manually looping over the data, but much more expressive.

Use the following for reverse accumulation:

std::accumulate(rbegin(data), rend(data), 0.0f);

and the following for forward accumulation:

std::accumulate(begin(data), end(data), 0.0f);

[benchmark plot: forward and reverse accumulation times were effectively identical]
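For reference, here is a minimal sketch of how such a benchmark can be set up with std::chrono. The data generation, repeat count, and timing method are illustrative assumptions, not the exact harness behind the plot:

#include <algorithm>
#include <chrono>
#include <functional>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    // Build a decreasing vector of ~1000 positive floats, as in the question.
    std::vector<float> data(1000);
    std::mt19937 gen{42};
    std::uniform_real_distribution<float> dist{0.0f, 1000.0f};
    std::generate(data.begin(), data.end(), [&] { return dist(gen); });
    std::sort(data.begin(), data.end(), std::greater<float>{});

    // Time many repetitions of a summation over [first, last).
    auto time_ms = [](auto first, auto last) {
        volatile float sum = 0.0f; // volatile keeps the optimizer from eliding the loop
        auto start = std::chrono::steady_clock::now();
        for (int i = 0; i < 100000; ++i)
            sum = std::accumulate(first, last, 0.0f);
        auto stop = std::chrono::steady_clock::now();
        return std::chrono::duration<double, std::milli>(stop - start).count();
    };

    std::cout << "forward:  " << time_ms(data.begin(), data.end()) << " ms\n";
    std::cout << "backward: " << time_ms(data.rbegin(), data.rend()) << " ms\n";
}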

Davide Spataro

I guess I can have some numerical stability problems

So test for it. Currently you have a hypothetical problem, which is to say, no problem at all.

If you test, and the hypothetical materializes into an actual problem, then you should worry about actually fixing it.

That is: floating-point precision can cause issues, but you can confirm whether it really does for your data before prioritizing it over everything else.
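A sketch of such a test, under the assumption that accumulating the same ~1000 floats in double precision is accurate enough to serve as a reference:

#include <cmath>
#include <iostream>
#include <numeric>
#include <vector>

// Compare the plain float sum against a double-precision reference.
// For a vector of ~1000 floats the double sum loses essentially nothing,
// so the difference approximates the error of the float sum.
void check_accuracy(const std::vector<float>& data)
{
    float forward = std::accumulate(data.begin(), data.end(), 0.0f);
    double reference = std::accumulate(data.begin(), data.end(), 0.0);

    double abs_err = std::abs(static_cast<double>(forward) - reference);
    std::cout << "absolute error: " << abs_err
              << ", relative error: " << abs_err / std::abs(reference) << '\n';
}

If the reported error is within your tolerance, the simple forward loop is fine and there is nothing to fix.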

... Will I have more cache misses?

One thousand floats is 4 KB - it'll fit in cache on a modern mass-market system (if you have another platform in mind, tell us what it is).

The only risk is that the prefetcher won't help you when iterating backwards, but of course your vector may already be in cache. You can't really determine this until you profile in the context of your full program, so there's no use worrying about it until you have a full program.

Is there any other smart solution?

Don't worry about things that might become problems, until they actually become problems. At most it's worth noting possible issues, and structuring your code so that you can replace the simplest-possible solution with a carefully-optimized one later, without re-writing everything else.

Useless

For this purpose you can use a reverse iterator, without any rearrangement of your std::vector<float> vec:

float sum{0.f};
for (auto rIt = vec.rbegin(); rIt != vec.rend(); ++rIt)
{
    sum += *rIt;
}

Or do the same job using the standard algorithm:

float sum = std::accumulate(vec.crbegin(), vec.crend(), 0.f);

Performance should be the same; only the traversal direction of your vector changes.

Malov Vladimir
  • Correct me if I'm wrong, but I think this is even more efficient than the range-for statement the OP is using, as that introduces overhead. YSC is right about the numerical stability part, though. – sephiroth Nov 13 '19 at 13:47
  • @sephiroth No, any half-decent compiler won't really care whether you wrote a range-for or an iterator for. – Max Langhof Nov 13 '19 at 13:48
  • Real-world performance is decidedly not guaranteed to be the same, due to caches/prefetching. It's reasonable for the OP to be wary of that. – Max Langhof Nov 13 '19 at 13:49

The easiest solution would be to traverse the vector in reverse order. My question is: is that as efficient as the forward case? Will I have more cache misses?

Yes, it is efficient. Branch prediction and your hardware's smart cache strategy are tuned for sequential access. You can safely accumulate your vector:

#include <numeric>

auto const sum = std::accumulate(crbegin(v), crend(v), 0.f);
YSC
  • Can you clarify: in this context "sequential access" means forward, backward, or both? – Ruggero Turra Nov 13 '19 at 13:54
  • @RuggeroTurra I cannot unless I can find a source, and I'm not in the mood to read CPU datasheets right now. – YSC Nov 13 '19 at 14:01
  • @RuggeroTurra Usually sequential access would mean forwards. All semi-decent memory prefetchers catch forward sequential access. – Toothbrush Nov 13 '19 at 14:32
  • @Toothbrush, thanks. So, if I loop backward, in principle, it can be a performance issue – Ruggero Turra Nov 13 '19 at 14:38
  • In principle, on at least some hardware, if the whole vector is not _already_ in L1 cache. – Useless Nov 13 '19 at 16:40
  • @RuggeroTurra Yes, as `@Useless` says, at least on some hardware. However, unless you have a very large dataset, it will usually fit in the caches and so you won't really notice it. – Toothbrush Nov 14 '19 at 13:55

If by numerical stability you mean accuracy, then yes, you may end up with accuracy issues. Depending on the ratio of the largest to the smallest values, and on your requirements for accuracy in the result, this may or may not be a problem.

If you do want to have high accuracy, then consider Kahan summation - this uses an extra float for error compensation. There is also pairwise summation.
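For illustration, here is a minimal Kahan summation sketch (the function name and signature are mine, not from any particular library):

#include <vector>

// Kahan (compensated) summation: c carries the low-order bits that are
// lost each time a small term is added to a much larger running sum.
float kahan_sum(const std::vector<float>& data)
{
    float sum = 0.0f;
    float c = 0.0f;          // running compensation
    for (float v : data) {
        float y = v - c;     // apply the correction to the incoming term
        float t = sum + y;   // the low-order bits of y are lost here...
        c = (t - sum) - y;   // ...and recovered here
        sum = t;
    }
    return sum;
}

Beware that aggressive floating-point optimizations such as -ffast-math may rewrite (t - sum) - y as zero and silently remove the compensation, so compile such code with strict floating-point semantics.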

For detailed analysis of the tradeoff between accuracy and time, see this article.

UPDATE for C++17:

A few of the other answers mention std::accumulate. Since C++17 there are execution policies which allow algorithms to be parallelized.

For instance

#include <vector>
#include <execution>
#include <iostream>
#include <numeric>

int main()
{  
   std::vector<double> input{0.1, 0.9, 0.2, 0.8, 0.3, 0.7, 0.4, 0.6, 0.5};

   double reduceResult = std::reduce(std::execution::par, std::begin(input), std::end(input));

   std::cout << "reduceResult " << reduceResult << '\n';
}
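
Depending on your toolchain you may also need a threading backend: GCC's libstdc++, for example, implements the parallel execution policies on top of Intel TBB, so you may have to link with -ltbb.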

This should make summing large datasets faster at the cost of nondeterministic rounding errors (I'm assuming that the user won't be able to determine the thread partitioning).

Paul Floyd