Fastest way to sum array of float values

Question

I am doing DSP coding using Visual Studio and C++.

I have an array of floats (only 8 right now, but that number may change later) that I need to sum into a single float variable and then average.

I would like to use intrinsics, but I have no experience with them, which is why I am asking here.

All that is required is that the code is faster than what I have below, and that it works on Intel and AMD processors from, say, the past 5 years.

Note that all the float values in the array are between -1 and 1, and speed is more important than precision.

float sum = (sampleValue[0] + sampleValue[1] + sampleValue[2] + sampleValue[3] +
             sampleValue[4] + sampleValue[5] + sampleValue[6] + sampleValue[7]) / 8;

I apologize if this question has already been answered, and if so please direct me to the answer, thanks.

Also, if somebody can direct me to an "Intrinsic functions for dummies" online article or tutorial, it would be much appreciated. Thanks!

I would use [std::reduce](https://en.cppreference.com/w/cpp/algorithm/reduce). It is probably implemented in an efficient way using fast instructions available in your CPU. — pptaszni, Mar 02 '21 at 16:37
std::reduce does not seem available in my compiler, although std::accumulate is. I am not quite sure how to use it though. — DKDiveDude, Mar 02 '21 at 16:45
`float result = std::accumulate(sampleValue.begin(), sampleValue.end(), 0.0);` — pptaszni, Mar 02 '21 at 16:47
Ouch now I have to convert a lot of my code from standard to vector arrays. — DKDiveDude, Mar 02 '21 at 16:57
Also, for simplicity's sake, I did not mention that my sample array is actually multi-dimensional, and I don't see how to declare a multi-dimensional array using std::vector. – DKDiveDude, Mar 02 '21 at 17:09
2D vector declared like this: std::vector< std::vector< float > > — ravenspoint, Mar 02 '21 at 17:10
@DKDiveDude `std::reduce` is available from C++17, you might need to tell your compiler to use C++17. — Lukas-T, Mar 02 '21 at 17:11
@ravenspoint That's suboptimal, because it contains two levels of indirection. Storing the data in a single `std::vector` and calculating the appropriate offsets should be faster, because it's more cache friendly. — Lukas-T, Mar 02 '21 at 17:20
@DKDiveDude: `std::accumulate` will work on any iterator pair, and pointers are iterators too. No need to change from `float[]`. — MSalters, Mar 02 '21 at 17:24
The linked answer is a bit tricky. A single 8-value add is not going to be a performance bottleneck. If you do many, then you should first consider how your data is organized. — MSalters, Mar 02 '21 at 17:28
@MSalters Perhaps not a bottleneck, however the operation is performed anywhere from 48000 per second (1 unison voice) to 768000 per second (16 unison voices). — DKDiveDude, Mar 02 '21 at 17:54
@pptaszni: you'd normally want to use `0.0f`, not `0.0`, so the accumulation type is `float` not `double`, not forcing the compiler to convert float to double on the fly. But to let the compiler vectorize (by starting with `addps` to reduce 8 to 4 elements), you'd either need a "fast-math" option or give it permission for this specific loop. — Peter Cordes, Mar 03 '21 at 03:03
@Peter Cordes I would love to get std::accumulate to work, but I have been unable to figure out how to use it with a multi-dimensional "standard" C array such as sampleValue[8][16][8], where either one of the [8] dimensions could be configured to hold the values that need to be summed and averaged. – DKDiveDude, Mar 03 '21 at 18:46
`sampleValue[z][y][0..7]` should be easy because `sampleValue[z][y]` is a simple array of 8 floats (contiguous). `sampleValue[0..7][y][x]` is not contiguous: the floats are separated by a large stride, and every access to them has to go through that full indexing expression. So you'd need a custom iterator to use with std::accumulate. You generally want to avoid doing that for locality reasons, and also SIMD reasons. (The only thing worse would be if that's an array of pointers-to-pointers, not a true multidimensional C array.) – Peter Cordes, Mar 04 '21 at 02:40
For tiny arrays, I don't expect `std::accumulate` will buy you anything. I don't think it helps you benefit from SIMD, unless possibly the "unsequenced" execution policy allows SIMD as well as threading. (You don't *want* it to use threads for a tiny array, but you do need the compiler to do something other than `total = a[0] + a[1] + a[2] + ...` with strict FP eval order semantics to get optimal use of SIMD) — Peter Cordes, Mar 04 '21 at 02:42
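The two access patterns discussed in the comments above can be sketched as follows. This is only an illustration under assumptions: the `sampleValue[8][16][8]` layout is taken from the comment thread, while the function names `average_contiguous` and `average_strided` and the constants `kZ`, `kY`, `kX` are hypothetical. The contiguous case passes plain pointers to `std::accumulate` with a `0.0f` initial value; the strided case falls back to an ordinary indexed loop instead of a custom iterator.

```cpp
#include <cstddef>
#include <iterator>
#include <numeric>

// Assumed layout from the comments: sampleValue[8][16][8] (names are hypothetical).
constexpr std::size_t kZ = 8, kY = 16, kX = 8;

// Average over the last dimension, sampleValue[z][y][0..7].
// These 8 floats are contiguous, and raw pointers are valid iterators,
// so no std::vector is required.
float average_contiguous(const float (&sampleValue)[kZ][kY][kX],
                         std::size_t z, std::size_t y) {
    const float* first = std::begin(sampleValue[z][y]);
    const float* last = std::end(sampleValue[z][y]);
    // 0.0f keeps the accumulator a float; 0.0 would make it a double.
    return std::accumulate(first, last, 0.0f) / static_cast<float>(kX);
}

// Average over the first dimension, sampleValue[0..7][y][x].
// These floats are kY * kX elements apart, so a plain indexed loop is used.
float average_strided(const float (&sampleValue)[kZ][kY][kX],
                      std::size_t y, std::size_t x) {
    float sum = 0.0f;
    for (std::size_t z = 0; z < kZ; ++z)
        sum += sampleValue[z][y][x];
    return sum / static_cast<float>(kZ);
}
```

Neither function is guaranteed to be vectorized by the compiler; as noted in the comments, strict floating-point evaluation order usually prevents that unless a fast-math style option is enabled.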

score 1 · Answer 1 · answered Mar 02 '21 at 17:26

I assume you are thinking of SIMD (single instruction multiple data) operations.

Searching for "SIMD intrinsics" will get you plenty of resources, but here's a nice starter one: https://stackoverflow.blog/2020/07/08/improving-performance-with-simd-intrinsics-in-three-use-cases/

This article is also closer to your use case: http://blog.zachbjornson.com/2019/08/11/fast-float-summation.html
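As a starting point, here is a minimal sketch of what an intrinsics version for exactly 8 floats might look like. It is not taken from either linked article; the function name `average8_sse` is hypothetical, and it only assumes baseline SSE, which every x86-64 processor supports. Whether it actually beats the plain scalar expression from the question has to be measured in the real code.

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 / x86-64 only)

// Sums exactly 8 floats with SSE and returns their average.
// The additions are reordered compared with a left-to-right scalar sum,
// so the result may differ slightly in the last bits.
inline float average8_sse(const float* p) {
    __m128 lo = _mm_loadu_ps(p);      // p[0..3], unaligned load
    __m128 hi = _mm_loadu_ps(p + 4);  // p[4..7]
    __m128 v = _mm_add_ps(lo, hi);    // 4 partial sums: v0..v3

    // Horizontal sum of the 4 lanes of v.
    __m128 upper = _mm_movehl_ps(v, v);   // lanes: v2, v3, v2, v3
    __m128 pair = _mm_add_ps(v, upper);   // lane 0 = v0+v2, lane 1 = v1+v3
    __m128 lane1 = _mm_shuffle_ps(pair, pair, _MM_SHUFFLE(1, 1, 1, 1));  // broadcast lane 1
    __m128 total = _mm_add_ss(pair, lane1);  // lane 0 = v0+v1+v2+v3

    return _mm_cvtss_f32(total) * 0.125f;    // multiply by 1/8 to average
}
```

It would be called as `float avg = average8_sse(sampleValue);`. If the array length changes, the two loads would become a loop over groups of 4 floats with a scalar tail for any remainder.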

ravenspoint · Answer 2 · 2021-03-02T17:50:34.640

0

Suggest using a pointer

float sum = 0;
float* p = sampleValue;
for( int k = 0; k< 8; k++ )
    sum += *p++;

edited Mar 02 '21 at 17:50

answered Mar 02 '21 at 16:58

ravenspoint


Thanks for chiming in, but is there not an intrinsic vector-type function that can sum an entire array? – DKDiveDude Mar 02 '21 at 17:01
Under the covers, any such function will likely use this code but with the overhead of a function call - possibly compiler optimized away. – ravenspoint Mar 02 '21 at 17:03
Tested and counted high res ticks inside loop that performs this operation. My own simple version was 9% faster. – DKDiveDude Mar 02 '21 at 17:47
