How do I take the average of a large floating-point array (100,000+ values) precisely? Ideally utilizing SIMD/AVX instructions. Array pointer in rdi; size of array in rsi.
-
Existing code..? What’s “not precise” about a basic sum + divide? Is relevant information being lost (it can be) with these values? – user2864740 Jul 14 '20 at 02:45
-
The naive idea would be to simply add all values into a register and divide at the end. This obviously causes a large loss in precision. A second idea would be adding and dividing in pairs of 2 floats every time. A third idea would be a Kahan summation. – Valentin Metz Jul 14 '20 at 02:48
-
A fourth idea, if your array type is `float` (the OP doesn't say specifically), would be to use a `double` accumulator to avoid the loss in precision from the sum over many values. – Jason R Jul 14 '20 at 02:50
-
https://stackoverflow.com/q/1930454/2864740 - for some approaches. Although, if the easier method “works well enough”.. – user2864740 Jul 14 '20 at 02:50
-
Using double would work for floats, but I'm looking for a more general approach. (I want to implement a version for doubles as well) If I'm correct the Kahan summation does something similar, but with a lower space requirement and therefore higher AVX efficiency. – Valentin Metz Jul 14 '20 at 02:52
-
The "iterative" approach you linked seems viable as well. I can implement that with SIMD for multiple values at once. – Valentin Metz Jul 14 '20 at 03:02
-
At the moment I'm trying to figure out what would be more precise - an iterative add or a Kahan sum. Both seem like viable approaches. The values I work with are not expected to be on any extreme side, but I do not have a "feel" for the numbers. The algorithm this will be employed in will estimate a "magic number" for calculations. – Valentin Metz Jul 14 '20 at 03:19
-
Note that a plain sum unrolled with multiple SIMD vector accumulators is a step in the direction of pairwise summation. I think you want to maximize accuracy even at the cost of speed, but perhaps other future readers would consider that (because it's also the fastest way to sum an array at all, and generally *more* accurate than naive scalar). [Simd matmul program gives different numerical results](//stackoverflow.com/q/55477701). Also [Efficient stable sum of a sorted array in AVX2](//stackoverflow.com/q/46119811) has some links to SIMD Kahan and other summation algorithms. – Peter Cordes Jul 14 '20 at 04:57
-
@user2864740 When I read the word "precisely", I was thinking that this is almost impossible. Think about the following array: `{ +1E+100, 6.0, -1E+100 }`. If we calculate `((+1E+100) + 6.0 + (-1E+100)) / 3`, we get the result `0` due to the limited precision of floating point. However, if we sum up the elements in a different order: `((+1E+100) + (-1E+100) + 6.0) / 3`, we get the result `2.0`, which is the mathematically exact result. – Martin Rosenau Jul 14 '20 at 05:13
-
@MartinRosenau: Re “almost impossible”: Of course it is not “almost impossible,” aside from confusion about what “almost” means in this use. Can something be 99% impossible? By what metric can something be “almost” impossible? Aside from that, of course it is not impossible at all. There is an exact result, and it can be obtained using a sufficient number of bits or pencil and paper using techniques taught in elementary school. The problem is entirely one of convenience and efficiency, not possibility. – Eric Postpischil Jul 14 '20 at 11:59
-
The question should be clarified to specify what you mean by “precisely”. Does the result need to be the mathematical mean rounded to the nearest representable value? Or is it okay if the result is the mathematical sum rounded to the nearest representable value and then divided by the number of elements and again rounded to the nearest representable value? Or is more error than that okay? How much? Also, if there are any constraints known on the data, such as that it is all positive or all within a certain magnitude, these should be stated. – Eric Postpischil Jul 14 '20 at 12:02
-
@EricPostpischil I didn't write that it **is** "almost impossible", but that I was **thinking** that in the first moment after reading the question. And "almost impossible" meant that in the first moment I doubted that an algorithm exists that provides the correct result in all cases: In the example we have an error of 100% when we sum up the elements in the original order. The example `{ +1E+100, 16, -1E+100, -12}` would even result in an error of 300%. And 300% error is definitely not "precise". – Martin Rosenau Jul 14 '20 at 14:02
-
@MartinRosenau: There is absolutely an algorithm that produces the correct answer in all cases. It is taught to elementary school students. You write out the numbers and add/subtract them column by column, carrying or borrowing as needed. You may be restricting your consideration to single floating-point add/subtract instructions, but a solution does not need to be so limited. – Eric Postpischil Jul 14 '20 at 14:16
-
@EricPostpischil Surely. This is more or less the algorithm using fixed-point that I proposed in my answer. However, this is not "simply adding" and "simply dividing" numbers: The actual difficulties for a programmer using this approach are converting floating- and fixed-point values and performing large (e.g. 2126-bit) integer operations. – Martin Rosenau Jul 14 '20 at 14:35
-
To clarify what I mean by precise: If we could add all floats with infinite precision, then divide and then reduce the result to a 32-bit float again, we'd have what I'd define as the "precise" average of all floats. – Valentin Metz Jul 14 '20 at 16:41
-
Summing 100,000 values: The difference between infinite-precision math, then rounding to `float`, and using `double` math, then rounding to `float`, is statistically nil. – chux - Reinstate Monica Jul 17 '20 at 20:40
3 Answers
> precisely

If precision is more important than speed:
Using floating-point arithmetic you will probably always have a loss of precision.
However, you can calculate the exact value if you use fixed-point arithmetic:
All floating-point values can be expressed as the product of some constant (fixed by the data type used) and a large signed integer value.
In the case of `double`, each value can be expressed as the product of a constant specific to the `double` data type and a 2102-bit signed integer.
If your array has 10 million elements, the sum of all elements can be expressed as the product of that constant and a 2126-bit signed integer. (Because 10 million fits into 24 bits and 2102 + 24 = 2126.)
You can use the same methods that are used to do 32-bit integer arithmetic on an 8-bit CPU to perform 2126-bit integer arithmetic on a 64-bit CPU.
Instead of adding up the floating-point values themselves, you add up the 2102-bit integers representing each floating-point value (here `lsint` is a signed data type that can hold 2126-bit integers):
void addNumber(lsint * sum, double d)
{
    uint64_t di;
    memcpy(&di, &d, sizeof di);          /* reinterpret bits without aliasing UB */
    lsint tmp;
    int ex = (int)((di >> 52) & 0x7FF);  /* biased exponent field */
    if(ex == 0x7FF)
    {
        /* Error: NaN or Inf found! */
    }
    else if(ex == 0)
    {
        /* Denormalized: no implicit leading 1, no shift needed */
        tmp = di & 0xFFFFFFFFFFFFF;
    }
    else
    {
        /* Normalized: restore the implicit leading 1, then scale */
        tmp = di & 0xFFFFFFFFFFFFF;
        tmp |= 0x10000000000000;
        tmp <<= ex - 1;
    }
    if(di & 0x8000000000000000) (*sum) -= tmp;
    else                        (*sum) += tmp;
}
If the sum is negative, negate it (work with the absolute value of the sum); in that case, you have to negate the result (the average) later.
Perform an integer division of the sum (divide it by the number of elements).
Now calculate the (absolute value of the) average from the resulting large integer value:
double lsintToDouble(lsint sum)
{
    uint64_t bits;
    double result;
    if(sum < 0x10000000000000)
    {
        /* Fits in the fraction field: encodes as a denormalized double */
        bits = (uint64_t)sum;
    }
    else
    {
        /* Normalize: shift right until the value fits in 53 bits,
           counting shifts to build the exponent (truncates; no rounding) */
        int ex = 1;
        while(sum >= 0x20000000000000)
        {
            sum >>= 1;
            ex++;
        }
        bits = (uint64_t)sum & 0xFFFFFFFFFFFFF;
        bits |= (uint64_t)ex << 52;
    }
    memcpy(&result, &bits, sizeof result);
    return result;
}
If the sum was negative and you calculate the absolute value, don't forget to negate the result.

-
Re “2102-bit signed integer”: The highest bit position is 2^1024, the lowest bit position is 2^(−1022-52) = 2^−1074, so 2099 bits are needed for the value bits, and one more for the sign makes 2100. Although one must still account for infinities and NaNs. – Eric Postpischil Jul 14 '20 at 09:58
-
@EricPostpischil Thanks for the clarification; however, because 2102>2100, it would at least work with 2102 bits... – Martin Rosenau Jul 14 '20 at 10:00
-
1) Let's just call it good and use `int4096_t`. 2) A lot of work doing the precise addition, yet the conversion from `lsint` to `double` truncates rather than rounds losing a 1/2 ULP - suppose that is LAAETTR. – chux - Reinstate Monica Jul 17 '20 at 20:27
To minimize loss in precision, I use an array of 2048 doubles, indexed by the exponent, which means the code is implementation specific and expects doubles to be IEEE formatted doubles. Numbers are added into the array, only adding numbers with identical exponents. To get the actual sum, the array is then added from smallest to largest.
/* clear array */
void clearsum(double asum[2048])
{
size_t i;
for(i = 0; i < 2048; i++)
asum[i] = 0.;
}
/* add a number into array */
void addtosum(double d, double asum[2048])
{
size_t i;
while(1){
/* i = exponent of d */
i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;
if(i == 0x7fe){ /* max exponent, could be overflow */
asum[i] += d;
return;
}
if(asum[i] == 0.){ /* if empty slot store d */
asum[i] = d;
return;
}
d += asum[i]; /* else add slot to d, clear slot */
asum[i] = 0.; /* and continue until empty slot */
}
}
/* return sum from array */
double returnsum(double asum[2048])
{
double sum = 0.;
size_t i;
for(i = 0; i < 2048; i++)
sum += asum[i];
return sum;
}

-
Nice idea, though far from perfect. E.g., adding `(1+eps) + (1+2eps)` will have a rounding error (which may get significant if you later subtract similar values). A slightly more "straight-forward" approach would be to accumulate using a `(1024)+(1077)` bit fixed-point number (maybe with some extra bits at the top to avoid overflow). – chtz Jul 14 '20 at 07:25
-
By `eps` I mean the machine precision (the smallest number, such that `1+eps > 1`), `1+eps` and `1+2*eps` shall be individual input numbers, they have the same exponent, and their sum will add to `2+3*eps` which is rounded to `2+4*eps` (round-to-even). If you later also add `-2.0`, your result is `4*eps`, while it should be `3*eps`. – chtz Jul 14 '20 at 07:43
-
In the case of the "normal" `double` data type, `i == 0x7FF` would not be the maximum exponent, but `i == 0x7FE` would be; `0x7FF` is used for `NaN` and `Inf` values which indicate errors... – Martin Rosenau Jul 14 '20 at 07:45
-
In case my point was not clear, here is a simple failing example (`naive_sum` happens to work because I artificially ordered the values in a way that it does): https://godbolt.org/z/rx79x7 – chtz Jul 14 '20 at 08:01
-
@chtz: “the smallest number, such that `1+eps > 1`” is an incorrect definition for the “machine precision,” despite it being used widely. For IEEE-754, the correct machine precision is 2^-52, but that definition gives 2^-53+2^-105, since `1+(0x1p-53+0x1p-105)` will yield 1 due to rounding. A correct definition is the difference between 1 and the next greatest representable number. – Eric Postpischil Jul 14 '20 at 09:53
-
Rather than `i = ((size_t)((*(unsigned long long *)&d)>>52))&0x7ff;`, consider `frexp(d, &i); i -= DBL_MIN_EXP;` – chux - Reinstate Monica Jul 17 '20 at 20:32
-
@chux-ReinstateMonica - this is really old code. It may have first been used on an Atari ST (1987). – rcgldr Jul 17 '20 at 22:00
Given OP's:
> The values I work with are not expected to be on any extreme side, but I do not have a "feel" for the numbers

A middle-of-the-road approach for increased precision when the values have the same sign and are within a few orders of magnitude of each other:
Two passes: find a coarse average, then find the average deviation from it.
double average(size_t rsi, const double *rdi) {
    double sum = 0.0;
    for (size_t i = 0; i < rsi; i++) {
        sum += rdi[i];
    }
    double coarse_average = sum / rsi;
    sum = 0.0;
    for (size_t i = 0; i < rsi; i++) {
        sum += rdi[i] - coarse_average;
    }
    double difference_average = sum / rsi;
    return coarse_average + difference_average;
}
