Intro
Kahan summation / compensated summation is technique that addresses compilers´ inability to respect the associative property of numbers. Truncation errors results in (a+b)+c not being exactly equal to a+(b+c) and thus accumulate an undesired relative error on longer series of sums, which is a common obstacle in scientific computing.
Task
I desire the optimal implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.
Attempts
The code below calculates the sum of 1000 random numbers in range [0,1] with three approaches.
Standard summation: Naive implementation which accumulates a root mean square relative error that grows as O(sqrt(N))
Kahan summation [g++]: Compensated summation using the c/c++ function "csum". Explanation in comments. Note that some compilers may have default flags that invalidate this implementation (see output below).
Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
extern "C" void csumasm(double&, double, double&);
__asm__(
"csumasm:\n"
"movsd (%rcx), %xmm0\n" //xmm0 = a
"subsd (%r8), %xmm1\n" //xmm1 - r8 (c) | y = b-c
"movapd %xmm0, %xmm2\n"
"addsd %xmm1, %xmm2\n" //xmm2 + xmm1 (y) | b = a+y
"movapd %xmm2, %xmm3\n"
"subsd %xmm0, %xmm3\n" //xmm3 - xmm0 (a) | b - a
"movapd %xmm3, %xmm0\n"
"subsd %xmm1, %xmm0\n" //xmm0 - xmm1 (y) | - y
"movsd %xmm0, (%r8)\n" //xmm0 to c
"movsd %xmm2, (%rcx)\n" //b to a
"ret\n"
);
void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
double y = b-c; //y is the correction of b argument
b = a+y; //add corrected b argument to a argument. The output of the current summation
c = (b-a)-y; //find new error to be passed as a compensation term
a = b;
}
double fun(double fMin, double fMax){
double f = (double)rand()/RAND_MAX;
return fMin + f*(fMax - fMin); //returns random value
}
int main(int argc, char** argv) {
int N = 1000;
srand(0); //use 0 seed for each method
double sum1 = 0;
for (int n = 0; n < N; ++n)
sum1 += fun(0,1);
srand(0);
double sum2 = 0;
double c = 0; //compensation term
for (int n = 0; n < N; ++n)
csum(sum2,fun(0,1),c);
srand(0);
double sum3 = 0;
c = 0;
for (int n = 0; n < N; ++n)
csumasm(sum3,fun(0,1),c);
printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
return 0;
}
The output with -O3 is:
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902127e+002 (error: 0.0000000000000000e+000)
Kahan compensated summation [asm]:
5.1991955320902127e+002
The output with -O3 -ffast-math
Standard summation:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [g++]:
5.1991955320902093e+002 (error: -3.4106051316484809e-013)
Kahan compensated summation [asm]:
5.1991955320902127e+002
It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.
Question
Is it possible to construct a better/faster asm x64 code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?
If no better asm codes are possible, is there a c++ way to implement Kahan summation that can be used with -ffast-math without devolving to the naive summation? Perhaps a c++ implementation is generally more flexible for the compiler to optimize.
Ideas or suggestions are appreciated.
Further information
- The contents of "fun" cannot be inlined, but the "csum" function could be.
- The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
- The intended summation function is called indefinitely and several hundred million times per second, which motives the pursuit of a high performance low-level implementation.
- Higher precision arithmetic such as long double, float128 or arbitrary precision libraries are not to be considered as higher precision solutions due to performance reasons.
Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)
subsd xmm0, QWORD PTR [rsp+32]
movapd xmm1, xmm3
addsd xmm3, xmm0
movsd QWORD PTR [rsp+16], xmm3
subsd xmm3, xmm1
movapd xmm1, xmm3
subsd xmm1, xmm0
movsd QWORD PTR [rsp+32], xmm1