
Intro
Kahan summation (compensated summation) is a technique that reduces the numerical error accumulated when summing long series of floating-point numbers. Because floating-point addition is not associative, rounding means that (a+b)+c is generally not exactly equal to a+(b+c); over many additions these rounding errors accumulate into an undesired relative error, which is a common obstacle in scientific computing.

Task
I am looking for the fastest correct implementation of Kahan summation. I suspect that the best performance may be achieved with handcrafted assembly code.

Attempts
The code below calculates the sum of 1000 random numbers in the range [0,1] using three approaches.

  1. Standard summation: A naive implementation whose root-mean-square relative error grows as O(sqrt(N))

  2. Kahan summation [g++]: Compensated summation using the C++ function "csum" (explanation in comments). Note that some compilers' default flags may invalidate this implementation (see output below).

  3. Kahan summation [asm]: Compensated summation implemented as "csumasm" using the same algorithm as "csum". Cryptic explanation in comments.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

extern "C" void csumasm(double&, double, double&);
__asm__(
    "csumasm:\n"
    "movsd  (%rcx), %xmm0\n" //xmm0 = a
    "subsd  (%r8), %xmm1\n"  //xmm1 - r8 (c) | y = b-c
    "movapd %xmm0, %xmm2\n"  
    "addsd  %xmm1, %xmm2\n"  //xmm2 + xmm1 (y) | b = a+y
    "movapd %xmm2, %xmm3\n" 
    "subsd  %xmm0, %xmm3\n"  //xmm3 - xmm0 (a) | b - a
    "movapd %xmm3, %xmm0\n"  
    "subsd  %xmm1, %xmm0\n"  //xmm0 - xmm1 (y) | - y
    "movsd  %xmm0, (%r8)\n"  //xmm0 to c
    "movsd  %xmm2, (%rcx)\n" //b to a
    "ret\n"
);

void csum(double &a,double b,double &c) { //this function adds a and b, and passes c as a compensation term
    double y = b-c; //y is the correction of b argument
    b = a+y; //add corrected b argument to a argument. The output of the current summation
    c = (b-a)-y; //find new error to be passed as a compensation term
    a = b;
}

double fun(double fMin, double fMax){
    double f = (double)rand()/RAND_MAX;
    return fMin + f*(fMax - fMin); //returns random value
}

int main(int argc, char** argv) {
    int N = 1000;

    srand(0); //use 0 seed for each method
    double sum1 = 0;
    for (int n = 0; n < N; ++n)
        sum1 += fun(0,1);

    srand(0);
    double sum2 = 0;
    double c = 0; //compensation term
    for (int n = 0; n < N; ++n)
        csum(sum2,fun(0,1),c);

    srand(0);
    double sum3 = 0;
    c = 0;
    for (int n = 0; n < N; ++n)
        csumasm(sum3,fun(0,1),c);

    printf("Standard summation:\n %.16e (error: %.16e)\n\n",sum1,sum1-sum3);
    printf("Kahan compensated summation [g++]:\n %.16e (error: %.16e)\n\n",sum2,sum2-sum3);
    printf("Kahan compensated summation [asm]:\n %.16e\n",sum3);
    return 0;
}

The output with -O3 is:

Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [g++]:
 5.1991955320902127e+002 (error: 0.0000000000000000e+000)

Kahan compensated summation [asm]:
 5.1991955320902127e+002

The output with -O3 -ffast-math is:

Standard summation:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [g++]:
 5.1991955320902093e+002 (error: -3.4106051316484809e-013)

Kahan compensated summation [asm]:
 5.1991955320902127e+002

It is clear that -ffast-math destroys the Kahan summation arithmetic, which is unfortunate because my program requires the use of -ffast-math.

Question

  1. Is it possible to construct better/faster x64 asm code for Kahan's compensated summation? Perhaps there is a clever way to skip some of the movapd instructions?

  2. If no better asm code is possible, is there a C++ way to implement Kahan summation that can be used with -ffast-math without devolving to naive summation? Perhaps a C++ implementation is generally more flexible for the compiler to optimize.

Ideas or suggestions are appreciated.

Further information

  • The contents of "fun" cannot be inlined, but the "csum" function could be.
  • The sum must be calculated as an iterative process (the corrected term must be applied on every single addition). This is because the intended summation function takes an input that depends on the previous sum.
  • The intended summation function is called indefinitely, several hundred million times per second, which motivates the pursuit of a high-performance low-level implementation.
  • Higher-precision arithmetic such as long double, __float128, or arbitrary-precision libraries is ruled out for performance reasons.

Edit: Inlined csum (doesn't make much sense without the full code, but just for reference)

        subsd   xmm0, QWORD PTR [rsp+32]
        movapd  xmm1, xmm3
        addsd   xmm3, xmm0
        movsd   QWORD PTR [rsp+16], xmm3
        subsd   xmm3, xmm1
        movapd  xmm1, xmm3
        subsd   xmm1, xmm0
        movsd   QWORD PTR [rsp+32], xmm1
Egeris
  • It's not surprising that `-ffast-math` breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. What do you mean by saying "generally seems to be an issue with the `-O3` flag"? G++'s default is strict FP math. If you have another version of `csum` that breaks with `g++ -O3` (but works with `g++ -O0`), that might be a compiler bug but could be a bug in your code. – Peter Cordes Aug 01 '19 at 04:06
  • I believe you are right. At least I can confirm that for my current compilers. However, there might be some compilers that will invalidate Kahan arithmetic: https://en.wikipedia.org/wiki/Kahan_summation_algorithm#Possible_invalidation_by_compiler_optimization – Egeris Aug 01 '19 at 04:31
  • Intel's C compiler enables fast-math by default. It has an option for strict-fp. So it's similar to gcc except the default is `-ffast-math`. (This makes ICC look good in benchmarks). It would violate the ISO C++ standard for a compiler to break Kahan summation with fast-math disabled. C++ is like C: the FP math rules prohibit unsafe math optimizations. fast-math options relax the ISO C++ rules, but all compilers support no-fast-math whether it's the default or not. – Peter Cordes Aug 01 '19 at 04:38
  • Related [question](https://stackoverflow.com/questions/37486225/kahan-summation-algorithm-has-big-computing-error-when-it-is-compiled-by-gcc) – 1201ProgramAlarm Aug 01 '19 at 04:46
  • @PeterCordes Thanks for clarification. Updated post to reduce confusion. It certainly makes more sense that the default flags are responsible for Kahan arithmetic invalidation rather than the -O3 flag itself. – Egeris Aug 01 '19 at 05:01
  • "my program requires the use of -ffast-math" what does that mean? That it runs too slowly without fast-math? What "performance reasons" do you see against using the 80-bit x87 floating point? – Marc Glisse Aug 01 '19 at 06:53
  • @MarcGlisse -ffast-math provides a significant boost to numerical integration tasks. There's nothing wrong with 80-bit x87, but I will use that for the same tasks assisted by 80-bit x87 Kahan summation. And likewise use __float128 assisted by __float128 Kahan summation, can you imagine? It's too expensive NOT to use Kahan summation. – Egeris Aug 01 '19 at 07:27
  • Your comment makes me think you might be under the impression that double-double (Kahan) makes your computation exact. It does not. It only simulates extra precision compared to a single double (like you can simulate a 128bit integer with 2 64bit integers). A fair comparison would be double-double vs plain __float128 (which indeed is slow, except on some POWER computers). A natural question is, if double is not precise enough for your application but double-double is, then what about the intermediate 80bit type... – Marc Glisse Aug 01 '19 at 07:57
  • Your "Inlined csum" still ends with a `ret`. Therefore it's obviously *not* inlined into a loop. It looks like you just converted it to taking and returning `a` by value instead of by reference and this is still a stand-alone definition. – Peter Cordes Aug 01 '19 at 08:08
  • Mistake. The posted one is another code, where I make the function return a rather than using the reference (worth mentioning). I have difficulties tracking which instructions actually belong to the inlined csum. – Egeris Aug 01 '19 at 08:32
  • @Egeris: If you put your code on https://godbolt.org/ you can get color coded asm output, and hover the mouse over a line to highlight the corresponding line(s) in the other window. Or do that locally by compiling with `-O3 -g` and disassemble with `objdump` interleaving source + asm. – Peter Cordes Aug 01 '19 at 08:46
  • @MarcGlisse It would be obscure to think that floating point arithmetic can be exact. However, anyone who has used Kahan summation will know that it can improve the accuracy of common tasks like numerical integration by several orders of magnitude. If I need ~18 digits of precision, the 80bit type would be my choice, but also assisted by Kahan summation to avoid the undesired growth of the relative error. __float128 is in my experience about a hundred times slower than double on standard hardware, so that is certainly less applicable than the 80bit type. – Egeris Aug 01 '19 at 08:48

1 Answer


You can put functions that need to not use -ffast-math (like a csum loop) in a separate file that gets compiled without -ffast-math.

Possibly you could also use __attribute__((optimize("no-fast-math"))), but https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html says that optimization-level pragmas and attributes aren't "suitable in production code", unfortunately.
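For reference, that attribute approach might look something like this (my sketch, not code from the question; GCC-specific, and per the docs above not guaranteed suitable for production):

```cpp
#include <cstddef>

// Sketch: a scalar Kahan loop that opts out of fast-math for just this
// function, so the rest of the translation unit can still be compiled
// with -ffast-math. GCC-specific; other compilers ignore the attribute
// (with a warning), in which case -ffast-math would still break it.
__attribute__((optimize("no-fast-math")))
double kahan_sum(const double *x, std::size_t n) {
    double sum = 0.0, c = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double y = x[i] - c;  // apply the running compensation
        double t = sum + y;
        c = (t - sum) - y;    // recover the rounding error of this addition
        sum = t;
    }
    return sum;
}
```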

Update: apparently part of the question was based on a misunderstanding that -O3 wasn't safe. It is: ISO C++ specifies FP math rules that are like GCC's -fno-fast-math. Compiling everything with just -O3 makes the OP's code run quickly and safely. See the bottom of this answer for workarounds like OpenMP that get some of the benefit of fast-math for some parts of your code without actually enabling -ffast-math.

ICC defaults to fast-math, so you have to specifically enable FP=strict for it to be safe with -O3, but gcc/clang default to fully strict FP regardless of other optimization settings (except -Ofast = -O3 -ffast-math).


You should be able to vectorize Kahan summation by keeping a vector (or four) of totals and an equal number of vectors of compensations. You can do that with intrinsics (as long as you don't enable fast-math for that file).

e.g. use SSE2 __m128d for 2 packed additions per instruction. Or AVX __m256d. On modern x86, addpd / subpd have the same performance as addsd and subsd (1 uop, 3 to 5 cycle latency depending on microarchitecture: https://agner.org/optimize/).

So you're effectively doing 8 compensated summations in parallel, each sum getting every 8th input element.
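As a rough illustration of that idea (my sketch, not code from this answer): two interleaved Kahan sums in one `__m128d`, combined naively at the end. This file would need to be compiled without fast-math, since the compensation algebra is exactly what fast-math optimizes away.

```cpp
#include <immintrin.h>
#include <cstddef>

// Two independent Kahan running sums, one per 64-bit lane: even-indexed
// elements accumulate in lane 0, odd-indexed elements in lane 1.
double kahan_sum_sse2(const double *x, std::size_t n) {
    __m128d sum  = _mm_setzero_pd();  // two partial sums
    __m128d comp = _mm_setzero_pd();  // two compensation terms
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d b = _mm_loadu_pd(x + i);
        __m128d y = _mm_sub_pd(b, comp);           // y = b - c
        __m128d t = _mm_add_pd(sum, y);            // t = sum + y
        comp = _mm_sub_pd(_mm_sub_pd(t, sum), y);  // c = (t - sum) - y
        sum  = t;
    }
    // Combine lanes; a simple uncompensated final reduction is usually
    // acceptable since it only adds O(1) rounding errors.
    double s[2], c[2];
    _mm_storeu_pd(s, sum);
    _mm_storeu_pd(c, comp);
    double total = (s[0] - c[0]) + (s[1] - c[1]);
    if (i < n)          // odd leftover element
        total += x[i];
    return total;
}
```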

Generating random numbers on the fly with your fun() is significantly slower than reading them from memory. If your normal use-case has data in memory, you should be benchmarking that. Otherwise I guess scalar is interesting.


If you're going to use inline asm, it would be much better to actually use it inline so you can get multiple inputs and multiple outputs in XMM registers with Extended asm, not stored/reloaded through memory.
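That might look roughly like this (my sketch, untested; GNU Extended asm with "x" constraints so all three values stay in XMM registers and nothing goes through memory). x86-64 GCC/clang only:

```cpp
// One Kahan step with register-only operands: the compiler can keep a and
// c live in XMM registers across loop iterations. "x" = any XMM register,
// "=&x" = early-clobber scratch register for the temporary.
static inline void csum_inline(double &a, double b, double &c) {
    double t;
    __asm__(
        "subsd  %[c], %[b]\n\t"  // b -= c         (y = b - c)
        "movapd %[a], %[t]\n\t"
        "addsd  %[b], %[t]\n\t"  // t = a + y
        "movapd %[t], %[c]\n\t"
        "subsd  %[a], %[c]\n\t"  // c = t - a
        "subsd  %[b], %[c]\n\t"  // c = (t - a) - y
        "movapd %[t], %[a]"      // a = t
        : [a] "+x" (a), [b] "+x" (b), [c] "+x" (c), [t] "=&x" (t));
}
```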

Defining a stand-alone function that actually takes args by reference looks pretty performance-defeating. (Especially when it doesn't even return either of them as a return value to avoid one of the store/reload chains). Even just making a function call introduces a lot of overhead by clobbering many registers. (Not as bad in Windows x64 as in x86-64 System V where all the XMM regs are call-clobbered, and more of the integer regs.)

Also your stand-alone function is specific to the Windows x64 calling convention so it's less portable than inline asm inside a function would be.

And BTW, clang managed to implement csum(double&, double, double&): with only two movapd instructions, instead of the 3 in your asm (which I assume you copied from GCC's asm output). https://godbolt.org/z/lw6tug. If you can assume AVX is available, you can avoid them entirely: the non-destructive 3-operand VEX encodings (vsubsd, vaddsd) make the register-copy instructions unnecessary.

And BTW, movaps is 1 byte smaller than movapd and should be used instead; no CPU has had separate data domains / forwarding networks for double vs. float, only vec-FP vs. vec-int (vs. GP integer).

But really, by far your best bet is to get GCC to compile a file or function without -ffast-math. https://gcc.gnu.org/wiki/DontUseInlineAsm. That lets the compiler avoid the movaps instructions when AVX is available, besides letting it optimize better when unrolling.

If you're willing to accept the overhead of a function-call for every element, you might as well let the compiler generate that asm by putting csum in a separate file. (Hopefully link-time optimization respects -fno-fast-math for one file, perhaps by not inlining that function.)

But it would be much better to disable fast-math for the whole function containing the summation loop by putting it in a separate file. You may be stuck choosing where non-inline function-call boundaries need to be, based on compiling some code with fast-math and others without.

Ideally compile all of your code with -O3 -march=native, and profile-guided optimization. Also -flto link-time optimization to enable cross-file inlining.


It's not surprising that -ffast-math breaks Kahan summation: treating FP math as associative is one of the main reasons to use fast-math. If you need other parts of -ffast-math like -fno-math-errno and -fno-trapping-math so math functions can inline better, then enable those manually. Those are basically always safe and a good idea; nobody checks errno after calling sqrt so that requirement to set errno for some inputs is just a terrible misdesign of C that burdens implementations unnecessarily. GCC's -ftrapping-math is on by default even though it's broken (it doesn't always exactly reproduce the number of FP exceptions you'd get if you unmasked any) so it should really be off by default. Turning it off doesn't enable any optimizations that would break NaN propagation, it only tells GCC that the number of exceptions isn't a visible side-effect.

Or maybe try -ffast-math -fno-associative-math for your Kahan summation file, but that's the main one that's needed to auto-vectorize FP loops that involve reductions, and helps in other cases. But still, there are several other valuable optimizations that you'd still get.


Another way to get optimizations that normally require fast-math is #pragma omp simd to enable auto-vectorization with OpenMP even in files compiled without auto-vectorization. You can declare an accumulator variable for a reduction to let gcc reorder operations on it as if they were associative.
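As a sketch of that idea (my example; compile with -fopenmp or -fopenmp-simd, no fast-math needed):

```cpp
#include <cstddef>

// The reduction clause explicitly grants the compiler permission to
// reassociate additions on "sum" into vector partial sums for this one
// loop, without enabling -ffast-math for anything else. Without OpenMP
// support the pragma is simply ignored and the loop stays scalar.
double simd_sum(const double *x, std::size_t n) {
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```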

Peter Cordes
  • Unfortunately, my task (unlike the example) involves a summation that cannot be computed in parallel. Each function call ("fun") takes the output of the previous function call as input. This is why my current approach has been a stand-alone function. Your concern about inline assembly is justified, and I would not trust any asm solution without thorough benchmark tests. I will try your suggestions on the -ffast-math solutions. – Egeris Aug 01 '19 at 04:44
  • @Egeris: So your real use-case is like if this needed the running total as an input to `fun()`? I still don't see a reason to use inline asm unless the real `fun()` definitely needs it, and also needs to inline so you can't just put the loop in a file compiled without fast-math. Is your real `fun()` slow enough to hide FP latency of Kahan summation + store/reload? Or does that dependency chain still bottleneck your overall performance? If the latter, then you need both summation to inline into the loop so vars can stay in registers without a store/reload. (Whether or not it's inline asm) – Peter Cordes Aug 01 '19 at 04:51
  • Correct. The running total is an input to fun(). The computational cost of my current Kahan implementation accounts for 1-25% of the total running time (it depends on the task). I think you are right about keeping the vars in registers, but this might force me to convert other parts of the code in asm to properly take advantage of these registers. – Egeris Aug 01 '19 at 05:16
  • @Egeris: no, not if you use GNU C Extended asm with inputs and outputs in `"x"` operands like I said in my answer. If you have conflicting fast-math requirements within the same loop, that's your best bet for efficiency without actually writing *everything* in asm. – Peter Cordes Aug 01 '19 at 05:21
  • @Egeris: `csum` is a tiny function, much smaller than the out-of-order execution window of current mainstream x86 CPUs (like 10 uops + call/ret overhead out of Skylake's 224-uop ROB / 97-uop scheduler). It makes no sense to say that it "takes" 25% of the total running time because the run time isn't a simple sum of parts. (Unless your entire program is totally bottlenecked on front-end uop throughput (4/clock) or back-end FMA-port execution-unit throughput (2/clock).) Maybe you mean your total program runs up to 25% slower with Kahan vs. simple summation? So probably a latency bottleneck. – Peter Cordes Aug 01 '19 at 05:57
  • When I said "running time" I was simply referring to the "iterations per second" counter in my application, which increases dramatically when disabling compensated summation, especially for small tasks. I have no estimates on latency. Also, it turns out that -ffast-math is grossly unnecessary in the summation file when I can use the other flags without breaking the Kahan arithmetic, leaving a very easy solution in favor of the DontUseInlineAsm doctrine. But since GCC has seemingly produced less-than-optimal code (as you confirmed), I will have to rely on the benchmark/counter. – Egeris Aug 01 '19 at 07:11
  • @Egeris: So you're talking about comparing two versions of your program (with kahan vs. simple summation), *not* trying to profile within the application and account time to instructions. The right phrasing for that would be "my Kahan implementation slows down the total run time by 1-25% vs. simple summation". In the faster cases (like only 1%), the extra sum latency (and uop throughput required) is probably hidden by out-of-order exec. – Peter Cordes Aug 01 '19 at 07:18
  • re: sub-optimal: 1. that's not rare for gcc, there are normally minor missed optimizations all over the place. An extra `movapd` usually doesn't hurt much, especially on recent mainstream x86 CPUs where mov-elimination runs it with zero latency. 2. that's only the stand-alone version. You haven't shown the code GCC makes after inlining `csum` into a loop. – Peter Cordes Aug 01 '19 at 07:20
  • I apologize for being unclear, and there was no way you could know I was talking about the iteration rate. The 1-25% is a rough estimate ranging from heavy tasks (1%) to light tasks (25%) where the cost of fun() instructions is comparable to csum. After inlining, the first and last movsd (rcx) instructions are cut off (added to post), but the 3 movapd are still there. I did not check clang yet. – Egeris Aug 01 '19 at 08:11
  • @Egeris: you can get rid of the `movapd` instructions by compiling with `-march=native` on a CPU that has AVX. – Peter Cordes Aug 01 '19 at 08:47
  • Not only that, the compiled code consistently performs better for certain numerical tasks. I would worry about it running on other computers though. – Egeris Aug 01 '19 at 09:10
  • @Egeris: well yeah, the point of `-march=native` is to make binaries you *don't* plan to distribute to other machines. For that you should use `-march=nehalem -mtune=haswell` or some variation on that: enable ISAs for some baseline level of extensions but tune for a modern CPU. (Or `-mtune=generic` in case `-mtune=haswell` does anything really bad for AMD). But without runtime CPU dispatching, compiling with `-O3 -flto -march=native` and PGO separately on every machine should give best results. (Also try with clang; it unrolls nicely by default). AVX helps even without auto-vectorization. – Peter Cordes Aug 01 '19 at 09:15
  • What does GCC even mean by "not suitable in production code"? Are optimization attributes some unreliable bug-filled swamp that they have decided not to care about? – harold Aug 01 '19 at 13:02