18

I am aware of a similar question, but I want to ask for people's opinions on my algorithm for summing floating-point numbers as accurately as possible at a practical cost.

Here is my first solution:

put all numbers into a min-absolute-heap. // EDIT: as suggested in the comments below
pop the 2 smallest ones.
add them.
put the result back into the heap.
continue until there is only 1 number in the heap.

This one would take O(n·log n) instead of the normal O(n). Is that really worth it?
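For concreteness, here is a rough C++ sketch of that idea (the heap is keyed on absolute value; the function name and use of `std::priority_queue` are only illustrative, and it assumes a non-empty input):

#include <cmath>
#include <queue>
#include <vector>

// Sketch of the min-absolute-heap summation; assumes xs is non-empty.
double heap_sum(const std::vector<double> &xs)
{
    auto absGreater = [](double a, double b) { return std::fabs(a) > std::fabs(b); };
    std::priority_queue<double, std::vector<double>, decltype(absGreater)>
        heap(absGreater, std::vector<double>(xs.begin(), xs.end()));

    while (heap.size() > 1) {
        double a = heap.top(); heap.pop();   // the two smallest magnitudes
        double b = heap.top(); heap.pop();
        heap.push(a + b);                    // O(log n) per push
    }
    return heap.top();
}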

The second solution comes from the characteristics of the data I'm working on. It is a huge list of positive numbers of a similar order of magnitude.

a[size]; // contains the numbers, starting at index 0
for(step = 1; step < size; step<<=1) {
    for(i = step-1; i+step < size; i += 2*step)
        a[i+step] += a[i];   // merge pairs that are `step` apart
    if(i < size-1)           // an unpaired partial sum was left behind this round
        a[size-1] += a[i];
}
// the total ends up in a[size-1]

The basic idea is to do the sum in a 'binary tree' fashion.

Note: it's pseudo C code. `step <<= 1` means multiply step by 2. This one would take O(n). I feel like there might be a better approach. Can you recommend/criticize?

Apiwat Chantawibul
  • It seems that you are implicitly assuming the numbers to sum are positive. If they could be of different signs, a strategy would be something like “adding the number of smallest magnitude and of sign opposite to the current count, if possible”. – Pascal Cuoq Nov 16 '12 at 13:37
  • Oh, yes. I did assume that my numbers are positive. Editing the question now. Thanks. – Apiwat Chantawibul Nov 16 '12 at 13:44
  • are you assuming the numbers in your second example are sorted? – AShelly Nov 16 '12 at 13:47
  • The elements will be put into the heap in increasing order, so you can use two queues instead. This produces `O(n)` if the numbers are pre-sorted – John Dvorak Nov 16 '12 at 13:48
  • @AShelly No, I did not assume that the numbers are sorted. – Apiwat Chantawibul Nov 16 '12 at 13:50
  • @JanDvorak Though, my specific list is not pre-sorted. I'm still interested in how using 2 queues can make it O(n) in that case. Still can't imagine it. – Apiwat Chantawibul Nov 16 '12 at 13:55
  • When choosing algorithms consider the following set of numbers: `{DBL_MAX, 1, -DBL_MAX}`. If all your algorithm does is decide what order to sum the numbers in, then it gets the incorrect answer `0` unless it adds the two *large* numbers first, in which case it gets the correct answer `1`. So, your min-heap fails for that particular input, as for that matter do most heuristics for this job. Kahan gets it right, I think. – Steve Jessop Nov 16 '12 at 14:03
  • @Billiska posted as an answer. – John Dvorak Nov 16 '12 at 14:07
  • Your 2nd algorithm is also O(N lg N). It goes through all the elements, then half the elements, then half again.... It has the same runtime as this simpler equivalent: `j=size-1; while(j>0) for(i=0;i – AShelly Nov 16 '12 at 14:08
  • @AShelly My second algorithm is not O(N lg N) but O(N) because in the first 'step loop' it adds N/2 times, the second time it adds N/4 times, the third time it adds N/8 times, and so on – Apiwat Chantawibul Nov 16 '12 at 14:22
  • @AShelly: `n + n/2 + n/4 + n/8 + ... = 2*n` – hammar Nov 16 '12 at 14:26
  • oops. This is what happens when an aerospace engineer attempts to do computer science.. – AShelly Nov 16 '12 at 14:58
  • @hammar: well, might be `2*n` depending what order you add the terms ;-) – Steve Jessop Nov 17 '12 at 02:07

4 Answers

21

Kahan's summation algorithm is significantly more precise than straightforward summation, and it runs in O(n) (somewhere between 1 and 4 times slower than straightforward summation, depending on how fast floating-point arithmetic is compared to data access; definitely less than 4 times slower on desktop hardware, and without any shuffling around of the data).
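For reference, a minimal sketch of Kahan's compensated summation (the function name is only illustrative; the variables sum, c, y, t follow the usual formulation; note that value-unsafe optimizations such as -ffast-math can let the compiler optimize the compensation away):

#include <cstddef>

// Minimal sketch of Kahan (compensated) summation over an array.
double kahan_sum(const double *a, std::size_t n)
{
    double sum = 0.0;
    double c   = 0.0;               // running compensation for lost low-order bits
    for (std::size_t i = 0; i < n; ++i) {
        double y = a[i] - c;        // corrected next term
        double t = sum + y;         // low-order bits of y are lost here...
        c = (t - sum) - y;          // ...and recovered into c
        sum = t;
    }
    return sum;
}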


Alternatively, if you are using the usual x86 hardware, and if your compiler allows access to the 80-bit long double type, simply use the straightforward summation algorithm with an accumulator of type long double. Only convert the result to double at the very end.
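A sketch of that variant (the function name is illustrative; it assumes long double really is the 80-bit extended type on your platform, as with typical x86 gcc, but not with MSVC, where long double is the same as double):

#include <cstddef>

// Straightforward summation, accumulating in extended precision.
double long_double_sum(const double *a, std::size_t n)
{
    long double acc = 0.0L;
    for (std::size_t i = 0; i < n; ++i)
        acc += a[i];
    return (double)acc;             // convert back to double only at the end
}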


If you really need a lot of precision, you can combine the above two solutions by using long double for the variables c, y, t and sum in Kahan's summation algorithm.

Pascal Cuoq
  • Thank you for recommending Kahan's algorithm. Let me read it a bit and I'll come back to accept the answer. – Apiwat Chantawibul Nov 16 '12 at 13:56
  • How does Kahan summation with double precision internally compare to plain quad precision? – Martin Ueding Jan 01 '19 at 14:54
  • @MartinUeding It's possible to build sequences that sum more precisely with each of them. For an “ordinary” sequence of many values of the same sign and magnitude, quad-precision is slightly more precise, because the significand of quad-precision is slightly more than twice that of double-precision. – Pascal Cuoq Jan 01 '19 at 16:48
9

If you are concerned about reducing the numerical error in your summation then you may be interested in Kahan's algorithm.

High Performance Mark
2

My guess is that your binary decomposition will work almost as well as Kahan summation.

Here is an example to illustrate it:

#include <stdio.h>
#include <stdlib.h>
#include <algorithm>

void sumpair( float *a, float *b)
{
    volatile float sum = *a + *b;                     // rounded sum of the pair
    volatile float small = sum - std::max(*a,*b);     // what the smaller operand became in the sum
    volatile float residue = std::min(*a,*b) - small; // rounding error of the addition
    *a = sum;
    *b = residue;
}

void sumpairs( float *a, size_t size, size_t stride)
{
    if (size <= stride*2 ) {
        if( stride<size )
            sumpair( a , a+stride );   // base case: sum the two elements `stride` apart
    } else {
        size_t half = 1;
        while(half*2 < size) half*=2;  // split point: a power of two that is at least size/2
        sumpairs( a , half , stride );
        sumpairs( a+half , size-half , stride );
    }
}

void sumpairwise( float *a,size_t size )
{
    for(size_t stride=1;stride<size;stride*=2)
        sumpairs(a,size,stride);
}

int main()
{
    float data[10000000];
    size_t size= sizeof data/sizeof data[0];
    for(size_t i=0;i<size;i++) data[i]=((1<<30)*-1.0+random())/(1.0+random());

    float naive=0;
    for(size_t i=0;i<size;i++) naive+=data[i];
    printf("naive      sum=%.8g\n",naive);

    double dprec=0;
    for(size_t i=0;i<size;i++) dprec+=data[i];
    printf("dble prec  sum=%.8g\n",(float)dprec);

    sumpairwise( data , size );
    printf("1st approx sum=%.8g\n",data[0]);
    sumpairwise( data+1 , size-1);
    sumpairwise( data , 2 );
    printf("2nd approx sum=%.8g\n",data[0]);
    sumpairwise( data+2 , size-2);
    sumpairwise( data+1 , 2 );
    sumpairwise( data , 2 );
    printf("3rd approx sum=%.8g\n",data[0]);
    return 0;
}

I declared my operands volatile and compiled with -ffloat-store to avoid extra precision on x86 architectures:

g++  -ffloat-store  -Wl,-stack_size,0x20000000 test_sum.c

and get (0.03125 is 1 ULP here):

naive      sum=-373226.25
dble prec  sum=-373223.03
1st approx sum=-373223
2nd approx sum=-373223.06
3rd approx sum=-373223.06

This deserves a little explanation.

  • I first display the naive summation
  • Then double-precision summation (Kahan is roughly equivalent to that)
  • The 1st approximation is the same as your binary decomposition, except that I store the sum in data[0] and take care to store the residues. This way, the exact sum of the data before and after summation is unchanged
  • This enables me to approximate the error by summing the residues in a 2nd iteration in order to correct the 1st iteration (equivalent to applying Kahan to the binary summation)
  • By iterating further, I can refine the result further, and we see convergence
aka.nice
1

The elements will be put into the heap in increasing order, so you can use two queues instead. This produces O(n) if the numbers are pre-sorted.

This pseudocode produces the same results as your algorithm and runs in O(n) if the input is pre-sorted and the sorting algorithm detects that:

Queue<float> leaves = sort(arguments[0]).toQueue();
Queue<float> nodes = new Queue();

popAny = #(){
       if(leaves.length == 0) return nodes.pop();
  else if(nodes.length == 0) return leaves.pop();
  else if(leaves.top() > nodes.top()) return nodes.pop();
  else return leaves.pop();
}

while(leaves.length>0 || nodes.length>1) nodes.push(popAny()+popAny());

return nodes.pop();
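For what it's worth, here is a minimal C++ sketch of the same two-queue idea (the function name is only illustrative; it assumes the input is already sorted in ascending order, is non-empty, and contains non-negative values, so that the partial sums also come out in ascending order):

#include <queue>
#include <vector>

// Two-queue summation; `sorted` must be in ascending order.
// With non-negative inputs each new partial sum is >= the previous one,
// so `nodes` stays sorted and both pops are O(1) instead of a heap's O(log n).
float two_queue_sum(const std::vector<float> &sorted)
{
    std::queue<float> leaves, nodes;
    for (float x : sorted) leaves.push(x);

    auto popAny = [&]() {            // pop the smaller of the two queue fronts
        float v;
        if (leaves.empty())                      { v = nodes.front();  nodes.pop();  }
        else if (nodes.empty())                  { v = leaves.front(); leaves.pop(); }
        else if (leaves.front() > nodes.front()) { v = nodes.front();  nodes.pop();  }
        else                                     { v = leaves.front(); leaves.pop(); }
        return v;
    };

    while (leaves.size() + nodes.size() > 1)
        nodes.push(popAny() + popAny());

    return popAny();
}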
John Dvorak
  • I implemented both the Kahan summation algorithm and sort-then-sum using float (32-bit IEEE 754) and compared them to the result obtained using double (64-bit) to sum 1024 numbers selected randomly and uniformly from [0, 1). I ran a few dozen trials. In some, Kahan and sort returned the same value. In most, Kahan had less error. In none did sorting produce less error. – Eric Postpischil Nov 16 '12 at 15:00
  • @EricPostpischil I was replying to the asker's comment `I'm still interested in how using 2 queues can make it O(n) in that case. Still can't imagine it.` – John Dvorak Nov 16 '12 at 15:41
  • @JanDvorak On closer inspection: the maximum number of elements in the `node` queue can reach N/2 with this input = {k,k+1,k+2,...,2*k} where k is positive. Hence, your algorithm is O(N/2 lg N/2), which is the same as O(N lg N). – Apiwat Chantawibul Nov 16 '12 at 16:33
  • @Billiska Inserting an element into a queue is a constant time operation. Removing an element from a queue is a constant time operation. The while loop will run exactly `N-1` times. The while loop by itself is linear in the number of elements. The only non-linear step is the sorting step. – John Dvorak Nov 16 '12 at 17:08
  • Elaborating about the queue: If you don't care about the space, a queue is `{data=[], start=0, end=0, push=#(x){data[end++]=x}, pop=#(){if(start==end) die(); return data[start++]}}` – John Dvorak Nov 16 '12 at 17:15
  • A linked-list queue is also constant-time: `{head:null, tail:null, push=#(x){if(tail){tail.next={data:x,next:null}; tail=tail.next}else{head=tail={data:x}, pop:#(){if(!head) die(); var r=head; head=head.next; if(!head) tail=null; return r};` – John Dvorak Nov 16 '12 at 17:29
  • I see. That's something we think of differently, because I think that the sum of the 2 smallest elements can still be large compared to the rest of the inputs. That's why in the original solution I put the addition result back into the 'heap', which is an O(lg N) operation. – Apiwat Chantawibul Nov 16 '12 at 18:16
  • That's why I suggest two queues. A queue has constant-time ops. – John Dvorak Nov 16 '12 at 18:19
  • At each step, I take either one of the original values or one of the new values, whichever is lower. The original values get sorted in a separate step and the new values get created in ascending order. – John Dvorak Nov 16 '12 at 18:21
  • Well, it's now a different algorithm; the order of summing will be different for the same input. But indeed, yours will do things in O(N). – Apiwat Chantawibul Nov 16 '12 at 18:22
  • @Billiska My algorithm computes the same sums as your algorithm. – John Dvorak Nov 16 '12 at 18:28
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/19652/discussion-between-billiska-and-jan-dvorak) – Apiwat Chantawibul Nov 16 '12 at 18:29