
Using a given species of floating-point numbers, say float16, it is straightforward to construct sums with totally wrong results. For example, using Python/numpy:

import numpy as np

one = np.float16(1)
ope = np.nextafter(one, one+one)  # smallest float16 greater than 1

np.array((ope,one,-one,-one)).cumsum()
# array([1.001, 2.   , 1.   , 0.   ], dtype=float16)

Here we have used cumsum to force naive summation. Left to its own devices, numpy would have used a different order of summation, yielding a better answer:

np.array((ope,one,-one,-one)).sum()
# 0.000977

The above is based on cancellation. To rule out this class of examples, let us allow only nonnegative terms. For naive summation it is still easy to give examples with very wrong sums. The following sums 10^4 identical terms, each equal to 10^-4:

np.full(10**4,10**-4,np.float16).cumsum()
# array([1.0e-04, 2.0e-04, 3.0e-04, ..., 2.5e-01, 2.5e-01, 2.5e-01],
#       dtype=float16)

The last term is off by a factor of 4: once the running sum reaches 0.25, the ulp there is 2^-12 ≈ 2.4x10^-4, so adding 10^-4 (less than half an ulp) no longer changes the sum, and the accumulation stalls.

Again, allowing numpy to use pairwise summation gives a much better result:

np.full(10**4,10**-4,np.float16).sum()
# 1.0

It is possible to construct sums that beat pairwise summation. Choosing eps below the resolution at 1 (so that 1 + eps rounds back to 1), we can use 1, eps, 0, eps, 3x0, eps, 7x0, eps, 15x0, eps, ..., but this involves an insane number of terms.
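
For illustration, here is a small sketch of that construction. The helper pairwise is mine, a textbook recursive pairwise sum; numpy's own sum uses a blocked variant of pairwise summation, so we spell the recursion out:

import numpy as np

def pairwise(a):
    # textbook recursive pairwise summation; len(a) assumed a power of two
    if len(a) == 1:
        return a[0]
    m = len(a) // 2
    return pairwise(a[:m]) + pairwise(a[m:])  # float16 + float16 rounds to float16

one, zero = np.float16(1), np.float16(0)
eps = np.float16(2.0**-11)  # half an ulp at 1: one + eps rounds back to one

terms = [one, eps, zero, eps] + [zero]*3 + [eps] + [zero]*7 + [eps]
pairwise(terms)               # 1.0 -- every eps is rounded away
sum(float(t) for t in terms)  # 1.001953125, the true sum

To reach a factor of 2 this way, the eps terms would have to add up to the leading 1, which takes about 2^11 of them at exponentially growing positions, i.e. on the order of 2^(2^11) terms in total.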

My question: using float16 and only nonnegative terms, how many terms are required for pairwise summation to produce a result that is off by at least a factor of 2?

Bonus: Same question with "positive" instead of "nonnegative". Is it even possible?

Paul Panzer
  • Before you ask "how many terms are required", I think you should ask whether it's even *possible* to get a result that's off by that amount. Assuming that overflow to infinity is ruled out, I think it may be impossible. (And then the next question would be whether one can put an absolute bound on the ulps error, assuming overflow is avoided.) – Mark Dickinson Sep 15 '19 at 12:14
  • Ah, nice. Insane indeed. – Mark Dickinson Sep 15 '19 at 12:24
  • @MarkDickinson just came to me that 0 is not positive so that doesn't count and you may still be right. – Paul Panzer Sep 15 '19 at 12:25
  • Huh. Good point. So if every term is positive, then every term is at least 2**-24 in absolute value, and every pairwise sum of those terms is at least 2**-23 (even allowing for rounding), and every pairwise sum of *those* sums is at least 2**-22, and so on. My suspicion is that it's going to be hard to engineer a big relative error before hitting overflow. (And everything's going to be exact until we get over 2**-14.) – Mark Dickinson Sep 15 '19 at 12:27
  • @MarkDickinson Yep, I think that kills it. I've engineered the question. I would appreciate your making a partial answer from your musings. – Paul Panzer Sep 15 '19 at 12:35
  • Slightly related to this question: the _catastrophic cancellation_ can be avoided when doing Knuth Summations. This essentially keeps track of the loss of precision mimicking a sum as if you had twice the precision. (See [Accurate summation, dot product and polynomial evaluation in complex floating-point arithmetic](https://doi.org/10.1016/j.ic.2011.09.003)) – kvantour Sep 19 '19 at 21:12
  • @kvantour Thanks! Actually, python's own `math.fsum` gives the most accurate fp representable answer. – Paul Panzer Sep 19 '19 at 22:09
  • FYI, I started implementing the DP. – David Eisenstat Sep 25 '19 at 15:27
  • Depth 1432 (so 2^1432 terms) suffices for the true sum to exceed the computed sum by a factor of two. – David Eisenstat Sep 26 '19 at 12:14
  • @DavidEisenstat Beautifully done! Can you say anything wrt what the "optimal" sum looks like? (I'm not C++ literate enough to read your code.) – Paul Panzer Sep 26 '19 at 12:29
  • @PaulPanzer That's a whole other operation :P. I'd have to edit the code to track the paths that it used, then figure out how to make sense of a tree with that many nodes. I starred this question, maybe I'll have some time soon. – David Eisenstat Sep 26 '19 at 12:42

3 Answers


Depth 1432 (so 2^1432 terms) suffices for the true sum to exceed the computed sum by a factor of two.

I had an idea for how to determine the number of terms needed, to within a factor of two.

We use dynamic programming to answer the following question: given a depth d and a target floating point sum s, what is the largest true sum of 2^d nonnegative float16s with pairwise sum s?

Let that quantity be T(d, s). We get a recurrence

T(0, s) = s                                                    for all s,
T(d, s) = max { T(d-1, a) + T(d-1, b) : float16(a + b) = s }   for all d ≥ 1 and all s.

Each step of the recurrence will involve looping over about 2^29 combinations (since we can assume a ≤ b, and negative floats and special values are off limits), and the depth required won't exceed 10^4 or so by Hans's and your answer. Seems feasible to me.

DP code:

#include <algorithm>
#include <cstdio>
#include <vector>

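// Float16 holds a float16 bit pattern; Fixed holds the exact value of a
// nonnegative float16 as an integer multiple of 2^-24 (the smallest subnormal).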
using Float16 = int;
using Fixed = unsigned long long;

static constexpr int kExponentBits = 5;
static constexpr int kFractionBits = 10;
static constexpr Float16 kInfinity = ((1 << kExponentBits) - 1)
                                     << kFractionBits;

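// Returns the exact value of the float16 bit pattern a, in units of 2^-24.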
Fixed FixedFromFloat16(Float16 a) {
  int exponent = a >> kFractionBits;
  if (exponent == 0) {
    return a;
  }
  Float16 fraction = a - (exponent << kFractionBits);
  Float16 significand = (1 << kFractionBits) + fraction;
  return static_cast<Fixed>(significand) << (exponent - 1);
}

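// Sets *c to a + b rounded to nearest float16 (ties to even).
// Returns false if the sum overflows to infinity.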
bool Plus(Float16 a, Float16 b, Float16* c) {
  Fixed exact_sum = FixedFromFloat16(a) + FixedFromFloat16(b);
  int exponent = 64 - kFractionBits - __builtin_clzll(exact_sum);
  if (exponent <= 0) {
    *c = static_cast<Float16>(exact_sum);
    return true;
  }
  Fixed ulp = Fixed{1} << (exponent - 1);
  Fixed remainder = exact_sum & (ulp - 1);
  Fixed rounded_sum = exact_sum - remainder;
  if (2 * remainder > ulp ||
      (2 * remainder == ulp && (rounded_sum & ulp) != 0)) {
    rounded_sum += ulp;
  }
  exponent = 64 - kFractionBits - __builtin_clzll(rounded_sum);
  if (exponent >= (1 << kExponentBits) - 1) {
    return false;
  }
  Float16 significand = rounded_sum >> (exponent - 1);
  Float16 fraction = significand - (Float16{1} << kFractionBits);
  *c = (exponent << kFractionBits) + fraction;
  return true;
}

int main() {
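  // greatest0[s] holds T(d-1, s): the largest true sum (in Fixed units)
  // achievable by a depth-(d-1) pairwise summation whose computed result is s.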
  std::vector<Fixed> greatest0(kInfinity);
  for (Float16 a = 0; a < kInfinity; a++) {
    greatest0[a] = FixedFromFloat16(a);
  }
  for (int depth = 1; true; depth++) {
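    // Start from a copy: pairing with a zero term gives T(d, s) >= T(d-1, s).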
    auto greatest1 = greatest0;
    for (Float16 a = 1; a < kInfinity; a++) {
      Fixed greatest0_a = greatest0[a];
      for (Float16 b = a; b < kInfinity; b++) {
        Float16 c;
        if (!Plus(a, b, &c)) {
          continue;
        }
        Fixed& value = greatest1[c];
        value = std::max(value, greatest0_a + greatest0[b]);
      }
    }

    std::vector<double> ratios;
    ratios.reserve(kInfinity - 1);
    for (Float16 a = 1; a < kInfinity; a++) {
      ratios.push_back(greatest1[a] / static_cast<double>(FixedFromFloat16(a)));
    }
    std::printf("depth %d, ratio = %.17g\n", depth,
                *std::max_element(ratios.begin(), ratios.end()));
    greatest0.swap(greatest1);
  }
}

I'll run this and post an update when it's done.

David Eisenstat

It would take such a large number of terms that it's effectively impossible (if zeros are allowed) or actually impossible (if zeros are not allowed, due to overflow). Wikipedia summarizes some error bounds due to Nicholas Higham. Since all of the terms are nonnegative, the condition number is 1, hence the relative error for n terms is bounded as |E_n|/|S_n| ≤ ε log2(n) / (1 − ε log2(n)), where ε is the machine epsilon. To be off by a factor of two, we would need |E_n| ≥ |S_n|, which is possible only if ε log2(n) ≥ 1/2, which is equivalent to n ≥ 2^(1/(2ε)) = 2^1024 for float16.
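
A quick check of the last step (taking ε to be the float16 unit roundoff 2^-11), in Python:

eps = 2.0**-11        # float16 unit roundoff (11 significand bits)
# the bound eps*log2(n) / (1 - eps*log2(n)) reaches 1 when eps*log2(n) = 1/2
print(1 / (2 * eps))  # 1024.0, so log2(n) >= 1024, i.e. n >= 2**1024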

David Eisenstat

The remaining issue is whether the bound is sharp, i.e. whether you can actually get a relative error of 2 from pairwise summation if you allow zeros in the sum (*).

The simple answer is yes: pad a sequence that is bad for cumsum with an exponentially growing number of zeros, as follows (where a1, a2, a3, ..., an is problematic for normal summation):

a1,
a2,
a3, 0,
a4, 0, 0, 0,
a5, 0, 0, 0, 0, 0, 0, 0,
a6, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
...

It generates the same sum, with the same round-off error, under pairwise summation, and you "only" need 2**(n-1) terms instead of n. Thus, since 10**4 terms can produce a factor of 4 for normal summation, 2**(10**4-1) terms can produce a factor of 4 for pairwise summation.
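
Here is a small sketch of that padding (the helpers pairwise and pad_for_pairwise are mine; pairwise is a textbook recursive pairwise sum):

import numpy as np

def pairwise(a):
    # textbook recursive pairwise summation; len(a) assumed a power of two
    if len(a) == 1:
        return a[0]
    m = len(a) // 2
    return pairwise(a[:m]) + pairwise(a[m:])

def pad_for_pairwise(a):
    # interleave zeros so that the pairwise tree performs exactly
    # the left-to-right additions of naive summation
    out = list(a[:2])
    for term in a[2:]:
        out = out + [term] + [np.float16(0)] * (len(out) - 1)  # doubles the length
    return out

eps = np.float16(2.0**-11)               # half an ulp at 1
a = [np.float16(1), eps, eps, eps, eps]  # naive float16 summation stalls at 1.0
padded = pad_for_pairwise(a)             # 2**4 = 16 terms
pairwise(padded)                         # 1.0, reproducing the naive result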

*: The answer by David Eisenstat shows that with zeros disallowed the sum will overflow before getting that problematic. (I assume that the pairwise summation recurses all the way down.)

Hans Olsson
  • Thank you for your answer. I'm afraid, though, that as it stands it offers nothing new. The construction you propose is the same as the one already described in the question (last paragraph but two, _"It is possible to construct sums ..."_) except that the latter makes the obvious saving of cutting the initial ramp-up. What I was hoping for were better bounds, i.e. a lower bound larger than the one David Eisenstat provided and/or an upper bound lower than the zero fill gives (or a zero fill with better nonzeros). – Paul Panzer Sep 23 '19 at 22:55