
Let N be a compile-time unsigned integer.

GCC can optimize

unsigned sum = 0;
for(unsigned i=0; i<N; i++) sum += a; // a is an unsigned integer   

to simply a*N. This can be understood since unsigned arithmetic is modular: (a%k + b%k)%k = (a+b)%k, so repeated addition folds exactly into a single multiplication.

However GCC will not optimize

float sum = 0;
for(unsigned i=0; i<N; i++) sum += a;  // a is a float

to a*(float)N.

But by using associative math, e.g. with -Ofast, I discovered that GCC can reduce this sum in O(log2(N)) additions. E.g. for N=8 it can do the sum in three additions:

sum = a + a
sum = sum + sum // (a + a) + (a + a)
sum = sum + sum // ((a + a) + (a + a)) + ((a + a) + (a + a))

Though at some point after N=16 GCC goes back to doing N-1 sums.

My question is why does GCC not do a*(float)N with -Ofast?

Instead of being O(N) or O(log(N)) it could be simply O(1). Since N is known at compile time it's possible to determine if N fits in a float. And even if N is too large for a float it could do sum = a*(float)(N & 0x0000ffff) + a*(float)(N & 0xffff0000). In fact, I did a little test to check the accuracy and a*(float)N is more accurate anyway (see the code and results below).

//gcc -O3 foo.c
//don't use -Ofast or -ffast-math or -fassociative-math
#include <stdio.h>   
float sumf(float a, int n)
{
  float sum = 0;
  for(int i=0; i<n; i++) sum += a;
  return sum;
}

float sumf_kahan(float a, int n)
{
  float sum = 0;
  float c = 0;
  for(int i=0; i<n; i++) {
    float y = a - c;
    float t = sum + y;
    c = (t - sum) - y;
    sum = t;
  }
  return sum;
}  

float mulf(float a, int n)
{
  return a*n;
}  

int main(void)
{
  int n = 1<<24;
  float a = 3.14159;
  float t1 = sumf(a,n);
  float t2 = sumf_kahan(a,n);
  float t3 = mulf(a,n);
  printf("%f %f %f\n",t1,t2,t3);
}

The result is 61848396.000000 52707136.000000 52707136.000000, which shows that the multiplication and the Kahan summation agree. I think this shows that the multiplication is more accurate than the simple sum.

Z boson
  • Have you considered that three add ops may be faster than a floating point multiply? This would be consistent with it switching back to a fp multiply at N=16. – Bob Oct 15 '15 at 15:29
  • It doesn't make the optimization because the optimization is not valid; in general it produces a different result. Floating point arithmetic does not obey the usual normal arithmetic properties you expect like associativity or the distributive law. Multiplication is not repeated addition. – R.. GitHub STOP HELPING ICE Oct 15 '15 at 15:40
  • @R..: He (implicitly) uses `-ffast-math` which *does* allow those kind of optimizations. http://stackoverflow.com/questions/7420665/what-does-gccs-ffast-math-actually-do – Karoly Horvath Oct 15 '15 at 15:42
  • Oh, I missed that... :-( – R.. GitHub STOP HELPING ICE Oct 15 '15 at 15:50
  • You really like to push the limits of compilers don't you? :D That's okay, I do too, but more on the back-end. – Mysticial Oct 15 '15 at 16:03
  • Fun. I've found gcc *vectorizes* the sum when the "multiplier" is a function argument. – EOF Oct 15 '15 at 16:10
  • If `-Ofast` implies `-ffast-math`, then this question mixes `-ffast-math` and Kahan summation, which is not a good recipe (Kahan summation is the prototypal example of code that must not be compiled with non-compliant optimizations). – Pascal Cuoq Oct 15 '15 at 23:50
  • @PascalCuoq, in the test for accuracy I did not use `-Ofast` or associative math I just used `-O3` or no optimization at all. – Z boson Oct 16 '15 at 08:20
  • @Ian, the compiler never switches to multiplication. That's the point. But I do agree with you that for some small values the `O(log(n))` algorithm could be faster than the `O(1)` one. – Z boson Oct 16 '15 at 08:25
  • @Mysticial, I was inspired to ask this by another question on SO where the OP had a complicated formula which turned out to be a constant in the loop, which he did not realize. He wanted to optimize the loop. Doing `a*float(N)` would have been a lot faster and even more precise anyway. – Z boson Oct 16 '15 at 08:27
  • @PascalCuoq, I think I see where some confusion could be. I did not state the results of that code. It's `61848396.000000 52707136.000000 52707136.000000`. So the multiplication and the Kahan sum get the same result. That's what I meant by the multiplication being more accurate than the simple sum. – Z boson Oct 16 '15 at 10:46
  • FWIW, clang does this optimization under -ffast-math. Fascinatingly, GCC will do it if N is *exactly* 4 (http://goo.gl/ZcERor) ... this shouldn't require fast-math because x+x+x+x always equals 4*x in default rounding, but apparently they special-cased this operation only under fast-math. – Stephen Canon Oct 16 '15 at 12:53
  • @StephenCanon, woah, Clang wins again! Sorta, I tried N = 64 and it does mult but by N=128 Clang stops doing the mult. – Z boson Oct 16 '15 at 12:59
  • @Zboson: Yeah, it seems I'll be filing multiple clang bugs related to this. – Stephen Canon Oct 16 '15 at 13:01
  • @StephenCanon it's not obvious to me why "x+x+x+x always equals 4*x in default rounding" (this is why you have the floating-point tag and I don't). Are there other special cases like this? I thought sum(x) for powers of 2 would be the same as `2^n*x`, but when I tried that it did not work for some larger values of n. I guess I could ask a question about it but maybe the answer is obvious in one or two lines. – Z boson Oct 16 '15 at 13:51
  • @Zboson: It's highly non-obvious (even Kahan was shocked when I told him this fact, though he immediately understood why it was true once I told him). The simplest way to see that it's true is probably by exhaustively checking the rounding for all possible trailing bit patterns of x, if you're curious. A deeper proof can probably be reached by proving that (2^n - 1)x + x = 2^n x if 2^n-1 is representable, and then note that x + x + x = 3x. The property holds for 2, 3, 4, and 5; 6 is the first n for which it fails. – Stephen Canon Oct 16 '15 at 13:58
  • I'd suggest you send it to the GCC mailing list/bugtracker – EvgeniyZh Oct 25 '15 at 22:36
  • @EvgeniyZh, thank you for your suggestion. I submitted a bug report to GCC. This is my first bug submission. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68105 – Z boson Oct 26 '15 at 19:58
  • @EvgeniyZh, I'm not sure I reported the error to the right component. I said "c" "A problem with the C compiler front end". But I am not sure this is really a front end issue. Where should I put this "bug"? Should I use "middle-end" "GCC's middle end; folks, expand_*, etc. Target dependent parts and optimization passes have their own component"? – Z boson Oct 26 '15 at 20:21
  • @Zboson I believe the bug belongs in one of the optimization components: ipa, tree- or rtl-optimization. I'm not a pro in gcc's structure though, so I can't tell you for sure. I'd also suggest sending a question to the mailing list; that may attract the developers' attention to the problem – EvgeniyZh Oct 26 '15 at 20:37
  • float has only 24 bits of mantissa (including 1 implicit bit), but unsigned int has 32 value bits. From this one can conclude that once i > 2^24, the additions sum += a no longer have any effect, because at that point "a" is more than 24 bits below the current "sum" in the binary representation. That might be the reason – DU Jiaen Nov 13 '15 at 15:23

1 Answer


There is a fundamental difference between

 float funct( int N, float sum )
 {
     float value = 10.0f;
     for( int i = 0; i < N; i++ ) {
         sum += value;
     }
     return sum;
 }

and

float funct( int N, float sum )
{
    float value = 10.0;
    sum += value * N;
    return sum;
}

Once sum grows to about value / FLT_EPSILON, the repeated addition becomes a no-op: adding value no longer changes sum. So beyond that point, any larger N leaves sum unchanged under repeated addition. For the multiplication version, it is the single product value * N that would have to be smaller than FLT_EPSILON * sum for the final addition to become a no-op.

So the compiler can't make the optimization, because it can't tell whether you wanted the exact behavior (where the multiply is better) or the implemented behavior, where the scale of sum affects the result of each addition.

mksteve