5

I wrote two sample functions:

source.ll:

define i32 @bleh(i32 %x) {
entry:
  %addtmp = add i32 %x, %x
  %addtmp1 = add i32 %addtmp, %x
  %addtmp2 = add i32 %addtmp1, %x
  %addtmp3 = add i32 %addtmp2, %x
  %addtmp4 = add i32 %addtmp3, 1
  %addtmp5 = add i32 %addtmp4, 2
  %addtmp6 = add i32 %addtmp5, 3
  %multmp = mul i32 %x, 3
  %addtmp7 = add i32 %addtmp6, %multmp
  ret i32 %addtmp7
}

source-fp.ll:

define double @bleh(double %x) {
entry:
  %addtmp = fadd double %x, %x
  %addtmp1 = fadd double %addtmp, %x
  %addtmp2 = fadd double %addtmp1, %x
  %addtmp3 = fadd double %addtmp2, %x
  %addtmp4 = fadd double %addtmp3, 1.000000e+00
  %addtmp5 = fadd double %addtmp4, 2.000000e+00
  %addtmp6 = fadd double %addtmp5, 3.000000e+00
  %multmp = fmul double %x, 3.000000e+00
  %addtmp7 = fadd double %addtmp6, %multmp
  ret double %addtmp7
}

Why is it that when I optimize both functions using

opt -O3 source[-fp].ll -o opt.source[-fp].ll -S

the i32 one gets optimized but the double one doesn't? I expected the fadds to be combined into a single fmul. Instead the double output looks exactly the same as the input.
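For reference, opt -O3 does fold the whole i32 function down to two instructions (algebraically the function is 8*x + 6); what I get looks roughly like this (temporary names may vary):

define i32 @bleh(i32 %x) {
entry:
  %0 = shl i32 %x, 3      ; 8*x
  %1 = add i32 %0, 6
  ret i32 %1
}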

Is it due to the flags being set differently? I am aware of certain optimizations that are possible for i32 that are not doable for double. But the absence of simple constant folding is beyond my understanding.

I am using LLVM 3.1.

f00id
  • Highly relevant, though I'm not sure if it's a duplicate: [Why doesn't GCC optimize a*a*a*a*a*a to (a*a*a)*(a*a*a)?](http://stackoverflow.com/q/6430448/395760) – delan Aug 13 '12 at 21:33
  • @delan This, like a ton of similar floating-point questions, is indeed a duplicate of that. The answer is the same even if the details of the question vary. Any good answer to this question would point out the non-associativity of floating-point arithmetic and mention -ffast-math, just like the accepted answer of that question does. – Pascal Cuoq Aug 13 '12 at 21:41
  • Thank you both. An answer to the linked question brought up http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html and emphasizes its section on ambiguity. – f00id Aug 13 '12 at 21:44

1 Answer

7

It's not quite true to say that no optimization is possible. I'll go through the first few lines to show where transformations are and are not allowed:

  %addtmp = fadd double %x, %x

This first line could safely be transformed to fmul double %x, 2.0e+0, but that's not actually an optimization on most architectures (fadd is generally as fast as, or faster than, fmul, and doesn't require producing the constant 2.0). Note that barring overflow, this operation is exact (like all scaling by powers of two).
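In IR, that legal-but-unprofitable rewrite would look like this (%twice is a hypothetical name):

  %twice = fmul double %x, 2.000000e+00   ; same value as %addtmp: scaling by a power of two is exact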

  %addtmp1 = fadd double %addtmp, %x

This line could be transformed to fmul double %x, 3.0e+0. Why is this a legal transformation? Because the computation that produced %addtmp was exact, only a single rounding is incurred whether this is computed as x * 3 or as x + x + x. Because these are IEEE-754 basic operations, and therefore correctly rounded, the result is the same either way. What about overflow? Neither form can overflow unless the other does as well.
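So the first two adds could, in principle, be collapsed into a single multiply (%x3 is a hypothetical name):

  %x3 = fmul double %x, 3.000000e+00   ; correctly rounded 3*x, identical to fl((x + x) + x)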

  %addtmp2 = fadd double %addtmp1, %x

This is the first line that cannot be legally transformed into constant * x. 4 * x would compute exactly, without any rounding, whereas x + x + x + x incurs two roundings: x + x + x is rounded once, then adding x may round a second time.
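Spelling the roundings out on the listing from the question:

  %addtmp = fadd double %x, %x          ; 2x, exact
  %addtmp1 = fadd double %addtmp, %x    ; fl(3x), one rounding
  %addtmp2 = fadd double %addtmp1, %x   ; fl(fl(3x) + x), possibly a second rounding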

  %addtmp3 = fadd double %addtmp2, %x

Ditto here; 5 * x would incur one rounding; x + x + x + x + x incurs three.

The only transform that might be beneficial, then, is replacing x + x + x with 3 * x. However, the subexpression x + x is already present elsewhere, so an optimizer could easily choose not to employ this transform (since it can take advantage of the existing partial result if it does not).
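For completeness, since -ffast-math came up in the comments: if you are willing to license reassociation, LLVM releases newer than your 3.1 added per-instruction fast-math flags (which clang's -ffast-math sets), and with those the whole chain folds. A sketch of what the fast-math version could optimize to, assuming the flags are present on every fadd/fmul:

define double @bleh(double %x) {
entry:
  %t = fmul fast double %x, 8.000000e+00   ; algebraically the function is 8*x + 6
  %r = fadd fast double %t, 6.000000e+00
  ret double %r
}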

Stephen Canon
  • Thank you for that detailed answer. So doing a `fmul double %x, 2.0e+0` and constant-propagating the result is still slower than repeated `fadd`s? Or am I missing rounding issues? – f00id Aug 13 '12 at 22:05
  • `2*x` and `x+x` are numerically equivalent; however, on common architectures, `x+x` is no slower than (and sometimes faster than) `2*x`, so there's generally no reason for an optimizer to use `2*x`. – Stephen Canon Aug 13 '12 at 22:10
  • But calculating `x+x` twice (and then adding `x`) seems slower than doing something like `%y1 = fadd double %x, %x`, `%y2 = fadd double %y1, %y1`, `%res = fadd double %y2, %x`. I didn't want to emphasize `fmul` which was probably misleading. – f00id Aug 13 '12 at 22:41