
When I was porting some Fortran code to C, it surprised me that most of the execution time discrepancy between the Fortran program compiled with ifort (the Intel Fortran compiler) and the C program compiled with gcc comes from the evaluation of trigonometric functions (sin, cos). It surprised me because I used to believe what this answer explains, namely that functions like sine and cosine are implemented in microcode inside microprocessors.

In order to spot the problem more explicitly, I made a small test program in Fortran:

program ftest
  implicit none
  real(8) :: x
  integer :: i
  x = 0d0
  do i = 1, 10000000
    x = cos (2d0 * x)
  end do
  write (*,*) x
end program ftest
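
For reference, here is a rough C equivalent of the same loop, sketched under the assumption that it is built with gcc against the system math library (glibc's libm); the file name ctest.c used below is arbitrary. It should exercise the same cos path that the gfortran build calls into:

#include <math.h>
#include <stdio.h>

/* Sketch of a C analogue of ftest.f90 (file name ctest.c is arbitrary):
   iterate x = cos(2*x) ten million times, so the runtime is dominated
   by the cos() calls. */
int main(void)
{
    double x = 0.0;
    int i;

    for (i = 0; i < 10000000; i++)
        x = cos(2.0 * x);

    printf("%.17g\n", x);
    return 0;
}

It can be built and timed in the same way, e.g. gcc -O2 -o ctest ctest.c -lm followed by time ./ctest.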

On an Intel Q6600 processor running 3.6.9-1-ARCH x86_64 Linux, I get with ifort version 12.1.0

$ ifort -o ftest ftest.f90 
$ time ./ftest
  -0.211417093282753     

real    0m0.280s
user    0m0.273s
sys     0m0.003s

while with gcc version 4.7.2 I get

$ gfortran -o ftest ftest.f90 
$ time ./ftest
  0.16184945593939115     

real    0m2.148s
user    0m2.090s
sys     0m0.003s

This is almost a factor of 10 difference! Can I still believe that the gcc implementation of cos is a wrapper around the microprocessor implementation, in a similar way as is probably done in the Intel implementation? If so, where is the bottleneck?

EDIT

According to the comments, enabling optimizations should improve the performance. My assumption was that optimizations do not affect library functions ... which does not mean that I don't use them in nontrivial programs. However, here are two additional benchmarks (now on my home computer, an Intel Core 2)

$ gfortran -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m2.993s
user    0m2.986s
sys     0m0.000s

and

$ gfortran -Ofast -march=native -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m2.967s
user    0m2.960s
sys     0m0.003s

Which particular optimizations did you (the commenters) have in mind? And how can the compiler exploit a multi-core processor in this particular example, where each iteration depends on the result of the previous one?

EDIT 2

The benchmark tests of Daniel Fischer and Ilmari Karonen made me think that the problem might be related to the particular version of gcc (4.7.2), and maybe to a particular build of it (Arch x86_64 Linux), that I am using on my computers. So I repeated the test on an Intel Core i7 box with Debian x86_64 Linux, gcc version 4.4.5 and ifort version 12.1.0

$ gfortran -O3 -o ftest ftest.f90
$ time ./ftest
  0.16184945593939115     

real    0m0.272s
user    0m0.268s
sys     0m0.004s

and

$ ifort -O3 -o ftest ftest.f90
$ time ./ftest
  -0.211417093282753     

real    0m0.178s
user    0m0.176s
sys     0m0.004s

For me this is a perfectly acceptable performance difference, which would never have made me ask this question. It seems that I will have to ask about this issue on the Arch Linux forums.

However, an explanation of the whole story is still very welcome.

  • Why is the computed value different? – chill Dec 14 '12 at 11:56
  • It is closer to x8, got a machine with 8 cores? Auto-vectorizing ought to play a role as well, but you disabled that. Benchmarking code without turning on the optimizer is a waste of your time. – Hans Passant Dec 14 '12 at 12:04
  • +1: Compile with optimization or go away. – ams Dec 14 '12 at 12:29
  • Also, tell gcc which core you want to target, otherwise it'll assume you want something ancient: `-mtune=core2` – ams Dec 14 '12 at 12:34
  • @chill It's a chaotic map: small numerical errors blow up exponentially fast. This is not a concern. – Benjamin Batistic Dec 14 '12 at 12:49 (see the short note after these comments)
  • @HansPassant @ams: the program can't be executed in parallel. I can't see how optimization would affect the execution of `cos`. Of course I tried with optimizations, but there is no difference. – Benjamin Batistic Dec 14 '12 at 12:56
  • The reason optimization might have mattered is that gcc will sometimes substitute library calls with other things (either other library calls, or inline code) when optimizing. – Catfish_Man Dec 14 '12 at 19:20
  • Without optimization, values will be stored and fetched from memory, while with optimization they're all in registers. Even with CPU cache, that alone might explain the difference... Try looking at asm output of compiler if you want to get to the bottom of this. – hyde Dec 14 '12 at 20:04
  • Re: your edit, that's odd. On my home computer (AMD Athlon X2 in 32bit mode), using `-Ofast` cuts the runtime down by about 25%. Also, I'm getting significantly lower runtimes than you (about 0.6 seconds) from GCC, even without optimization. – Ilmari Karonen Dec 14 '12 at 20:07
  • Ps. Looking at the disassembly, without optimization GCC seems to call a library function for cos(), whereas with `-Ofast` it compiles it to a single `fcos` instruction. – Ilmari Karonen Dec 14 '12 at 20:24
  • There seems to be a lot of negativity about this question but I think it is very interesting - I have also noticed extremely poor performance for `log` functions in gcc/gfortran compared to Intel or ACML (with all optimizations enabled). I just put it down to something you have to live with with gcc... – robince Dec 14 '12 at 20:59
  • @robince not a single downvote -> "a lot of negativity"?! If I didn't know any better, I'd guess you were new here. – sehe Dec 14 '12 at 21:15
  • The speed differences are inverted between gfortran and ifort here, but seem to mostly have to do with loop performance. http://stackoverflow.com/questions/8893192/puzzling-performance-difference-between-ifort-and-gfortran I'd retry the gfortran with -O3 and see if the time comes down. – Digikata Dec 14 '12 at 21:44
  • @IlmariKaronen I have checked the assembly code of the optimized and unoptimized GCC versions and it seems that both call the same cos function in my case. I disassembled the Intel version, which appears to be far more complex: 128908 lines of Intel asm vs 251 lines of gcc asm. I must admit that I am not able to read and understand asm... – Benjamin Batistic Dec 14 '12 at 21:52
  • The number of asm lines seems to point towards ifort possibly doing a loop unroll (and maybe inlining) optimization and gfortran doing unoptimized looping. iirc -Ofast doesn't call out anything but a different floating point handling, whereas an -O3 will apply many more optimizations, including loop unrolling. I'd recompile with gfortran using the -O3 flag and recheck the speeds. – Digikata Dec 14 '12 at 22:21
  • @sehe - well I was referring mainly to the vote to close and the tone of some of the first comments. Not too new here but generally I find it a pretty friendly and supportive environment. On topic - I really think it is not optimisation flags or loop unrolling. From my experience it is a self-evident fact that gcc trig functions / exp and log are significantly slower than ifort / acml in all cases (for me the proof of this is the performance increase that comes from replacing gcc log with acml log, with all other optimisations constant). I don't remember the numbers but for me it was a big increase. – robince Dec 14 '12 at 22:39
  • @robince I agree with the on topic part of the comment. – Benjamin Batistic Dec 14 '12 at 22:52
  • @Digikata, `ifort` doesn't perform any kind of loop unrolling. It translates the loop very literally and calls into its own optimised `cos` routine, which somehow manages to perform faster cosine computation than `fcos`. On a cursory glance, the disassembled `cos` routine looks like it uses tabulated argument reduction. GCC calls into the system-wide math lib, which implements something similar, but is not as optimised for Intel processors as `libimf`. – Hristo Iliev Dec 14 '12 at 23:06
  • My gfortran (gcc 4.6.2) takes 0.39 seconds without optimisations, 0.29 seconds with -O3. I don't have an ifort to compare, but 30-40 nanoseconds for a cosine don't seem excessively inefficient. – Daniel Fischer Dec 14 '12 at 23:34
  • @DanielFischer Can you be more specific about your system: CPU power, platform ... and gcc: did you install the precompiled binaries, and if yes, from which repository ...? Do you think I have a problem with the quality of my gcc (glibc) binaries, precompiled for Arch x86_64 Linux? – Benjamin Batistic Dec 15 '12 at 08:56
  • Core i5 2410M, 2310MHz, openSuSE 12.1, 64 bit, gcc from the distribution. Could be that Arch's gcc has been built suboptimally, but that's lower-level than my ken. – Daniel Fischer Dec 15 '12 at 09:26
  • @HristoIliev Is what you wrote consistent with the [explanation](http://stackoverflow.com/a/2284932/1809545) I am mentioning in my post? – Benjamin Batistic Dec 15 '12 at 12:57
  • The gcc math lib is slow; it depends what version of gcc you're using. – SpagnumMoss Dec 15 '12 at 15:28
  • @BenjaminBatistic, the explanation states that compilers use `fcos`. My impression is that the opposite happens: both GCC and `ifort` evade `fcos` and other x87 trigonometric instructions as much as possible and call into their own math libraries in 64-bit mode. Have to dig into this whenever time permits. – Hristo Iliev Dec 15 '12 at 22:49
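
A short back-of-the-envelope note on why the two compilers print completely different final values (expanding on the chaotic-map comment above): the iteration is x_{n+1} = cos(2 x_n), so a small perturbation d_n propagates roughly as d_{n+1} ≈ -2 sin(2 x_n) · d_n, and the factor |2 sin(2 x_n)| is often larger than 1. An initial rounding difference of the order of 1e-16 between two cos implementations is therefore amplified at most iterations and the two trajectories decorrelate completely well before the 10^7 iterations are over. The differing printed values are expected and say nothing by themselves about which library is more accurate.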

1 Answer


Most of this is due to differences in the math library. Some points to consider:

  • Yes, x86 processors with an x87 unit have fsin and fcos instructions. However, they are implemented in microcode, and there is no particular reason why they must be faster than a pure software implementation (see the sketch after this list).
  • GCC does not have its own math library, but rather uses the system-provided one. On Linux this is typically provided by glibc.
  • 32-bit x86 glibc uses fsin/fcos.
  • x86_64 glibc uses software implementations built on the SSE2 unit. For a long time this was a lot slower than the 32-bit glibc version, which just used the x87 instructions. However, improvements have (somewhat recently) been made, so depending on which glibc version you have, the situation might not be as bad as it used to be.
  • The Intel compiler suite is blessed with a VERY fast math library (libimf). Additionally, it includes vectorized transcendental math functions, which can often further speed up loops with these functions.
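
To make the first point concrete, the microcoded instruction and the library routine can be timed side by side. The following is a minimal micro-benchmark sketch, assuming gcc on x86/x86-64 so that the fcos instruction can be issued through GCC's extended inline assembly; the helper names fcos_x87 and run are made up for this sketch. It runs the same x = cos(2x) recurrence once through libm's cos() and once through a bare fcos:

#include <math.h>
#include <stdio.h>
#include <time.h>

#define N 10000000

/* Minimal sketch comparing the system libm cos() with the x87 fcos
   instruction. Helper names (fcos_x87, run) are made up for this example. */

/* cos via the x87 fcos instruction (x86 only, GCC extended asm).
   fcos is only defined for |x| < 2^63 and does its own range reduction
   with a 66-bit approximation of pi. */
static double fcos_x87(double x)
{
    double r;
    __asm__ ("fcos" : "=t" (r) : "0" (x));
    return r;
}

/* Run the x = f(2*x) recurrence N times through the given cosine. */
static double run(double (*f)(double))
{
    double x = 0.0;
    int i;

    for (i = 0; i < N; i++)
        x = f(2.0 * x);
    return x;
}

int main(void)
{
    clock_t t0, t1, t2;
    double a, b;

    t0 = clock();
    a = run(cos);        /* the system math library (glibc on Linux) */
    t1 = clock();
    b = run(fcos_x87);   /* the microcoded x87 instruction */
    t2 = clock();

    printf("libm cos : %.17g  (%.3f s)\n", a, (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("x87 fcos : %.17g  (%.3f s)\n", b, (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}

Build it with something like gcc -O2 bench.c -lm. Which of the two comes out ahead depends on the CPU and on the glibc version, which is exactly the point above: the microcoded instruction is not automatically faster (or more accurate) than a good software implementation.
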
janneb
  • `fsin` and `fcos` *are* faster than implementations in C, but they are [fairly imprecise](https://randomascii.wordpress.com/2014/10/09/intel-underestimates-error-bounds-by-1-3-quintillion/) as they are based on a wrong value of pi (sic!). – fuz Dec 13 '15 at 18:29
  • @fuz: It's not exactly a *wrong* value, but it's "only" 80-bit `long double` precision. (64-bit significand, and the next 2 digits of the exact value happen to be `0` so 66-bit significand). `fsin` range reduction uses Pi correctly rounded to an 80-bit `long double`. But yes, this is not good enough near `+-Pi` where it leads to catastrophic cancellation, but doing better with software requires extended precision. (e.g. `double double` I guess) – Peter Cordes Aug 06 '19 at 01:40
  • Also related: [Calling fsincos instruction in LLVM slower than calling libc sin/cos functions?](//stackoverflow.com/q/12485190) shows a case where software math libs are faster than `fsincos`, not slower. – Peter Cordes Aug 06 '19 at 01:47