std::tan() extremely slow after updating glibc

Question

I have a C++ program that calls lots of trig functions. It has been running fine for more than a year. I recently installed gcc-4.8 and, in the same go, updated glibc. This slowed my program down by a factor of almost 1000. Using gdb I discovered that the cause of the slowdown was a call to std::tan(). When the argument is either pi or pi/2, the function takes very long to return.

Here's an MWE that reproduces the problem if compiled without optimization (the real program has the same problem both with and without the -O2 flag).

#include <cmath>

int main() {
    double pi = 3.141592653589793;
    double approxPi = 3.14159;
    double ret = 0.;

    for(int i = 0; i < 100000; ++i) ret = std::tan(pi); //Very slow
    for(int i = 0; i < 100000; ++i) ret = std::tan(approxPi); //Not slow
}

Here's a sample backtrace from gdb (obtained after interrupting the program randomly with Ctrl+c). Starting from the call to tan, the backtrace is the same in the MWE and my real program.

#0  0x00007ffff7b1d048 in __mul (p=32, z=0x7fffffffc740, y=0x7fffffffcb30, x=0x7fffffffc890) at ../sysdeps/ieee754/dbl-64/mpa.c:458
#1  __mul (x=0x7fffffffc890, y=0x7fffffffcb30, z=0x7fffffffc740, p=32) at ../sysdeps/ieee754/dbl-64/mpa.c:443
#2  0x00007ffff7b1e348 in cc32 (p=32, y=0x7fffffffc4a0, x=0x7fffffffbf60) at ../sysdeps/ieee754/dbl-64/sincos32.c:111
#3  __c32 (x=<optimized out>, y=0x7fffffffcf50, z=0x7fffffffd0a0, p=32) at ../sysdeps/ieee754/dbl-64/sincos32.c:128
#4  0x00007ffff7b1e170 in __mptan (x=<optimized out>, mpy=0x7fffffffd690, p=32) at ../sysdeps/ieee754/dbl-64/mptan.c:57
#5  0x00007ffff7b45b46 in tanMp (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:503
#6  __tan_avx (x=<optimized out>) at ../sysdeps/ieee754/dbl-64/s_tan.c:488
#7  0x00000000004005b8 in main ()

I've tried running the code (both the MWE and the real program) on four different systems. Two of them are in clusters where I run my code. Two are my laptops. The MWE runs without issues on one of the clusters and one laptop. I checked which version of libm.so.6 each system uses in case that's relevant. The following list shows the system description (taken from cat /etc/*-release), whether the CPU is 32 or 64 bit, whether the MWE is slow, and finally the output of running /lib/libc.so.6 and cat /proc/cpuinfo.

SUSE Linux Enterprise Server 11 (x86_64), 64 bit, using libm-2.11.1.so (MWE is fast)

GNU C Library stable release version 2.11.1 (20100118), by Roland McGrath et al.
Copyright (C) 2009 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Configured for x86_64-suse-linux.
Compiled by GNU CC version 4.3.4 [gcc-4_3-branch revision 152973].
Compiled on a Linux 2.6.32 system on 2012-04-12.
Available extensions:
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
microcode       : 53
cpu MHz         : 1200.000
cache size      : 30720 KB
physical id     : 0
siblings        : 24
core id         : 0
cpu cores       : 12
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 15
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm ida arat epb xsaveopt pln pts dts tpr_shadow vnmi flexpriority ept vpid fsgsbase bmi1 avx2 smep bmi2 erms invpcid
bogomips        : 5000.05
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

CentOS release 6.7 (Final), 64 bit, using libm-2.12.so (MWE is slow)

GNU C Library stable release version 2.12, by Roland McGrath et al.
Copyright (C) 2010 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.4.7 20120313 (Red Hat 4.4.7-16).
Compiled on a Linux 2.6.32 system on 2015-09-22.
Available extensions:
        The C stubs add-on version 2.1.2.
        crypt add-on version 2.1 by Michael Glad and others
        GNU Libidn by Simon Josefsson
        Native POSIX Threads Library by Ulrich Drepper et al
        BIND-8.2.3-T5B
        RT using linux kernel aio
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 26
model name      : Intel(R) Xeon(R) CPU           E5507  @ 2.27GHz
stepping        : 5
cpu MHz         : 1596.000
cache size      : 4096 KB
physical id     : 0
siblings        : 4
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 11
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good xtopology nonstop_tsc aperfmperf pni dtes64 monitor ds_cpl vmx est tm2 ssse3 cx16 xtpr pdcm dca sse4_1 sse4_2 popcnt lahf_lm tpr_shadow vnmi flexpriority ept vpid
bogomips        : 4533.16
clflush size    : 64
cache_alignment : 64
address sizes   : 40 bits physical, 48 bits virtual
power management:

Ubuntu precise (12.04.5 LTS), 64 bit, using libm-2.15.so (my first laptop, MWE is slow)

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.15) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.79 system on 2016-05-26.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.

processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model       : 42
model name  : Intel(R) Core(TM) i7-2620M CPU @ 2.70GHz
stepping    : 7
microcode   : 0x1a
cpu MHz     : 800.000
cache size  : 4096 KB
physical id : 0
siblings    : 4
core id     : 0
cpu cores   : 2
apicid      : 0
initial apicid  : 0
fpu     : yes
fpu_exception   : yes
cpuid level : 13
wp      : yes
flags       : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat epb xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid
bogomips    : 5387.59
clflush size    : 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

Ubuntu precise (12.04.5 LTS), 32 bit, using libm-2.15.so (my second laptop, MWE is fast)

GNU C Library (Ubuntu EGLIBC 2.15-0ubuntu10.12) stable release version 2.15, by Roland McGrath et al.
Copyright (C) 2012 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE.
Compiled by GNU CC version 4.6.3.
Compiled on a Linux 3.2.68 system on 2015-03-26.
Available extensions:
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    Native POSIX Threads Library by Ulrich Drepper et al
    BIND-8.2.3-T5B
libc ABIs: UNIQUE IFUNC
For bug reporting instructions, please see:
<http://www.debian.org/Bugs/>.

processor    : 0
vendor_id    : GenuineIntel
cpu family    : 6
model        : 15
model name    : Intel(R) Core(TM)2 Duo CPU     T5800  @ 2.00GHz
stepping    : 13
microcode    : 0xa3
cpu MHz        : 800.000
cache size    : 2048 KB
physical id    : 0
siblings    : 2
core id        : 0
cpu cores    : 2
apicid        : 0
initial apicid    : 0
fdiv_bug    : no
hlt_bug        : no
f00f_bug    : no
coma_bug    : no
fpu        : yes
fpu_exception    : yes
cpuid level    : 10
wp        : yes
flags        : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe nx lm constant_tsc arch_perfmon pebs bts aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm lahf_lm dtherm
bogomips    : 3989.79
clflush size    : 64
cache_alignment    : 64
address sizes    : 36 bits physical, 48 bits virtual
power management:

I hope I have managed to provide sufficient background info. These are my questions.

Why did std::tan() turn slow?
Is there a way to restore it to normal speed?

I would very much prefer a solution that does not require installing/replacing a bunch of libraries. That might work on my laptop, but I don't have the necessary permissions on the cluster nodes.

Update #1: I removed my observation about passing constants to tan as it was explained by Sam Varshavchik. I added the output of running /lib/libc.so.6 to my system list. Also added a fourth system. As for timing, here's the output of running time ./mwe with the pi loop (approxPi commented out).

real    0m11.483s
user    0m11.465s
sys     0m0.004s

Here it is with the approxPi loop (pi commented out).

real    0m0.011s
user    0m0.008s
sys     0m0.000s

Update #2: For each system, added whether the CPU is 32 or 64 bit as well as the output of cat /proc/cpuinfo for the first core.

I have a doubt on your statements for the timing ("Very slow/not slow"). For the first 2 loops, you pass a double, so results *can not* be different ! I just can't imagine how the value of the bits would have an impact on timing... Could you explain how you do the timing of these loops ? Maybe add some timer before /after, to get something reliable? — kebs, May 28 '16 at 22:54
Passing a constant to tan() really results in the compiler performing the calculation at compile time. — Sam Varshavchik, May 28 '16 at 22:57
"Fast" and "slow" are not very descriptive..do you have any hard numbers using a profiling tool? Were you able to demonstrate that the issue was the libm version *on the same system*? How did you upgrade glibc, did you do it via your package manager or did you compile it from source (a big no-no). Further, are you suggesting the problem is related to GCC 4.8 or the version of libm? There are too many variables in the problem that I doubt you can get a useful answer. — uh oh somebody needs a pupper, May 28 '16 at 23:02
Please paste the full output of *running* `/lib/libc.so.6` on each system (glibc is magic, it is both a library and a program). — o11c, May 29 '16 at 00:59
@sleeptightpupper I updated the compiler (and glibc since the compiler depends on it) using `sudo add-apt-repository ppa:ubuntu-toolchain-r/test` and `sudo apt-get install gcc-4.8 g++-4.8`. This did _something_ to cause the problem. I don't think it's compiler-related; I get the same slowdown with gcc-4.4, gcc-4.6 and gcc-4.8. I added a fourth system (my other laptop) to my list. It uses libc-2.15.so but does not suffer from any slowdowns. If the problem is with libc, it is something other than a plain version number. Maybe there is a clue in the slight difference between the `libc.so.6` outputs. — herr_apa, May 29 '16 at 11:27
I have not demonstrated the issue with two libm versions on the same system. I dare not downgrade (everything depends on glibc, it would probably break my machine) and I'm not savvy enough to install two glibc versions in parallel. @others I have added timing info and edited my question as per your comments. — herr_apa, May 29 '16 at 11:30
could you please post result of `cat /proc/cpuinfo` for *good* and *bad* system? (For only one core, please) — Severin Pappadeux, May 30 '16 at 05:19
@SeverinPappadeux The "second laptop" is 32 bit while the other systems are 64 bit. Added this info together with `cat /proc/cpuinfo`. — herr_apa, May 30 '16 at 09:18
One thing worth considering is if you are compiling with vex encoding (e.g. with `-mavx` or `-march=native`). There can be a huge penalty going from AVX to SSE unless you zero the upper part of the register. Glibc is compiled with SSE. GCC should take care of this now but there are cases where it can still be an issue, particularly if you're using multiple threads. — Z boson, May 30 '16 at 10:53

score 1 · Answer 1 · answered May 29 '16 at 12:44

Accuracy for transcendental functions (things like trigonometric functions and exponentials) has always been a problem¹.

Why some trig function calls are slower than others

For most arguments to trigonometric functions, a fast approximation produces a highly accurate result. For certain arguments, however, the approximation can be drastically wrong, so more precise methods must be used, and these take much longer (as you've noticed).
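To see the per-argument cost difference directly, a small timing helper along these lines may help. This is only a sketch: `time_tan` is a hypothetical name, it assumes a C++11 compiler (`-std=c++11` on gcc-4.8), and the `volatile` variables are there so the compiler cannot evaluate the call at compile time.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>

// Hypothetical helper: run std::tan(x) `iterations` times and return the
// elapsed wall-clock time in seconds.
double time_tan(double x, int iterations) {
    assert(iterations > 0);
    // volatile forces a real library call on every iteration instead of a
    // compile-time constant or a hoisted result.
    volatile double arg = x;
    volatile double sink = 0.0;
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        sink = std::tan(arg);
    }
    const auto stop = std::chrono::steady_clock::now();
    (void)sink;  // result is intentionally unused
    return std::chrono::duration<double>(stop - start).count();
}
```

Calling `time_tan(3.141592653589793, 100000)` and `time_tan(3.14159, 100000)` from `main` and printing both results should show whether a given system takes the slow path for the first argument. The actual numbers depend on the glibc build and CPU.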

Why might the new library be slower now

For a long time Intel made misleading claims about the accuracy of its float versions of trigonometric functions, saying they were much more accurate than they really were². So much so that glibc used to just have sin(double) as a wrapper around fsin(float)³. You have likely upgraded to a version of glibc that has rectified this mistake. I can't speak for AMD's libm, but it is likely still relying on incorrect claims of accuracy around the float versions of the trigonometric functions⁴,⁵.

What to do

If you want speed and aren't too fussed about accuracy, then use the float version of tan (tanf, or std::tan with a float argument). Otherwise, if you need accuracy, you're stuck using the slower methods. The best you can do is cache the results of tan(pi) and tan(pi/2) and use the precomputed values when you think you might need them.
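The caching idea could look roughly like the sketch below. `CachedTan` is a hypothetical wrapper, not something glibc provides, and it assumes the slow arguments are exactly the doubles `3.141592653589793` and half of that, as in the question's MWE. Any other argument, even one that differs in the last bit, falls through to `std::tan`.

```cpp
#include <cmath>

// Hypothetical wrapper: pay for the slow tan() calls once at construction,
// then return the stored values whenever the exact same argument comes back.
class CachedTan {
public:
    CachedTan()
        : pi_(3.141592653589793),
          half_pi_(pi_ / 2.0),
          tan_pi_(std::tan(pi_)),
          tan_half_pi_(std::tan(half_pi_)) {}

    double operator()(double x) const {
        // Exact comparison is intentional: only bit-identical arguments
        // are served from the cache.
        if (x == pi_) return tan_pi_;
        if (x == half_pi_) return tan_half_pi_;
        return std::tan(x);
    }

private:
    // Declaration order matches the initializer list above.
    double pi_;
    double half_pi_;
    double tan_pi_;
    double tan_half_pi_;
};
```

With a single `CachedTan fast_tan;` created up front, calls such as `fast_tan(x)` can replace `std::tan(x)` in the hot loop. Whether this helps depends on how often the real program hits exactly these values.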

While it was nice to quote Bruce@randomascii, it is irrelevant to the issue at hand. From GDB stack trace, line #6, one can check that `__tan_avx` is called. It means *glibc* is configured to use SSE2 (plus AVX, plus AVX2 if applicable) unit for FP math. SSE2 unit does NOT have `fsin` or similar instructions, only IEEE mandated +,-,*,/ and `sqrt`. Trigonometry is typically done via range reduction and Padé approximation. — Severin Pappadeux, May 29 '16 at 16:20
Well, all fair and fine, but there's not much to approximate for tan(0) = 0, tan(±π) = 0, tan(→±π/2) → ±∞ (sign depending on from which side you approach the limit). — datenwolf, May 30 '16 at 09:36
