I'm running simulations of DNA strands that involve lots of floating point number crunching. The full code is here: https://github.com/RoaldFre/DNA
I did some profiling with google-perftools after compiling with both gcc and clang. In both cases, google-perftools says that about 28% of the time is spent in fegetexcept(). This appears to be a function of the C library that queries the floating point exception flags of the CPU.
Note that I'm using -ffast-math with gcc, which -- if I'm not mistaken -- should ignore (all?) floating point exceptions! I'm also using -O4 with clang (is there a separate flag to enable unsafe floating point instruction?).
The profiling output for the binary compiled with clang:
Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
263 16.8% 44.9% 263 16.8% cos
224 14.3% 59.3% 224 14.3% Vdihedral
131 8.4% 67.6% 131 8.4% nearestImageVector
102 6.5% 74.2% 102 6.5% Fexclusion
70 4.5% 78.7% 70 4.5% Fdihedral
65 4.2% 82.8% 65 4.2% Fangle
53 3.4% 86.2% 53 3.4% integratorTaskTick
46 2.9% 89.2% 46 2.9% nearestImageDistance
45 2.9% 92.1% 45 2.9% mutiallyExclusivePairForces
32 2.0% 94.1% 32 2.0% FCoulomb
24 1.5% 95.6% 24 1.5% forEveryPairD
17 1.1% 96.7% 17 1.1% calculateForces
14 0.9% 97.6% 14 0.9% atan2
8 0.5% 98.1% 8 0.5% Fstack
8 0.5% 98.7% 8 0.5% pairWrapper
6 0.4% 99.0% 6 0.4% sin
3 0.2% 99.2% 3 0.2% __finite
3 0.2% 99.4% 3 0.2% _init
2 0.1% 99.6% 2 0.1% acos
2 0.1% 99.7% 2 0.1% exp
2 0.1% 99.8% 2 0.1% log
1 0.1% 99.9% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand
The profiling output for the binary compiled with gcc:
Total: 1561 samples
438 28.1% 28.1% 438 28.1% fegetexcept
352 22.5% 50.6% 352 22.5% nearestImageVector (inline)
263 16.8% 67.5% 263 16.8% cos
131 8.4% 75.8% 131 8.4% measurementTask
52 3.3% 79.2% 331 21.2% Vbond.isra.1.part.2 (inline)
50 3.2% 82.4% 562 36.0% dumpStatsSample.2372 (inline)
46 2.9% 85.3% 46 2.9% nearestImageVector
42 2.7% 88.0% 42 2.7% FdihedralParticle (inline)
33 2.1% 90.1% 59 3.8% Vstack.isra.12.part.13 (inline)
26 1.7% 91.8% 26 1.7% neighbourStackDistance2 (inline)
16 1.0% 92.8% 16 1.0% die
14 0.9% 93.7% 44 2.8% VangleP5SB (inline)
14 0.9% 94.6% 14 0.9% atan2
13 0.8% 95.5% 88 5.6% Vangle.isra.4.part.5 (inline)
12 0.8% 96.2% 343 22.0% Vbond.isra.1 (inline)
8 0.5% 96.7% 8 0.5% printUsage.3788
6 0.4% 97.1% 6 0.4% Fangle.part.8 (inline)
6 0.4% 97.5% 6 0.4% boxFromParticle (inline)
6 0.4% 97.9% 279 17.9% nearestImageDistance (inline)
6 0.4% 98.3% 6 0.4% sin
3 0.2% 98.5% 3 0.2% Vdihedral.isra.9 (inline)
3 0.2% 98.7% 3 0.2% __finite
3 0.2% 98.8% 3 0.2% _init
3 0.2% 99.0% 3 0.2% numParticles (inline)
2 0.1% 99.2% 2 0.1% acos
2 0.1% 99.3% 8 0.5% addToGrid
2 0.1% 99.4% 2 0.1% exp
2 0.1% 99.6% 2 0.1% getAngleBaseInfo (inline)
2 0.1% 99.7% 2 0.1% log
1 0.1% 99.7% 1 0.1% Fbond.part.6 (inline)
1 0.1% 99.8% 1 0.1% _IO_file_xsputn
1 0.1% 99.9% 563 36.1% dumpStatsSample.2372
1 0.1% 99.9% 1 0.1% log2
1 0.1% 100.0% 1 0.1% significand
0 0.0% 100.0% 6 0.4% Fangle (inline)
0 0.0% 100.0% 1 0.1% Fbond (inline)
0 0.0% 100.0% 42 2.7% Fdihedral.part.11 (inline)
0 0.0% 100.0% 4 0.3% Fstack.part.14 (inline)
0 0.0% 100.0% 53 3.4% calculateForces.2880
0 0.0% 100.0% 3 0.2% getKineticTemperature (inline)
0 0.0% 100.0% 4 0.3% nearestImageDistance2 (inline)
Now, I'm calling quite a bit of functions by function pointers, and gcc was compiling with lto and -O4. I have reason to believe that this possibly caused the profiling output for the gcc binary to be somewhat fluked. For instance, it says that there are 16 samples that are in 'die()'. However, this is impossible, as that function immediately stops the program!
Either way, both binaries seem to agree on the 28% time spent in fegetexcept(). Can I get rid of all these checks?
Secondly, my full compiler optimization flags are as follows:
gcc -march=core2 -O4 -flto -mmmx -msse -msse2 -msse3 -fexcess-precision=fast -ffast-math -finline-limit=2000 -fmerge-all-constants -fmodulo-sched -fmodulo-sched-allow-regmoves -fgcse-sm -fgcse-las -fgcse-after-reload -funsafe-loop-optimizations
and
clang -march=core2 -O4
Is there something I can add to increase performance further? I don't care if the compile time jumps through the roof, I need every bit of performance I can get! (Regarding clang: I couldn't find much specific performance flags there, maybe I should go to llvm bytecode manually and then give flags to the llvm compiler there?)
TL;DR:
(1) Code spends 28% of time in fegetexcept(). Can that be avoided by choosing for 'unsafe floating point code'?
(2) What flags can I pass to gcc and clang to get maximum performance -- even if that sends compile time through the roof?
EDIT
I updated glibc from 2.13-r2 to 2.15-r2, and now the profiling output has changed to:
clang:
Total: 1654 samples
381 23.0% 23.0% 381 23.0% __asin_finite
244 14.8% 37.8% 244 14.8% significand
203 12.3% 50.1% 203 12.3% Vdihedral
141 8.5% 58.6% 141 8.5% nearestImageVector
116 7.0% 65.6% 116 7.0% Fexclusion
81 4.9% 70.5% 81 4.9% integratorTaskTick
70 4.2% 74.7% 70 4.2% Fangle
63 3.8% 78.5% 63 3.8% FdihedralParticle
56 3.4% 81.9% 56 3.4% mutiallyExclusivePairForces
45 2.7% 84.6% 45 2.7% FCoulomb
42 2.5% 87.2% 42 2.5% _init
39 2.4% 89.5% 39 2.4% __isinf
35 2.1% 91.7% 35 2.1% nearestImageDistance
29 1.8% 93.4% 29 1.8% __lgamma_r_finite
21 1.3% 94.7% 21 1.3% forEveryPairD
16 1.0% 95.6% 16 1.0% Fbond
13 0.8% 96.4% 13 0.8% __isnan
11 0.7% 97.1% 11 0.7% __cosh_finite
10 0.6% 97.7% 10 0.6% Fstack
10 0.6% 98.3% 10 0.6% __acosh_finite
9 0.5% 98.9% 9 0.5% pairWrapper
6 0.4% 99.2% 6 0.4% atan2
5 0.3% 99.5% 5 0.3% Fdihedral
5 0.3% 99.8% 5 0.3% calculateForces
2 0.1% 99.9% 2 0.1% GLIBC_2.15
1 0.1% 100.0% 1 0.1% exp
gcc:
Total: 1768 samples
385 21.8% 21.8% 385 21.8% __asin_finite
275 15.6% 37.3% 275 15.6% significand
252 14.3% 51.6% 252 14.3% nearestImageVector
199 11.3% 62.8% 299 16.9% Vdihedral.isra.4.part.5.2808
55 3.1% 66.0% 902 51.0% FdihedralParticle.2836
47 2.7% 68.6% 150 8.5% Fexclusion.part.15 (inline)
44 2.5% 71.1% 87 4.9% FCoulomb.part.16.2891
36 2.0% 73.1% 36 2.0% _init
33 1.9% 75.0% 236 13.3% mutiallyExclusivePairForces.2699
30 1.7% 76.7% 30 1.7% __lgamma_r_finite
29 1.6% 78.3% 29 1.6% isSaneNumber (inline)
28 1.6% 79.9% 28 1.6% feelExclusion (inline)
27 1.5% 81.4% 27 1.5% __isinf
25 1.4% 82.9% 35 2.0% Fangle.part.11.2855
22 1.2% 84.1% 40 2.3% Fangle.part.11 (inline)
22 1.2% 85.4% 24 1.4% randNorm.part.1.3194
20 1.1% 86.5% 20 1.1% __isnan
19 1.1% 87.6% 23 1.3% nearestImageUnitVector (inline)
19 1.1% 88.6% 19 1.1% pairWrapper.3570
18 1.0% 89.6% 105 5.9% langevinBBKhelper.3161
17 1.0% 90.6% 23 1.3% Fbond.part.10 (inline)
15 0.8% 91.5% 20 1.1% Vdihedral.isra.4.part.5 (inline)
14 0.8% 92.3% 14 0.8% length (inline)
13 0.7% 93.0% 13 0.7% getBasePairInfo (inline)
12 0.7% 93.7% 190 10.7% Fexclusion (inline)
12 0.7% 94.3% 15 0.8% Fstack.part.13 (inline)
12 0.7% 95.0% 12 0.7% __acosh_finite
12 0.7% 95.7% 12 0.7% reboxParticles (inline)
9 0.5% 96.2% 23 1.3% randNorm (inline)
8 0.5% 96.7% 8 0.5% __cosh_finite
8 0.5% 97.1% 221 12.5% visitNeighbours.part.1 (inline)
7 0.4% 97.5% 360 20.4% forEveryPairD
7 0.4% 97.9% 7 0.4% sincos
6 0.3% 98.2% 1467 83.0% calculateForces
5 0.3% 98.5% 943 53.3% Fdihedral.part.12 (inline)
4 0.2% 98.8% 33 1.9% debugVectorSanity (inline)
4 0.2% 99.0% 19 1.1% nearestImageDistance (inline)
3 0.2% 99.2% 317 17.9% Vdihedral.isra.4 (inline)
3 0.2% 99.3% 3 0.2% getAngleBaseInfo (inline)
2 0.1% 99.4% 2 0.1% resetForce.2703
2 0.1% 99.5% 2 0.1% tinymt64_generate_doubleOC (inline)
2 0.1% 99.7% 223 12.6% visitNeighbours (inline)
1 0.1% 99.7% 94 5.3% Fangle (inline)
1 0.1% 99.8% 1 0.1% calcInvDebyeLength (inline)
1 0.1% 99.8% 1 0.1% forEveryParticle
1 0.1% 99.9% 131 7.4% forEveryParticleD
1 0.1% 99.9% 1 0.1% munmap
1 0.1% 100.0% 1 0.1% neighbourStackDistance2 (inline)
0 0.0% 100.0% 1 0.1% 0x3e1341e250d56f1d
0 0.0% 100.0% 23 1.3% Fbond (inline)
0 0.0% 100.0% 1690 95.6% __libc_start_main
0 0.0% 100.0% 380 21.5% forEveryPair (inline)
0 0.0% 100.0% 1689 95.5% integratorTaskTick.3198
0 0.0% 100.0% 1690 95.6% main
0 0.0% 100.0% 1690 95.6% run (inline)
0 0.0% 100.0% 1689 95.5% seqTick.2114
0 0.0% 100.0% 1 0.1% taskStop (inline)
0 0.0% 100.0% 1689 95.5% taskTick (inline)
So it looks like that fegetexcept was probably just a wrong name that got identified with some piece of code from the glibc math routines. I guess that that's a shortcoming of google-perftools?
Part (2) of my question still stands, though: What flags can I pass to gcc and clang to get maximum performance -- even if that sends compile time through the roof?
EDIT2
Using 'perf' (see, for instance https://stackoverflow.com/a/10958510/153105 ) gives a decent profiling output. It looks like most of the time is spent in atan2() and cos(), and the sse2 versions are used. I'll add the output for completeness:
# Events: 17K cycles
#
# Overhead Command Shared Object Symbol
# ........ ....... .................... .......................................................
#
21.67% hairpin libm-2.15.so [.] __ieee754_atan2_sse2
14.12% hairpin hairpin [.] nearestImageVector
13.94% hairpin libm-2.15.so [.] __cos_sse2
11.94% hairpin hairpin [.] Vdihedral.isra.4.part.5.2808
8.27% hairpin hairpin [.] mutiallyExclusivePairForces.2699
4.81% hairpin hairpin [.] calculateForces
4.45% hairpin hairpin [.] FdihedralParticle.2836
3.89% hairpin hairpin [.] FCoulomb.part.16.2891
2.17% hairpin hairpin [.] langevinBBKhelper.3161
1.85% hairpin hairpin [.] Fangle.part.11.2855
1.83% hairpin libc-2.15.so [.] __isinf
1.64% hairpin hairpin [.] randNorm.part.1.3194
1.45% hairpin libm-2.15.so [.] __ieee754_log_sse2
1.02% hairpin hairpin [.] forEveryPairD
0.93% hairpin libm-2.15.so [.] __ieee754_acos_sse2
0.76% hairpin hairpin [.] pairWrapper.3570
0.76% hairpin hairpin [.] __isnan@plt
0.74% hairpin libc-2.15.so [.] __isnan
0.68% hairpin hairpin [.] __isinf@plt
0.59% hairpin libm-2.15.so [.] __ieee754_exp_sse2
0.58% hairpin libm-2.15.so [.] __sincos
0.55% hairpin hairpin [.] integratorTaskTick.3198
0.29% hairpin hairpin [.] __atan2_finite@plt
0.23% hairpin hairpin [.] cos@plt
0.19% hairpin libm-2.15.so [.] csloww1
0.07% hairpin hairpin [.] resetForce.2703
0.07% hairpin hairpin [.] forEveryParticle
0.06% hairpin libm-2.15.so [.] __dubcos
0.05% hairpin [kernel.kallsyms] [k] mutex_unlock
0.02% hairpin hairpin [.] __log_finite@plt
0.02% hairpin hairpin [.] forEveryParticleD
0.02% hairpin [kernel.kallsyms] [k] do_raw_spin_lock
0.02% hairpin hairpin [.] __acos_finite@plt
0.02% hairpin [kernel.kallsyms] [k] update_cpu_load
0.01% hairpin [kernel.kallsyms] [k] tick_sched_timer
0.01% hairpin [kernel.kallsyms] [k] ktime_get
0.01% hairpin hairpin [.] __exp_finite@plt
0.01% hairpin [kernel.kallsyms] [k] run_timer_softirq
0.01% hairpin [kernel.kallsyms] [k] apic_timer_interrupt
0.01% hairpin [kernel.kallsyms] [k] __cycles_2_ns
0.01% hairpin [kernel.kallsyms] [k] __local_bh_enable
0.01% hairpin [kernel.kallsyms] [k] intel_pmu_disable_all
0.01% hairpin [kernel.kallsyms] [k] r100_mm_rreg
0.01% hairpin [kernel.kallsyms] [k] perf_adjust_freq_unthr_context
0.01% hairpin [kernel.kallsyms] [k] update_stats_wait_end.clone.15
0.01% hairpin [kernel.kallsyms] [k] ttwu_do_activate.clone.50
0.01% hairpin [kernel.kallsyms] [k] do_signal
0.01% hairpin [kernel.kallsyms] [k] tty_hung_up_p
0.01% hairpin hairpin [.] main
0.01% hairpin [kernel.kallsyms] [k] prepare_signal
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileData::Evict(ProfileData::Entry const&)
0.01% hairpin [kernel.kallsyms] [k] uhci_check_ports
0.01% hairpin [kernel.kallsyms] [k] copy_siginfo_to_user
0.01% hairpin [kernel.kallsyms] [k] fxrstor_checking
0.01% hairpin [kernel.kallsyms] [k] calc_global_load
0.01% hairpin [kernel.kallsyms] [k] account_group_user_time
0.01% hairpin [kernel.kallsyms] [k] tg_load_down
0.01% hairpin [kernel.kallsyms] [k] irq_enter
0.01% hairpin [kernel.kallsyms] [k] __schedule
0.01% hairpin [kernel.kallsyms] [k] n_tty_write
0.01% hairpin libprofiler.so.0.3.0 [.] ProfileHandler::SignalHandler(int, siginfo*, void*)
0.01% hairpin [kernel.kallsyms] [k] get_cycles
0.01% hairpin [kernel.kallsyms] [k] enqueue_hrtimer
0.01% hairpin hairpin [.] seqTick.2114
0.01% hairpin [kernel.kallsyms] [k] idle_cpu
0.01% hairpin hairpin [.] sincos@plt
0.01% hairpin [kernel.kallsyms] [k] tick_program_event
0.01% hairpin [kernel.kallsyms] [k] clear_page_c
0.01% hairpin [kernel.kallsyms] [k] number.clone.1
0.01% hairpin [kernel.kallsyms] [k] task_waking_fair
0.01% hairpin [kernel.kallsyms] [k] save_i387_xstate
0.01% hairpin [kernel.kallsyms] [k] __rcu_pending
0.01% hairpin [kernel.kallsyms] [k] jiffies_to_timeval
0.01% hairpin [kernel.kallsyms] [k] iowrite16
0.01% hairpin [kernel.kallsyms] [k] hrtimer_interrupt
0.01% hairpin [kernel.kallsyms] [k] finish_task_switch
0.01% hairpin [kernel.kallsyms] [k] clockevents_program_event
0.01% hairpin [kernel.kallsyms] [k] ioread16
0.01% hairpin [kernel.kallsyms] [k] lapic_next_event
0.00% hairpin [kernel.kallsyms] [k] read_tsc
0.00% hairpin [kernel.kallsyms] [k] __zone_watermark_ok
0.00% hairpin libpthread-2.15.so [.] __libc_read
0.00% hairpin [kernel.kallsyms] [k] intel_pmu_enable_all