Profiling number crunching code: 28% of time in fegetexcept() & optimal compiler flags?

Question

I'm running simulations of DNA strands that involve lots of floating point number crunching. The full code is here: https://github.com/RoaldFre/DNA

I did some profiling with google-perftools after compiling with both gcc and clang. In both cases, google-perftools says that about 28% of the time is spent in fegetexcept(). This appears to be a function of the C library that queries the floating point exception flags of the CPU.

Note that I'm using -ffast-math with gcc, which -- if I'm not mistaken -- should ignore (all?) floating point exceptions! I'm also using -O4 with clang (is there a separate flag to enable unsafe floating point instruction?).

The profiling output for the binary compiled with clang:

Total: 1561 samples
     438  28.1%  28.1%      438  28.1% fegetexcept
     263  16.8%  44.9%      263  16.8% cos
     224  14.3%  59.3%      224  14.3% Vdihedral
     131   8.4%  67.6%      131   8.4% nearestImageVector
     102   6.5%  74.2%      102   6.5% Fexclusion
      70   4.5%  78.7%       70   4.5% Fdihedral
      65   4.2%  82.8%       65   4.2% Fangle
      53   3.4%  86.2%       53   3.4% integratorTaskTick
      46   2.9%  89.2%       46   2.9% nearestImageDistance
      45   2.9%  92.1%       45   2.9% mutiallyExclusivePairForces
      32   2.0%  94.1%       32   2.0% FCoulomb
      24   1.5%  95.6%       24   1.5% forEveryPairD
      17   1.1%  96.7%       17   1.1% calculateForces
      14   0.9%  97.6%       14   0.9% atan2
       8   0.5%  98.1%        8   0.5% Fstack
       8   0.5%  98.7%        8   0.5% pairWrapper
       6   0.4%  99.0%        6   0.4% sin
       3   0.2%  99.2%        3   0.2% __finite
       3   0.2%  99.4%        3   0.2% _init
       2   0.1%  99.6%        2   0.1% acos
       2   0.1%  99.7%        2   0.1% exp
       2   0.1%  99.8%        2   0.1% log
       1   0.1%  99.9%        1   0.1% _IO_file_xsputn
       1   0.1%  99.9%        1   0.1% log2
       1   0.1% 100.0%        1   0.1% significand

The profiling output for the binary compiled with gcc:

Total: 1561 samples
     438  28.1%  28.1%      438  28.1% fegetexcept
     352  22.5%  50.6%      352  22.5% nearestImageVector (inline)
     263  16.8%  67.5%      263  16.8% cos
     131   8.4%  75.8%      131   8.4% measurementTask
      52   3.3%  79.2%      331  21.2% Vbond.isra.1.part.2 (inline)
      50   3.2%  82.4%      562  36.0% dumpStatsSample.2372 (inline)
      46   2.9%  85.3%       46   2.9% nearestImageVector
      42   2.7%  88.0%       42   2.7% FdihedralParticle (inline)
      33   2.1%  90.1%       59   3.8% Vstack.isra.12.part.13 (inline)
      26   1.7%  91.8%       26   1.7% neighbourStackDistance2 (inline)
      16   1.0%  92.8%       16   1.0% die
      14   0.9%  93.7%       44   2.8% VangleP5SB (inline)
      14   0.9%  94.6%       14   0.9% atan2
      13   0.8%  95.5%       88   5.6% Vangle.isra.4.part.5 (inline)
      12   0.8%  96.2%      343  22.0% Vbond.isra.1 (inline)
       8   0.5%  96.7%        8   0.5% printUsage.3788
       6   0.4%  97.1%        6   0.4% Fangle.part.8 (inline)
       6   0.4%  97.5%        6   0.4% boxFromParticle (inline)
       6   0.4%  97.9%      279  17.9% nearestImageDistance (inline)
       6   0.4%  98.3%        6   0.4% sin
       3   0.2%  98.5%        3   0.2% Vdihedral.isra.9 (inline)
       3   0.2%  98.7%        3   0.2% __finite
       3   0.2%  98.8%        3   0.2% _init
       3   0.2%  99.0%        3   0.2% numParticles (inline)
       2   0.1%  99.2%        2   0.1% acos
       2   0.1%  99.3%        8   0.5% addToGrid
       2   0.1%  99.4%        2   0.1% exp
       2   0.1%  99.6%        2   0.1% getAngleBaseInfo (inline)
       2   0.1%  99.7%        2   0.1% log
       1   0.1%  99.7%        1   0.1% Fbond.part.6 (inline)
       1   0.1%  99.8%        1   0.1% _IO_file_xsputn
       1   0.1%  99.9%      563  36.1% dumpStatsSample.2372
       1   0.1%  99.9%        1   0.1% log2
       1   0.1% 100.0%        1   0.1% significand
       0   0.0% 100.0%        6   0.4% Fangle (inline)
       0   0.0% 100.0%        1   0.1% Fbond (inline)
       0   0.0% 100.0%       42   2.7% Fdihedral.part.11 (inline)
       0   0.0% 100.0%        4   0.3% Fstack.part.14 (inline)
       0   0.0% 100.0%       53   3.4% calculateForces.2880
       0   0.0% 100.0%        3   0.2% getKineticTemperature (inline)
       0   0.0% 100.0%        4   0.3% nearestImageDistance2 (inline)

Now, I'm calling quite a bit of functions by function pointers, and gcc was compiling with lto and -O4. I have reason to believe that this possibly caused the profiling output for the gcc binary to be somewhat fluked. For instance, it says that there are 16 samples that are in 'die()'. However, this is impossible, as that function immediately stops the program!

Either way, both binaries seem to agree on the 28% time spent in fegetexcept(). Can I get rid of all these checks?

Secondly, my full compiler optimization flags are as follows:

gcc -march=core2 -O4 -flto -mmmx -msse -msse2 -msse3 -fexcess-precision=fast -ffast-math -finline-limit=2000 -fmerge-all-constants -fmodulo-sched -fmodulo-sched-allow-regmoves -fgcse-sm -fgcse-las -fgcse-after-reload -funsafe-loop-optimizations

and

clang -march=core2 -O4

Is there something I can add to increase performance further? I don't care if the compile time jumps through the roof, I need every bit of performance I can get! (Regarding clang: I couldn't find much specific performance flags there, maybe I should go to llvm bytecode manually and then give flags to the llvm compiler there?)

TL;DR:
(1) Code spends 28% of time in fegetexcept(). Can that be avoided by choosing for 'unsafe floating point code'?
(2) What flags can I pass to gcc and clang to get maximum performance -- even if that sends compile time through the roof?

EDIT

I updated glibc from 2.13-r2 to 2.15-r2, and now the profiling output has changed to:

clang:

Total: 1654 samples
     381  23.0%  23.0%      381  23.0% __asin_finite
     244  14.8%  37.8%      244  14.8% significand
     203  12.3%  50.1%      203  12.3% Vdihedral
     141   8.5%  58.6%      141   8.5% nearestImageVector
     116   7.0%  65.6%      116   7.0% Fexclusion
      81   4.9%  70.5%       81   4.9% integratorTaskTick
      70   4.2%  74.7%       70   4.2% Fangle
      63   3.8%  78.5%       63   3.8% FdihedralParticle
      56   3.4%  81.9%       56   3.4% mutiallyExclusivePairForces
      45   2.7%  84.6%       45   2.7% FCoulomb
      42   2.5%  87.2%       42   2.5% _init
      39   2.4%  89.5%       39   2.4% __isinf
      35   2.1%  91.7%       35   2.1% nearestImageDistance
      29   1.8%  93.4%       29   1.8% __lgamma_r_finite
      21   1.3%  94.7%       21   1.3% forEveryPairD
      16   1.0%  95.6%       16   1.0% Fbond
      13   0.8%  96.4%       13   0.8% __isnan
      11   0.7%  97.1%       11   0.7% __cosh_finite
      10   0.6%  97.7%       10   0.6% Fstack
      10   0.6%  98.3%       10   0.6% __acosh_finite
       9   0.5%  98.9%        9   0.5% pairWrapper
       6   0.4%  99.2%        6   0.4% atan2
       5   0.3%  99.5%        5   0.3% Fdihedral
       5   0.3%  99.8%        5   0.3% calculateForces
       2   0.1%  99.9%        2   0.1% GLIBC_2.15
       1   0.1% 100.0%        1   0.1% exp

gcc:

Total: 1768 samples
     385  21.8%  21.8%      385  21.8% __asin_finite
     275  15.6%  37.3%      275  15.6% significand
     252  14.3%  51.6%      252  14.3% nearestImageVector
     199  11.3%  62.8%      299  16.9% Vdihedral.isra.4.part.5.2808
      55   3.1%  66.0%      902  51.0% FdihedralParticle.2836
      47   2.7%  68.6%      150   8.5% Fexclusion.part.15 (inline)
      44   2.5%  71.1%       87   4.9% FCoulomb.part.16.2891
      36   2.0%  73.1%       36   2.0% _init
      33   1.9%  75.0%      236  13.3% mutiallyExclusivePairForces.2699
      30   1.7%  76.7%       30   1.7% __lgamma_r_finite
      29   1.6%  78.3%       29   1.6% isSaneNumber (inline)
      28   1.6%  79.9%       28   1.6% feelExclusion (inline)
      27   1.5%  81.4%       27   1.5% __isinf
      25   1.4%  82.9%       35   2.0% Fangle.part.11.2855
      22   1.2%  84.1%       40   2.3% Fangle.part.11 (inline)
      22   1.2%  85.4%       24   1.4% randNorm.part.1.3194
      20   1.1%  86.5%       20   1.1% __isnan
      19   1.1%  87.6%       23   1.3% nearestImageUnitVector (inline)
      19   1.1%  88.6%       19   1.1% pairWrapper.3570
      18   1.0%  89.6%      105   5.9% langevinBBKhelper.3161
      17   1.0%  90.6%       23   1.3% Fbond.part.10 (inline)
      15   0.8%  91.5%       20   1.1% Vdihedral.isra.4.part.5 (inline)
      14   0.8%  92.3%       14   0.8% length (inline)
      13   0.7%  93.0%       13   0.7% getBasePairInfo (inline)
      12   0.7%  93.7%      190  10.7% Fexclusion (inline)
      12   0.7%  94.3%       15   0.8% Fstack.part.13 (inline)
      12   0.7%  95.0%       12   0.7% __acosh_finite
      12   0.7%  95.7%       12   0.7% reboxParticles (inline)
       9   0.5%  96.2%       23   1.3% randNorm (inline)
       8   0.5%  96.7%        8   0.5% __cosh_finite
       8   0.5%  97.1%      221  12.5% visitNeighbours.part.1 (inline)
       7   0.4%  97.5%      360  20.4% forEveryPairD
       7   0.4%  97.9%        7   0.4% sincos
       6   0.3%  98.2%     1467  83.0% calculateForces
       5   0.3%  98.5%      943  53.3% Fdihedral.part.12 (inline)
       4   0.2%  98.8%       33   1.9% debugVectorSanity (inline)
       4   0.2%  99.0%       19   1.1% nearestImageDistance (inline)
       3   0.2%  99.2%      317  17.9% Vdihedral.isra.4 (inline)
       3   0.2%  99.3%        3   0.2% getAngleBaseInfo (inline)
       2   0.1%  99.4%        2   0.1% resetForce.2703
       2   0.1%  99.5%        2   0.1% tinymt64_generate_doubleOC (inline)
       2   0.1%  99.7%      223  12.6% visitNeighbours (inline)
       1   0.1%  99.7%       94   5.3% Fangle (inline)
       1   0.1%  99.8%        1   0.1% calcInvDebyeLength (inline)
       1   0.1%  99.8%        1   0.1% forEveryParticle
       1   0.1%  99.9%      131   7.4% forEveryParticleD
       1   0.1%  99.9%        1   0.1% munmap
       1   0.1% 100.0%        1   0.1% neighbourStackDistance2 (inline)
       0   0.0% 100.0%        1   0.1% 0x3e1341e250d56f1d
       0   0.0% 100.0%       23   1.3% Fbond (inline)
       0   0.0% 100.0%     1690  95.6% __libc_start_main
       0   0.0% 100.0%      380  21.5% forEveryPair (inline)
       0   0.0% 100.0%     1689  95.5% integratorTaskTick.3198
       0   0.0% 100.0%     1690  95.6% main
       0   0.0% 100.0%     1690  95.6% run (inline)
       0   0.0% 100.0%     1689  95.5% seqTick.2114
       0   0.0% 100.0%        1   0.1% taskStop (inline)
       0   0.0% 100.0%     1689  95.5% taskTick (inline)

So it looks like that fegetexcept was probably just a wrong name that got identified with some piece of code from the glibc math routines. I guess that that's a shortcoming of google-perftools?

Part (2) of my question still stands, though: What flags can I pass to gcc and clang to get maximum performance -- even if that sends compile time through the roof?

EDIT2

Using 'perf' (see, for instance https://stackoverflow.com/a/10958510/153105 ) gives a decent profiling output. It looks like most of the time is spent in atan2() and cos(), and the sse2 versions are used. I'll add the output for completeness:

# Events: 17K cycles
#
# Overhead  Command         Shared Object                                                   Symbol
# ........  .......  ....................  .......................................................
#
    21.67%  hairpin  libm-2.15.so          [.] __ieee754_atan2_sse2                               
    14.12%  hairpin  hairpin               [.] nearestImageVector                                 
    13.94%  hairpin  libm-2.15.so          [.] __cos_sse2                                         
    11.94%  hairpin  hairpin               [.] Vdihedral.isra.4.part.5.2808                       
     8.27%  hairpin  hairpin               [.] mutiallyExclusivePairForces.2699                   
     4.81%  hairpin  hairpin               [.] calculateForces                                    
     4.45%  hairpin  hairpin               [.] FdihedralParticle.2836                             
     3.89%  hairpin  hairpin               [.] FCoulomb.part.16.2891                              
     2.17%  hairpin  hairpin               [.] langevinBBKhelper.3161                             
     1.85%  hairpin  hairpin               [.] Fangle.part.11.2855                                
     1.83%  hairpin  libc-2.15.so          [.] __isinf                                            
     1.64%  hairpin  hairpin               [.] randNorm.part.1.3194                               
     1.45%  hairpin  libm-2.15.so          [.] __ieee754_log_sse2                                 
     1.02%  hairpin  hairpin               [.] forEveryPairD                                      
     0.93%  hairpin  libm-2.15.so          [.] __ieee754_acos_sse2                                
     0.76%  hairpin  hairpin               [.] pairWrapper.3570                                   
     0.76%  hairpin  hairpin               [.] __isnan@plt                                        
     0.74%  hairpin  libc-2.15.so          [.] __isnan                                            
     0.68%  hairpin  hairpin               [.] __isinf@plt                                        
     0.59%  hairpin  libm-2.15.so          [.] __ieee754_exp_sse2                                 
     0.58%  hairpin  libm-2.15.so          [.] __sincos                                           
     0.55%  hairpin  hairpin               [.] integratorTaskTick.3198                            
     0.29%  hairpin  hairpin               [.] __atan2_finite@plt                                 
     0.23%  hairpin  hairpin               [.] cos@plt                                            
     0.19%  hairpin  libm-2.15.so          [.] csloww1                                            
     0.07%  hairpin  hairpin               [.] resetForce.2703                                    
     0.07%  hairpin  hairpin               [.] forEveryParticle                                   
     0.06%  hairpin  libm-2.15.so          [.] __dubcos                                           
     0.05%  hairpin  [kernel.kallsyms]     [k] mutex_unlock                                       
     0.02%  hairpin  hairpin               [.] __log_finite@plt                                   
     0.02%  hairpin  hairpin               [.] forEveryParticleD                                  
     0.02%  hairpin  [kernel.kallsyms]     [k] do_raw_spin_lock                                   
     0.02%  hairpin  hairpin               [.] __acos_finite@plt                                  
     0.02%  hairpin  [kernel.kallsyms]     [k] update_cpu_load                                    
     0.01%  hairpin  [kernel.kallsyms]     [k] tick_sched_timer                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] ktime_get                                          
     0.01%  hairpin  hairpin               [.] __exp_finite@plt                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] run_timer_softirq                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] apic_timer_interrupt                               
     0.01%  hairpin  [kernel.kallsyms]     [k] __cycles_2_ns                                      
     0.01%  hairpin  [kernel.kallsyms]     [k] __local_bh_enable                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] intel_pmu_disable_all                              
     0.01%  hairpin  [kernel.kallsyms]     [k] r100_mm_rreg                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] perf_adjust_freq_unthr_context                     
     0.01%  hairpin  [kernel.kallsyms]     [k] update_stats_wait_end.clone.15                     
     0.01%  hairpin  [kernel.kallsyms]     [k] ttwu_do_activate.clone.50                          
     0.01%  hairpin  [kernel.kallsyms]     [k] do_signal                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] tty_hung_up_p                                      
     0.01%  hairpin  hairpin               [.] main                                               
     0.01%  hairpin  [kernel.kallsyms]     [k] prepare_signal                                     
     0.01%  hairpin  libprofiler.so.0.3.0  [.] ProfileData::Evict(ProfileData::Entry const&)      
     0.01%  hairpin  [kernel.kallsyms]     [k] uhci_check_ports                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] copy_siginfo_to_user                               
     0.01%  hairpin  [kernel.kallsyms]     [k] fxrstor_checking                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] calc_global_load                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] account_group_user_time                            
     0.01%  hairpin  [kernel.kallsyms]     [k] tg_load_down                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] irq_enter                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] __schedule                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] n_tty_write                                        
     0.01%  hairpin  libprofiler.so.0.3.0  [.] ProfileHandler::SignalHandler(int, siginfo*, void*)
     0.01%  hairpin  [kernel.kallsyms]     [k] get_cycles                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] enqueue_hrtimer                                    
     0.01%  hairpin  hairpin               [.] seqTick.2114                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] idle_cpu                                           
     0.01%  hairpin  hairpin               [.] sincos@plt                                         
     0.01%  hairpin  [kernel.kallsyms]     [k] tick_program_event                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] clear_page_c                                       
     0.01%  hairpin  [kernel.kallsyms]     [k] number.clone.1                                     
     0.01%  hairpin  [kernel.kallsyms]     [k] task_waking_fair                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] save_i387_xstate                                   
     0.01%  hairpin  [kernel.kallsyms]     [k] __rcu_pending                                      
     0.01%  hairpin  [kernel.kallsyms]     [k] jiffies_to_timeval                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] iowrite16                                          
     0.01%  hairpin  [kernel.kallsyms]     [k] hrtimer_interrupt                                  
     0.01%  hairpin  [kernel.kallsyms]     [k] finish_task_switch                                 
     0.01%  hairpin  [kernel.kallsyms]     [k] clockevents_program_event                          
     0.01%  hairpin  [kernel.kallsyms]     [k] ioread16                                           
     0.01%  hairpin  [kernel.kallsyms]     [k] lapic_next_event                                   
     0.00%  hairpin  [kernel.kallsyms]     [k] read_tsc                                           
     0.00%  hairpin  [kernel.kallsyms]     [k] __zone_watermark_ok                                
     0.00%  hairpin  libpthread-2.15.so    [.] __libc_read                                        
     0.00%  hairpin  [kernel.kallsyms]     [k] intel_pmu_enable_all

You might want to check [*this post*](http://scicomp.stackexchange.com/a/2719/1262). It's evident that you're seeing only self-time, and that doesn't tell you much. — Mike Dunlavey, Sep 15 '12 at 16:52
MikeDunlavey: Thanks! I'll check that post right away! nneonneo: I'm running Gentoo Linux. GCC version 4.6.3. Clang version 3.1-r5 (llvm version 3.1-r2). — RoaldFre, Sep 15 '12 at 16:59
I checked this with `gcc 4.4.5` and find no calls to `fegetexcept()` in `oprofile`'s log, so far ..- it's still running. (I had to remove `-fexcess-precision=fast` and `-flto`) as `gcc` complained to not knowing them.) — alk, Sep 15 '12 at 17:26
The listing provided by your 2nd edit corresponds quiet well to what I see in `oprofile`'s output, here locally. (I'm using the Debian Stable set of tools, btw.) — alk, Sep 16 '12 at 08:52

Bruce Dawson · Answer 1 · 2015-02-24T05:30:43.050

You should use perf and Brendan Gregg's scripts to create a flame graph so that you can get a nice visual representation of where time is going. A flame graph will make it obvious which functions are fegetexcept because it is a way of visualizing and summarizing call stacks:

http://www.brendangregg.com/FlameGraphs/cpuflamegraphs.html

CPU sampling without call stacks is usually useless.

Make sure you have symbols installed for everything since many profilers will associate samples with the nearest exported name which can lead to gross errors -- that could account for why 'die' is showing some samples.

You could also try loading your program into gdb and setting a breakpoint on fegetexcept. If you have both the symbols and source-code installed for libc then you can walk up the call stack and see why fegetexcept is being called. My guess is that you are passing in out-of-range values to acos, or something like that.

This article talks about how to install symbols and source for libc.

https://randomascii.wordpress.com/2013/01/08/symbols-on-linux-part-one-g-library-symbols/

+ I'm agreeing with you except for the flame-graph part. Check Point 2 of [*this post*](http://stackoverflow.com/a/27867426/23771). Briefly, flame graphs don't show time-hogs if they are called from multiple different paths. — Mike Dunlavey, Feb 24 '15 at 17:37
I'd say flame graphs are still well worth trying. Even if the calls to fegetexcept are coming from many paths the OP will be able to find these paths through careful inspection. That's a good post on flame graphs. I normally use ETW/xperf/WPA which has separate graphs for exploring busy and idle times, including showing all paths to a particular function. But that's Windows only. — Bruce Dawson, Feb 24 '15 at 20:26

Stephen Canon · Answer 2 · 2012-09-15T16:40:17.513

0

Your sources don't contain any calls to fegetexcept( ), which means it's being invoked by some library function you use. Probably one or more functions in the system math library, to judge by the rest of your sampling data.

Can you try adding -fno-math-errno? On some platforms, this will help avoid needless FP environment operations.

edited Sep 15 '12 at 16:40

answered Sep 15 '12 at 16:10

Stephen Canon

103,815
19
183
269

The flag -fno-math-errno is implied by -ffast-math, which I enabled for gcc. I've double checked it by explicitly giving the -fno-math-errno as well, but the result is the same. I assume that fegetexcept() is called by the math glibc functions that I'm using (mostly cos and atan2). – RoaldFre Sep 15 '12 at 16:56
From the new traces after you upgraded glibc, I conclude that the old version likely didn't support optimizing for -fno-math-errno. – Stephen Canon Sep 17 '12 at 14:09
@RoaldFre, there was a [bug in clang](http://llvm-reviews.chandlerc.com/rL204742) related to fast-math. Try to re-collect your data with a new clang/llvm. – Zinovy Nis Apr 01 '14 at 12:11

Profiling number crunching code: 28% of time in fegetexcept() & optimal compiler flags?

EDIT

EDIT2

2 Answers2