
I have some Fortran code that runs faster when compiled with gfortran than when compiled with ifort. Most reports I find online describe the opposite case.

I tried running Intel VTune to compare the hotspots of the two executables, but I couldn't pin down the cause.

I'm not sure what can cause this difference. Here is the perf output:

gfortran:

Performance counter stats for 'build/gnuRelease/NBODY inputFile temp' (10 runs):

      2,489.36 msec task-clock:u              #    0.986 CPUs utilized            ( +-  0.21% )
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
           589      page-faults:u             #    0.237 K/sec                    ( +-  0.05% )
10,678,130,527      cycles:u                  #    4.290 GHz                      ( +-  0.20% )
31,102,858,644      instructions:u            #    2.91  insn per cycle           ( +-  0.00% )
 3,537,572,458      branches:u                # 1421.078 M/sec                    ( +-  0.00% )
       566,054      branch-misses:u           #    0.02% of all branches          ( +-  5.14% )

        2.5235 +- 0.0150 seconds time elapsed  ( +-  0.59% )

ifort:

 Performance counter stats for 'build/ifortRelease/NBODY inputFile temp' (10 runs):

      2,834.44 msec task-clock:u              #    0.978 CPUs utilized            ( +-  0.14% )
             0      context-switches:u        #    0.000 K/sec                  
             0      cpu-migrations:u          #    0.000 K/sec                  
         2,600      page-faults:u             #    0.917 K/sec                    ( +-  0.01% )
12,146,500,211      cycles:u                  #    4.285 GHz                      ( +-  0.14% )
36,441,911,065      instructions:u            #    3.00  insn per cycle           ( +-  0.00% )
 2,936,917,079      branches:u                # 1036.154 M/sec                    ( +-  0.00% )
       339,226      branch-misses:u           #    0.01% of all branches          ( +-  3.74% )

        2.8991 +- 0.0165 seconds time elapsed  ( +-  0.57% )

The page-faults metric caught my eye, but I'm not sure what it means.
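As a rough sanity check on the page-fault count, the sketch below computes how expensive each extra page fault would have to be to explain the whole elapsed-time gap. The counts and times are copied from the perf output above. The per-fault cost of about 1 microsecond is an assumption for illustration, not something measured on this machine.

```python
# Means over 10 runs, copied from the perf output above.
gfortran = {"page_faults": 589, "elapsed_s": 2.5235}
ifort = {"page_faults": 2600, "elapsed_s": 2.8991}

# ASSUMPTION (not measured here): a minor page fault costs roughly 1 microsecond.
ASSUMED_FAULT_COST_S = 1e-6

extra_faults = ifort["page_faults"] - gfortran["page_faults"]
extra_time_s = ifort["elapsed_s"] - gfortran["elapsed_s"]

# Cost each extra fault would need to have to account for the entire gap.
required_cost_s = extra_time_s / extra_faults

# Fraction of the gap explained if each fault costs the assumed amount.
explained_share = extra_faults * ASSUMED_FAULT_COST_S / extra_time_s

print(f"extra page faults:        {extra_faults}")
print(f"extra elapsed time:       {extra_time_s:.4f} s")
print(f"cost per fault required:  {required_cost_s * 1e6:.0f} us")
print(f"share at assumed cost:    {explained_share:.2%}")
```

If the assumed per-fault cost is anywhere near realistic, the extra faults explain only a small fraction of the gap, so the higher count more likely reflects ifort's runtime touching more memory than a cause of the slowdown.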

UPDATE:

gfortran version: 10.2.0

ifort version: 19.1.3.304

CPU: Intel Xeon(R)

UPDATE:

similar example: Puzzling performance difference between ifort and gfortran

from this example:

When the complex IF statement is removed, gfortran takes about 4 times as much time (10-11 seconds). This is to be expected since the statement approximately throws out about 75% of the numbers, avoiding to do the SQRT on them. On the other hand, ifort only uses slightly more time. My guess is that something goes wrong when ifort tries to optimize the IF statement.

This seems to be relevant to my case too.

nadavhalahmi
  • Without source code, and compiler versions / options (and your CPU model number), we can't tell you why gfortran happened to make better asm than ifort in this specific case. It's unlikely that page faults accounted for the entire difference since the total count is still pretty small. – Peter Cordes Mar 08 '21 at 08:52
  • The biggest difference visible with this set of perf events is in total instructions and number of branches: perhaps ifort spent more instructions to make branchless asm somewhere that gfortran used a branch, and the branch was predictable enough to be worth it. – Peter Cordes Mar 08 '21 at 08:53
  • I updated the post with cpu and compilers versions – nadavhalahmi Mar 08 '21 at 09:05
  • I agree the instruction count might be relevant too. How can I control this optimization of removing branches? – nadavhalahmi Mar 08 '21 at 09:06
  • `-march=native` may set tuning parameters that bias the heuristics the compiler uses in any individual case. Or as [gcc optimization flag -O3 makes code slower than -O2](https://stackoverflow.com/q/28875325) mentions, profile-guided optimization can record what happens for each specific branch, letting the compiler make better choices (https://gcc.gnu.org/wiki/AutoFDO/Tutorial shows how to use -fprofile-generate / -fprofile-use). I think ICC probably has similar options. – Peter Cordes Mar 08 '21 at 09:12
  • You still didn't include any details about your CPU except that it's an Intel Xeon. Or compiler *options* you used. Or any source code. The details are likely to depend on some specific loop. – Peter Cordes Mar 08 '21 at 09:14
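The instruction and branch differences mentioned in the comments can be quantified directly from the perf counters in the question. This is a minimal sketch that only does arithmetic on the posted numbers. The "extra instructions per removed branch" figure is an illustrative ratio, not evidence that each missing branch was actually replaced by branchless code.

```python
# Counter means copied from the perf output in the question.
counters = {
    "gfortran": {
        "cycles": 10_678_130_527,
        "instructions": 31_102_858_644,
        "branches": 3_537_572_458,
    },
    "ifort": {
        "cycles": 12_146_500_211,
        "instructions": 36_441_911_065,
        "branches": 2_936_917_079,
    },
}

g, i = counters["gfortran"], counters["ifort"]

# ifort / gfortran ratio for each counter.
ratios = {name: i[name] / g[name] for name in g}

# Instructions per cycle for each compiler.
ipc = {name: c["instructions"] / c["cycles"] for name, c in counters.items()}

# Illustrative only: extra ifort instructions per branch that gfortran
# executed and ifort did not.
extra_instructions = i["instructions"] - g["instructions"]
fewer_branches = g["branches"] - i["branches"]
extra_per_branch = extra_instructions / fewer_branches

for name, ratio in ratios.items():
    print(f"{name:>12}: ifort/gfortran = {ratio:.3f}")
for name, value in ipc.items():
    print(f"{name:>12}: IPC = {value:.2f}")
print(f"extra instructions per removed branch: {extra_per_branch:.1f}")
```

Both builds run at roughly 3 instructions per cycle, so the cycle gap tracks the instruction gap: ifort executes about 17% more instructions while executing about 17% fewer branches. Since `cycles:u` counts user-space cycles only, this gap sits in the generated code rather than in kernel-side page-fault handling.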
