
I recently acquired a dual-boot computer to code in C++. On Windows I use the Intel C++ compiler, and on Linux I use g++. My programs consist mostly of computation (a fixed-point iteration algorithm with numerical integration, etc.).
I thought I could get performance on Linux close to what I get on Windows, but so far I don't: for the exact same code, the program compiled with g++ is about 2 times slower than the one built with the Intel compiler. From what I read, icc can be faster, maybe with gains of up to 20-30%, but I did not read anything about it being twice as fast (and in general I actually read that the two should be roughly equivalent).

At first I was using approximately equivalent flags on the two platforms:

icl /openmp /I "C:\boost_1_61_0" /fast program.cpp

and

g++ -o program program.cpp -std=c++11 -fopenmp -O3 -ffast-math

Following advice from several other topics I tried adding/replacing several other flags, such as -funsafe-math-optimizations, -march=native, -fwhole-program, -Ofast, etc., with only slight (or no) performance gains.
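
For reference, a combined g++ invocation along those lines (simply collecting the flags named above; note that -Ofast already implies -ffast-math, and whether each flag helps depends on the code) would be:

g++ -o program program.cpp -std=c++11 -fopenmp -Ofast -march=native -fwhole-program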

Is icc really faster, or am I missing something? I'm fairly new to Linux, so I don't know; maybe I forgot to install something properly (like a driver), or to change some option in g++? I have no idea whether this situation is normal or not, which is why I prefer to ask, especially since I would ideally rather code on Linux, so I would like it to be up to speed.

EDIT: I decided to install the latest Intel compiler (Intel C++ Compiler 17, update 4) on Linux to check. I end up with mixed results: it does NOT do better than gcc (it is even worse, in fact). I ran a cross-comparison Linux/Windows - icc/gcc - parallelized or not, using the flags mentioned earlier in the post (to make direct comparisons). Here are my results (time to run 1 iteration, measured in ms):

  1. Plain loop, no parallelization:

    • Windows:
      gcc = 122074 ; icc = 68799
    • Linux:
      gcc = 91042 ; icc = 92102
  2. Parallelized version:

    • Windows:
      gcc = 27457 ; icc = 19800
    • Linux:
      gcc = 27000 ; icc = 30000

To sum up: it's a bit of a mess. On Linux, gcc seems to always be faster than icc, especially when parallelization is involved (I ran it on a longer program, and the difference is much larger than the one here).
On Windows it's the opposite, and icc clearly dominates gcc, especially when there is no parallelization (in which case the gcc-built program takes a really long time to run).

The fastest runs come from the parallelized version built with icc on Windows. I don't understand why I cannot replicate this on Linux. Is there anything I need to do (on Ubuntu 16.04) to speed things up?
The other difference is that on Windows I use an older Intel Composer (Composer XE 2013) and call 'ia32' instead of intel64 (which is the one I should be using), while on Linux I use the latest version, which I installed yesterday. Also, on Linux the Intel Compiler 17 folder is on my second HDD (and not the SSD on which Linux is installed); I don't know if this might slow things down too.
Any idea where the problem may come from?

Edit: Exact hardware: Intel(R) Core(TM) i7-4710HQ CPU @ 2.50GHz, 4 cores with 2 threads per core (8 logical CPUs), x86_64 architecture - Linux: Ubuntu 16.04 with gcc 5.4.1 and Intel Compiler 17 (update 4) - Windows: Windows 8.1 with Intel Composer 2013

Edit: The code is very long, so here is the form of the loop that I'm testing (i.e. just one iteration of my fixed-point iteration). It's very classic, I guess... not sure it can bring anything to the topic.

// initialization of all the objects...
// length_grid1 is about 2000
vector<double> V_NEXT(length_grid1), PRICE_NEXT(length_grid1);
double V_min, price_min;
#pragma omp parallel
{
#pragma omp for private(V_min, price_min, i, indexcurrent, alpha, beta)
    for (i = 0; i < length_grid1; i++) {
        indexcurrent = indexsum[i];
        V_min = V_compute(&price_min, indexcurrent, ...);
        V_NEXT[indexcurrent] = V_min;
        PRICE_NEXT[indexcurrent] = price_min;
    }
} // end parallel
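
For what it's worth, an equivalent way to write this loop is to declare the per-iteration variables inside the body so they are implicitly private, which makes the private(...) clause unnecessary (just a sketch; the ... stands for the same elided arguments as above):

#pragma omp parallel for
for (int i = 0; i < length_grid1; i++) {
    const int idx = indexsum[i];                 // loop-local, so private automatically
    double price_i;                              // per-iteration output
    double V_i = V_compute(&price_i, idx, ...);  // independent golden-section search
    V_NEXT[idx] = V_i;
    PRICE_NEXT[idx] = price_i;
}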

where the V_compute function is a classic and simple optimization algorithm (a customized golden-section search) returning the optimal value and its argument:

double V_compute(double *xmin, int row_index, ...) {
    double x1, x2, f1, f2, fxmin;
    // golden_ratio = 0.61803399;
    x1 = upper_bound - golden_ratio*(upper_bound - lower_bound);
    x2 = lower_bound + golden_ratio*(upper_bound - lower_bound);

    // Evaluate the function at the two test points
    f1 = intra_value(x1, row_index, ...);
    f2 = intra_value(x2, row_index, ...);

    while (fabs(upper_bound - lower_bound) > tolerance) {
        if (f2 > f1) {
            upper_bound = x2; x2 = x1; f2 = f1;
            x1 = upper_bound - golden_ratio*(upper_bound - lower_bound);
            f1 = intra_value(x1, row_index, ...);
        } else {
            lower_bound = x1; x1 = x2; f1 = f2;
            x2 = lower_bound + golden_ratio*(upper_bound - lower_bound);
            f2 = intra_value(x2, row_index, ...);
        }
    }
    // Estimated minimizer = (lower bound + upper bound) / 2
    *xmin = (lower_bound + upper_bound)/2;
    fxmin = intra_value(*xmin, row_index, ...);
    return -fxmin;
}

The function being optimized (intra_value) is quite complicated in terms of computation (it picks a grid point (row_index) from a precomputed grid, then involves a lot of numerical integration, etc.).

G. Ander
  • You might want to try compiling your code with GCC on Windows. You can use [mingw-w64](http://mingw-w64.org/doku.php). – Bernard Jun 05 '17 at 13:32
  • Maybe adding -mavx and/or -mfpmath=sse would make the GCC code faster? I have a gut feeling icl uses SIMD by default. In addition you could use https://gcc.godbolt.org/ to see the assembly generated for your code using g++ or icl – kreuzerkrieg Jun 05 '17 at 13:41
  • A few years ago, I was working on a project where compiling with Intel on Linux gave us roughly 10x the speed of GCC on Linux. So it's not impossible. – Angew is no longer proud of SO Jun 05 '17 at 13:41
  • 11
    _"the compilation with g++ is about 2 times slower"_ I assume you mean the compiled program is slower, not the compilation is slower. Saying the compilation is slower means it takes longer to compile the code, but I don't think that's what you're talking about – Jonathan Wakely Jun 05 '17 at 13:42
  • 2
    Do you run it on Intel CPU? If not mistaken, the ICC could provide much better optimization (in comparison with itself) for Intel processors, and possibly faster than other compilers (due to their own knowledge of how to utilize their CPU instructions in a best way). – Mikhail Churbanov Jun 05 '17 at 13:46
  • Adding -mavx or -mfpmath=sse did not improve my performance. I'm indeed running on an Intel CPU, so the better performance kind of makes sense, but a gap this large surprises me. I'm trying to compile with gcc on Windows to see the difference. – G. Ander Jun 05 '17 at 15:50
  • 1
    did you try profile-guided optimization with [`-fprofile-generate` then `-fprofile-use`](https://stackoverflow.com/q/4365980/995714)? You should also try `-march=native` – phuclv Jun 05 '17 at 16:07
  • Yep, I did that already, and profile-guided optimization works quite well indeed (I gain about 5-10% of computation time)! I did not mention it here in order to directly compare gcc and icc with the same flags. – G. Ander Jun 05 '17 at 16:27
  • I just ran the code with gcc on Windows; the problem seems to come from icc vs gcc only, because my program compiled with gcc on Windows is in fact even slower than the one on Linux. I guess I'll just have to install icc on Linux after all. – G. Ander Jun 05 '17 at 16:29
  • 3
    It would help if you were willing to post source code and assembly dumps for both compilers. – zwol Jun 05 '17 at 20:27
  • My code is way too large (about 5k lines) for it to be posted. It's basically a fixed-point (value function) iteration algorithm, with a complex value function to compute each time (involving simple and double numerical integration and distribution functions using boost), which leads to a lot of lines (too many) of computation objects. – G. Ander Jun 06 '17 at 11:30
  • 1
    You need to post some code. Particular the critical loop. Look at the assembly as well. Otherwise everyone is only guessing. Also state exactly what your hardware is, what version of the compilers, what version of Linux and Windows. – Z boson Jun 06 '17 at 11:57
  • Consider binding the threads: `export OMP_PROC_BIND=true`. What kind of scaling do you expect? Should it scale with the number of cores/SIMD lanes, or is it memory-bandwidth bound? You should be able to provide a "back of the envelope" estimate of the scaling you expect and explain why. – Z boson Jun 06 '17 at 12:00
  • GCC's major weakness is loop-carried dependency chains. It does not unroll the loop to break the dependency. Intel and Clang do a much better job of breaking dependency chains. – Z boson Jun 06 '17 at 12:03
  • I added some code. I don't think it can help at all. I am not familiar with options to show assembly, etc. I tried something yesterday (a command to print all the source code + comments with the assembly next to it) but it was way too long; I cannot do anything about this here (and I already forgot what option I used). – G. Ander Jun 06 '17 at 13:22
  • As for the binding: I just need to add export OMP_PROC_BIND=true in the terminal before compiling? How does it work? Because I did, and it did not change anything. As for the "scaling I expect", I really don't know much about it; it's a bit technical for me... I edited the first post to put my config. But anyway, even without parallelization, icc on Windows is faster, so it should not be related. – G. Ander Jun 06 '17 at 13:26
  • Can you show some assembly from that code? Comparing the performance of two compilers really requires it – Steve Cox Jun 06 '17 at 13:30
  • Also, the inner loop of V_compute would be much more helpful than the outer loop for a real comparison – Steve Cox Jun 06 '17 at 13:31
  • How do I show assembly for these specific lines? (As mentioned above, I don't know what command to use.) I'm going to describe the inner loop – G. Ander Jun 06 '17 at 13:34
  • Code for the inner "loop" of V_compute added. (I will not be able to post intra_value, though.) – G. Ander Jun 06 '17 at 13:45
  • 1
    "The function optimized (intra_value) is quite complicated in terms of computation" Ok so a general rule of optimization is that the computationally complicated bits are the important part. If that inner function takes 5-10 times longer to run than any of this outer control logic (it sounds like it does from your description) then that is likely where all of the difference between the optimizers is going to be made up. You should profile your code to see if this is true. There are helpful tools like intels vtune or linux perf that can help guide you. – Steve Cox Jun 06 '17 at 13:55
  • You don't post the whole code here, only an [MCVE](https://stackoverflow.com/help/mcve)/[SSCCE](http://sscce.org/) or the main bottleneck part. You can check which instructions in the assembly output correspond to which line in the C code with https://gcc.godbolt.org/ – phuclv Jun 06 '17 at 14:35
  • 1
    and the fact that ICC isn't install on SSD isn't important. It just makes the compilation slower. The location of the compiled binary is more important and you should put it on the SSD to compare – phuclv Jun 06 '17 at 14:45
  • I'm running some tests; the problem MIGHT come from the boost distributions (the lgamma function in particular). I'm trying to put a simple example on gcc.godbolt.org, but how do I link boost in there (is this even possible)? I've tried the usual link flags but I don't know if any version is available at all there. Otherwise, is there any other way to show the assembly differences? – G. Ander Jun 07 '17 at 16:22
  • Also, I just noticed that the Intel compiler 2013 on Windows uses settings from the Visual Studio 10 x86 tools, and thus uses C++98 and not C++11. Could this be the source of the problem? – G. Ander Jun 07 '17 at 17:56
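
Following up on the profiling suggestion in the comments above, a minimal Linux perf session (assuming the perf tool is installed and the binary was built with -g) would look like:

perf record -g ./program   # run the program and record call-graph samples
perf report                # browse the hottest functions

This should quickly confirm whether intra_value dominates the runtime.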

2 Answers


It looks like you're using OpenMP, and so I suspect the difference is in the OpenMP implementation, not just the quality of the optimized code.

Intel's OpenMP runtime is known to be quite high performance, and GCC's is good but not great.

OpenMP programs have very different performance characteristics; they don't just depend on how well the compiler can optimize loops or inline function calls. The implementation of the OpenMP runtime matters a lot, as does the OS implementation of threads and synchronization primitives, which are quite different between Windows and GNU/Linux.
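
One way to test how much of the difference comes from the runtime rather than from code generation is to link the GCC-compiled objects against Intel's OpenMP runtime, which exports a GOMP-compatible interface. A rough sketch (the library path is an assumption for a typical Intel install; adjust it to yours):

g++ -std=c++11 -O3 -ffast-math -fopenmp -c program.cpp
# link with Intel's libiomp5 instead of GCC's libgomp (path is an assumption)
g++ program.o -o program -L/opt/intel/lib/intel64 -liomp5 -lpthread

If the GCC build speeds up noticeably this way, the OpenMP runtime is indeed a large part of the story.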

Jonathan Wakely
  • 1
    I tried to run a program without parallelization, it ends up being 1.7 times longer with g++. This is a considerable improvement so it seems that it was indeed part of the story here. – G. Ander Jun 05 '17 at 15:45
  • It might not have such an important impact after all, cf. the cross-comparisons I added in the main question – G. Ander Jun 06 '17 at 11:12

Note that "fast-math" breaks some language rules to get fast code and may produce incorrect results in some cases.
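
As a small illustration of the kind of rule it breaks (a standalone sketch, unrelated to the question's code): -ffast-math includes -ffinite-math-only, which lets the compiler assume no NaNs occur, so a standard NaN self-test may be folded away:

#include <cstdio>

// With -ffast-math the compiler may assume x == x always holds,
// so this can be optimized to `return false` even for NaN inputs.
bool is_nan(double x) { return x != x; }

int main() {
    double nan = 0.0 / 0.0;             // produces a NaN at runtime
    std::printf("%d\n", is_nan(nan));   // may print 0 with -ffast-math, 1 without
}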

Also note that -O3 is not guaranteed to be faster than -O2 or any of the other optimization levels (it depends on your code) - you should test multiple versions.

You may also want to enable -Wl,-O1 - the linker can also do some optimizations.

You may also want to try building with LTO (link time optimization) - it can often yield significant improvements.
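
For example, reusing the invocation from the question (a sketch; LTO pays off most when the program is split across several translation units):

g++ -o program program.cpp -std=c++11 -fopenmp -O3 -ffast-math -flto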

I realize this does not answer your question as such. But it should give you some things to play with :-)

Also, gcc is improving pretty fast. You may want to try a newer version if you are not already on 7.1. Also, try Clang for a third data point. Additionally, you can use icc on Linux if you want to.

Jesper Juhl
  • I tried most of these, without any considerable improvement. As for icc, I would like to use it on Linux, but it requires a lot of disk space that I don't have on the Linux partition, and I have not found any solution for this yet. – G. Ander Jun 05 '17 at 15:48
  • 1
    @G. Ander - "As for icc, I would like to use it on linux but it requires a lot of disk space that I don t have" - sounds like an easy problem to fix. Buy more disk space ;-) – Jesper Juhl Jun 05 '17 at 15:56
  • 1
    @G.Ander you can map a virtual disk image file in Linux without problem if you have other partitions, then install to that disk – phuclv Jun 05 '17 at 16:05
  • I finally installed it on my second HDD. The results are not as good as expected (cf. the EDIT comparison in the main post). – G. Ander Jun 06 '17 at 11:31