32

I'm distributing a C++ program with a makefile for the Unix version, and I'm wondering what compiler options I should use to get the fastest possible code (it falls into the category of programs that can use all the computing power they can get and still come back for more), given that I don't know in advance what hardware, operating system or gcc version the user will have, and I want above all else to make sure it at least works correctly on every major Unix-like operating system.

Thus far, I have g++ -O3 -Wno-write-strings, are there any other options I should add? On Windows, the Microsoft compiler has options for things like fast calling convention and link time code generation that are worth using, are there any equivalents on gcc?

(I'm assuming it will default to 64-bit on a 64-bit platform, please correct me if that's not the case.)

jww
rwallace
  • Be aware that fastcall isn't *always* faster. As always when it comes to performance, measure, measure, measure. It's only faster if you've got a benchmark showing that it is. – jalf Jun 09 '10 at 12:16
  • It's difficult to know which optimizations might speed up your program when we don't know what it does. And even if we knew, there are so many different configurations out there that it's unlikely any single option will be faster on almost all of them. – ereOn Jun 09 '10 at 12:20
  • Depending on what your program does, you might be able to deactivate RTTI. It should get you some speed up, but forbid the use of a number of features. – Matthieu M. Jun 09 '10 at 12:28
  • Theorem prover. (All integer calculations, pointers and branches, but the answers discussing floating point might be useful for other people.) Of course it's always difficult to be sure, but at the end of the day one has to make a decision without having seen the user's machine. I tried disabling RTTI, but it didn't seem to make any difference in the generated code. – rwallace Jun 09 '10 at 12:43
  • Now i know this might be a little off-topic, but i will try anyway: experimenting with compiler options might help you a lot, but if your code (algorithms) are slow, then no clever optimization will help you. Profile your code, find bottlenecks and fix them. I'm not saying that your code is bad, this is just a friendly suggestion! – PeterK Jun 09 '10 at 12:44
  • PeterK - good heavens, yes. I've already implemented some optimizations worth more than 10 orders of magnitude speedup on typical problems (estimate, hard to test for obvious reasons) and there's a lot more to go. If accused of being therefore irrational for spending any time fiddling with compiler settings, I will plead _nolo contendere_ :-) – rwallace Jun 09 '10 at 12:54
  • Over 10 orders of magnitude? You optimized something that has a runtime of roughly 4 months down to 1 millisecond? Wow. – Damon Jan 24 '14 at 00:50
  • Build with LTO (Link Time Optimization) enabled - it slows down your compile but has a general positive effect on the speed of the resulting program. – Jesper Juhl May 27 '18 at 12:49

7 Answers

22

Without knowing any specifics of your program it's hard to say. -O3 covers most of the optimisations. The remaining options come "at a cost". If you can tolerate slightly different rounding and your code doesn't depend on strict IEEE floating-point semantics, you can try -Ofast. This disregards standards compliance and can give you faster code.

The remaining optimisations flags can only improve performance of certain programs, but can even be detrimental to others. Look at the available flags in the gcc documentation on optimisation flags and benchmark them.

Another option is to enable C99 (-std=c99) and inline appropriate functions. This is a bit of an art: you shouldn't inline everything, but with a little work you can make your code faster (albeit at the cost of a larger executable).

If speed is really an issue, I would suggest either going back to Microsoft's compiler or trying Intel's. I've come to appreciate how slow some gcc-compiled code can be, especially when it involves math.h.

EDIT: Oh wait, you said C++? Then disregard my C99 paragraph, you can inline already :)

Il-Bhima
  • Funny, I've come to appreciate how slow MSVC-compiled code can be :-) I also don't think that applies, as the poster seems to want GCC so it can target SPARC, PPC, "every major Unix-like operating system". – phkahler Jun 09 '10 at 12:26
  • Actually I've never used the MSVC compiler. I'm surprised it's not fast, you would think MS would have optimised the crap out of it seeing that they probably have 90% of their software compiled on it. I am comparing to Intel's which I have used extensively. Yeah, I just realised that the OP wants it to target most unixes making it even harder to list a fixed set of opt flags. – Il-Bhima Jun 09 '10 at 12:35
  • I'm compiling the Windows binary with the Microsoft compiler (about 5 to 10% faster than GCC by my tests), this is for the Unix distribution. As far as I now understand it, -Ofast etc may (or may not) help floating-point code, but for integer code -O3 already gives you the full Monty? – rwallace Jun 09 '10 at 12:47
  • Ok as far as I know (and the doc seems to agree) Ofast is funsafe-math, which applies only to floating point math, so if you've got only integer math it's probably not going to help. However, I wouldn't say O3 gives you the full monty, there are other options which O3 doesn't use since they don't guarantee faster code. Optimisation is highly program and architecture dependent. Its possible that disabling one of the O3 opts could improve your performance. If performance is that vital benchmark your program on a set of machines and have a set of flags for each architecture in your makefile. – Il-Bhima Jun 09 '10 at 13:02
19

I would try profile guided optimization:

-fprofile-generate Enable options usually used for instrumenting application to produce profile useful for later recompilation with profile feedback based optimization. You must use -fprofile-generate both when compiling and when linking your program. The following options are enabled: -fprofile-arcs, -fprofile-values, -fvpt.

You should also give the compiler hints about the architecture on which the program will run. For example if it will only run on a server and you can compile it on the same machine as the server, you can just use -march=native. Otherwise you need to determine which features your users will all have and pass the corresponding parameter to GCC.

(Apparently you're targeting 64-bit, so GCC will probably already include more optimizations than for generic x86.)

Ponkadoodle
Bastien Léonard
  • For those who are wondering how exactly to use guided optimization: https://stackoverflow.com/a/4366805 – Avamander Nov 30 '19 at 14:53
11

-Ofast

Please try -Ofast instead of -O3

Also here is a list of flags you might want to selectively enable.

-ffloat-store
-fexcess-precision=style
-ffast-math
-fno-rounding-math
-fno-signaling-nans
-fcx-limited-range
-fno-math-errno
-funsafe-math-optimizations
-fassociative-math
-freciprocal-math
-ffinite-math-only
-fno-signed-zeros
-fno-trapping-math
-frounding-math
-fsingle-precision-constant
-fcx-fortran-rules

A complete list of the flags and their detailed description is available here

TheCodeArtist
8

Apart from what others have already suggested, try -flto. It enables link-time optimization, which in some cases can work real magic.

For further information, see the LLVM description and the GCC optimize options.

Jendas
8

Consider using -fomit-frame-pointer unless you need to debug with gdb (yuck). That gives the compiler one more register to use for variables (otherwise that register is wasted on the frame pointer).

Also you may use something like -march=core2 or more generally -march=native to enable the compiler to use newer instructions and further tune the code for the specified architecture, but for this you must be sure your code will not be expected to run on older processors.

amaurea
ohcul
  • I find Visual Studio Code (with the Microsoft C/C++ extension) and GDB work quite nicely together for debugging when on Linux – Hydranix Jun 02 '16 at 20:28
  • As a side note, -O and -O{n} all enable -fomit-frame-pointer without having to call it out :) https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html – rogerdpack Jun 28 '16 at 22:38
  • You don't need a frame pointer to debug with gdb. – doug65536 Jul 03 '21 at 10:16
5

gcc -O3 is not guaranteed to be the fastest. -O2 is often a better starting point. After that, profile guided optimization and trying out specific options: http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html

It's a long read, but probably worth it.

Note that "Link Time Code Generation" (MSVC), aka "Link Time Optimization", is available in gcc 4.5+

By the way, there is no specific "fastcall" calling convention for Win64. There is only "the" calling convention: http://msdn.microsoft.com/en-us/magazine/cc300794.aspx

rubenvb
  • Concerning your point "By the way, there is no specific "fastcall" calling convention for Win64. There is only "the" calling convention": There also exists the "vectorcall" call convention under Win64: https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?redirectedfrom=MSDN&view=vs-2019 – Nubok Oct 28 '19 at 10:13
1

There is no 'fastcall' on x86-64 - both Win64 and Linux ABI define register-based calling ("fastcall") as the only calling convention (though Linux uses more registers).

Flo
  • Under 64 bit Windows, there also exists the "vectorcall" call convention: https://learn.microsoft.com/en-us/cpp/cpp/vectorcall?redirectedfrom=MSDN&view=vs-2019 – Nubok Oct 28 '19 at 10:15