I'd recommend compiling new code with -std=gnu11, or -std=c11 if you need strict ISO C without GNU extensions. Fixing your code so it compiles without any -Wall warnings is usually a good idea, IIRC. -Wextra also warns about some things you might not want to change.
A good way to check how something compiles is to look at the compiler asm output. http://gcc.godbolt.org/ formats the asm output nicely (stripping out the noise). Putting some key functions up there and looking at what different compiler versions do is useful if you understand asm at all.
Use a recent compiler version. gcc and clang have both improved significantly in newer versions. gcc 5.3 and clang 3.8 are the current releases as of this writing. gcc5 makes noticeably better code than gcc 4.9.3 in some cases.
If you only need the binary to run on your own machine, you should use -O3 -march=native.
If you need the binary to run on other machines, choose the baseline for instruction-set extensions with options like -mssse3 -mpopcnt. You can use -mtune=haswell to optimize for Haswell while still making code that runs on older CPUs (the set of CPUs it runs on is determined by -march and the -m options, not by -mtune).
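As a rough illustration of what the baseline changes, here is a minimal sketch using the GCC/clang builtin __builtin_popcountll. The function names count_set_bits and count_set_bits_array are made up for this example. With -mpopcnt (or a -march that implies it) the compiler can emit a popcnt instruction for the builtin; without it, you should expect a slower generic fallback. Comparing the asm with and without the flag is an easy way to check.

```c
#include <stdint.h>

/* Count set bits in one 64-bit word.
 * __builtin_popcountll is a GCC/clang builtin, not ISO C.
 * With popcnt in the baseline it can compile to a single instruction. */
static inline int count_set_bits(uint64_t x)
{
    return __builtin_popcountll(x);
}

/* Hypothetical helper: total number of set bits in an array of words. */
int count_set_bits_array(const uint64_t *words, int n)
{
    int total = 0;
    for (int i = 0; i < n; i++)
        total += count_set_bits(words[i]);
    return total;
}
```

The result is the same either way; only the generated code (and its speed) depends on the baseline you pick.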
If your program doesn't depend on strict FP rounding behaviour, use -ffast-math. If it does, you can usually still use -fno-math-errno and similar options, without enabling -funsafe-math-optimizations. Some FP code can get big speedups from fast-math, for example because it lets the compiler auto-vectorize loops it otherwise couldn't (FP addition isn't associative, so reordering it requires permission).
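A typical case where fast-math matters is a floating-point reduction. The sketch below uses a hypothetical function name, sum_floats. Since FP addition is not associative, gcc generally has to add the elements strictly in order; with -ffast-math (or just -fassociative-math and its prerequisites) it is allowed to keep several partial sums in a vector register and combine them at the end.

```c
#include <stddef.h>

/* Sum a float array.
 * In strict FP mode the additions must happen in source order,
 * which blocks vectorizing this loop. -ffast-math lets the compiler
 * reorder them, so the result can differ in the last bits. */
float sum_floats(const float *a, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```

Whether the slightly different rounding is acceptable depends on your program, which is why this is opt-in.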
If you can usefully do a test-run of your program that exercises most of the code paths that need to be optimized for a real run, then use profile-directed optimization:
gcc -fprofile-generate -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
./my_program -option1 < test_input1
./my_program -option2 < test_input2
gcc -fprofile-use -Wall -Wextra -std=gnu11 -O3 -ffast-math -march=native -fwhole-program *.c -o my_program
-fprofile-use enables -funroll-loops, since the profile gives the compiler enough information to decide when unrolling is actually worthwhile. Unrolling loops all over the place can make things worse. However, it's still worth trying -funroll-loops without profiling to see if it helps.
If your test runs don't cover all the code paths, then some important ones will be marked as "cold" and optimized less.
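To make the "cold" problem concrete, here is a small sketch with a hypothetical function, clamp_samples. If none of your test inputs contain out-of-range values, the profile records both clamping branches as never taken, and the compiler will likely optimize for the in-range case only. A real run with lots of out-of-range data would then spend its time in code that was treated as unimportant.

```c
#include <stddef.h>

/* Hypothetical example: clamp samples into [0, 255] in place and
 * return how many values had to be clamped. Which branches count as
 * hot or cold under -fprofile-use depends entirely on the test input. */
size_t clamp_samples(int *samples, size_t n)
{
    size_t clamped = 0;
    for (size_t i = 0; i < n; i++) {
        if (samples[i] < 0) {
            samples[i] = 0;
            clamped++;
        } else if (samples[i] > 255) {
            samples[i] = 255;
            clamped++;
        }
    }
    return clamped;
}
```

So pick test inputs that look like your real workload, including the awkward cases.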
-O3 enables auto-vectorization, which -O2 doesn't in these gcc versions. This can give big speedups.
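The kind of loop that benefits looks something like this sketch (scale_and_add is a made-up name). The restrict qualifiers promise the compiler that the arrays don't overlap, which makes it easier for it to vectorize without runtime overlap checks. Whether and how it actually vectorizes depends on the compiler version and -march, so check the asm output.

```c
#include <stddef.h>

/* dst[i] = a[i] * scale + b[i] for each element.
 * Independent iterations over contiguous arrays are what the
 * auto-vectorizer handles best; restrict says the arrays don't alias. */
void scale_and_add(int *restrict dst, const int *restrict a,
                   const int *restrict b, int scale, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] * scale + b[i];
}
```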
-fwhole-program allows cross-file inlining, but only works when you put all the source files on one gcc command-line. -flto (Link-Time Optimization) is another way to get the same effect. clang supports -flto but not -fwhole-program.
-fomit-frame-pointer has been the default for a while now for x86-64, and more recently for x86 (32bit).
As well as gcc, try compiling your program with clang. Clang sometimes makes better code than gcc, sometimes worse. Try both and benchmark.