A good optimizing compiler will already know the relevant details of hardware prefetch, caches, branch prediction, the pipeline, etc., but you will need to tell the compiler which specific CPU you are targeting. For gcc, use the -march and -mtune options as a starting point.
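To check what a given -march setting actually enabled, you can look at the compiler's predefined macros. The sketch below assumes an x86 target and gcc or clang, both of which define macros such as __SSE2__ and __AVX2__ when the corresponding instruction set is allowed. The function names are made up for illustration.

```c
#include <stdio.h>

/* Build it twice and compare the report, for example:
 *   gcc -O2 features.c
 *   gcc -O2 -march=native -mtune=native features.c
 * (features.c is a hypothetical file name.)
 */

/* Name of the widest x86 SIMD extension the compiler was allowed to use
 * for this translation unit, judged only from predefined macros. */
const char *widest_simd_enabled(void)
{
#if defined(__AVX512F__)
    return "AVX-512F";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#elif defined(__SSE4_2__)
    return "SSE4.2";
#elif defined(__SSE2__)
    return "SSE2";
#else
    return "none detected";
#endif
}

/* Print a one-line report; call this from main(). */
void print_target_report(void)
{
    printf("widest SIMD extension enabled at compile time: %s\n",
           widest_simd_enabled());
}
```

This only shows what the compiler was permitted to emit, not whether it made the code faster. You still have to measure that.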
Experiment with different compilers, such as clang and the Intel C compiler.
Profile your program with various input data, and identify where the bottlenecks are, then look into how to write faster code for the bottlenecks. There's almost always more to be gained by using a smarter algorithm than there is in tweaking the assembly code for a particular bottleneck.
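As a rough illustration of the algorithm point, here is a minimal sketch comparing a linear scan with a binary search over a sorted array. The function names and the timing helper are hypothetical, and clock() is only a coarse measurement. A real profiler will tell you more.

```c
#include <stddef.h>
#include <time.h>

/* O(n): checks every element until it finds key. */
int contains_linear(const int *a, size_t n, int key)
{
    for (size_t i = 0; i < n; i++)
        if (a[i] == key)
            return 1;
    return 0;
}

/* O(log n): requires a[] to be sorted in ascending order. */
int contains_binary(const int *a, size_t n, int key)
{
    size_t lo = 0, hi = n;              /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] == key)
            return 1;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return 0;
}

/* CPU seconds spent on `reps` lookups of `key` using the given function. */
double time_lookups(int (*contains)(const int *, size_t, int),
                    const int *a, size_t n, int key, int reps)
{
    volatile int sink = 0;              /* keeps the calls from being optimized away */
    clock_t start = clock();
    for (int r = 0; r < reps; r++)
        sink += contains(a, n, key);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Call time_lookups() once with each function on a large sorted array and a key that is not present. On large inputs the gap between the two is usually far bigger than anything compiler flags or hand-tuned assembly would win back for the linear version.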