
Before anyone tells me to look up old answers or RTFM: I've already done so, so please read the details before directing me elsewhere.

I've established that the difference between optimization levels isn't as simple as a set of extra optimization flags being enabled at the higher level.

For example, I first found the difference in optimization flags between -O0 and -O1 with these steps:

gcc -c -Q -O1 --help=optimizers > /tmp/O1-opts
gcc -c -Q -O0 --help=optimizers > /tmp/O0-opts
diff /tmp/O0-opts /tmp/O1-opts | grep enabled

This gave me the list of optimization flags that -O1 enables over -O0.

Then, I compiled the code with -O0 but added all the individual optimization flags that -O1 enables over -O0 — the result should be the same as -O1, right? Well, guess what: it's not!

So, this proves that the difference between optimization levels is not simply a different set of optimization flags — there must be more optimizations beyond the flags that gcc/g++ displays.

Please let me know if someone already knows the answer to this question; otherwise I'll have to dig through the GCC source code, which wouldn't be trivial for me. Thank you!

As to the reason I'm looking for this info: I have some AVX-512 code that experiences less than 3% L1D cache misses at -O0 (or with no optimization flag), but more than 37% (although the code gets faster) at -O1 and beyond. If I can figure out which (hidden) flag is causing it, I might be able to speed up the code even further. There are too many flags in the common.opt file in the GCC source code, so I've hit a wall.

  • There are some optimizations that don't have specific flags associated with them. Reading the source code of the GCC version you are using is probably the best way to proceed – M.M Feb 25 '20 at 02:35
  • Can't say about the optimization settings, but a difference in percent of cache misses may not be meaningful. You can have situations where the -O0 build has a bunch of redundant/repetitive memory accesses that lead to hits, and if they get eliminated by -O1 then the miss percentage may well go up. I'd suggest looking at the actual number of misses, not the percentage. – bg2b Feb 25 '20 at 02:35
  • @M.M Thanks for trying to help! – AlwaysNeedsHelp Feb 25 '20 at 04:04
  • @bg2b Thanks! You're absolutely right! The absolute number of L1D misses doesn't change much from O0 to O1. I see now why the percentage seems inflated. – AlwaysNeedsHelp Feb 25 '20 at 06:23
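bg2b's point can be illustrated with toy numbers (the counts below are made up purely for illustration, not measured from the code in question):

```shell
# Made-up counts, for illustration only: the same 1000 misses out of far
# fewer total loads yields a much higher miss *percentage*.
misses=1000
loads_O0=40000   # -O0: lots of redundant spill/reload accesses, mostly hits
loads_O1=2700    # -O1: redundant accesses optimized away
echo "O0 miss rate: $((100 * misses / loads_O0))%"   # 2%
echo "O1 miss rate: $((100 * misses / loads_O1))%"   # 37%
```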

1 Answer


-O0 is special, and implies spill/reload between every statement for consistent debugging: Why does clang produce inefficient asm with -O0 (for this simple floating point sum)? You'll still see vars being stored/reloaded, somewhat like volatile; there's no -f option to change that.

-O0 also means to disable optimization in general.

-f optimization options don't work at -O0; optimization has to be enabled (-Og or -O1 or higher) for them to do anything. (Except for maybe a couple special cases.) See also another Q&A reporting no difference in asm, and an answer quoting the GCC manual: "Not all optimizations are controlled directly by a flag."

You could maybe start from -O1 and add -fno-foo -fno-bar ... to disable the options that -O1 mentions enabling, and still get different code-gen than -O0.

Options are also visible in GCC's asm comments with -S -fverbose-asm -o- output.


Also, running slower (because of store/reload or any other reason) gives HW prefetch more time to keep up and have data ready in L2 or even L1d before a load uop executes and has a demand miss.

Peter Cordes