
According to most benchmarks, Intel's Clear Linux is way faster than other distributions, mostly thanks to a GCC feature called Function Multi-Versioning (FMV). Right now the method they use is to compile the code, analyze which functions contain vectorized loops, then patch the code with FMV attributes and compile it again.
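For reference, here is a minimal sketch of what such a patched function might look like, using GCC's `target_clones` attribute. The function name `scale_add`, the `MULTIVERSION` macro, and the chosen target list are illustrative assumptions, not taken from Clear Linux's actual patches.

```c
#include <stddef.h>

/*
 * Assumption: x86-64 Linux with GCC (or a recent Clang) and ifunc support.
 * On other targets the macro expands to nothing and only one plain version
 * of the function is built.
 */
#if defined(__x86_64__) && defined(__GNUC__) && defined(__linux__)
#define MULTIVERSION __attribute__((target_clones("avx2", "avx", "default")))
#else
#define MULTIVERSION
#endif

/*
 * The compiler emits one clone per listed target plus a resolver that picks
 * the best clone for the running CPU when the program is loaded.
 * scale_add is a hypothetical example of a loop the vectorizer can handle.
 */
MULTIVERSION
void scale_add(float *restrict dst, const float *restrict src, float k, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] += k * src[i];
}
```

Callers use `scale_add` like any other function; the dispatch is invisible at the call site, which is why the attribute has to be added per function rather than switched on globally.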

How feasible would it be for GCC to do this automatically? For example, by passing -mmultiarch=sandybridge,skylake (or a similar -m option listing CPU extensions like AVX and AVX2).

Right now I'm interested in two usage scenarios:

  1. Use this option when building releases of our large, math-heavy program for our customers. I don't want to pollute the code with non-standard attributes and I don't want to modify the third-party libraries we use.
  2. Other Linux distributions would be able to do this easily, without patching the code as Intel does. This should give all Linux users massive performance gains.
Peter Cordes
Alexander
  • Why not just build the whole program for each target arch, and ship it with a shell script that decides which build to run? You're potentially compiling multiple versions of _every_ function (loop vectorization might be the only case _you_ care about, but lots of other code might also benefit from target-specific optimization), so just compile multiple versions of everything in the first place. – Useless Feb 20 '18 at 11:55
  • @Useless Considering that the combined binaries are about 300MB total, and quite a lot of it is GUI code, it's just not practical. Sure, the math code is mostly in separate libraries, but a shell script won't help with that. According to docs, the size overhead of FMV is negligible. – Alexander Feb 20 '18 at 12:57
  • If the math code is in separate libraries, you can make target-specific builds of _those_ and set the appropriate `LD_LIBRARY_PATH` in your wrapper script. If you have other specific requirements (such as 300MB being considered a lot, or having lots of code you specifically want _not_ to optimize), put it in the question. – Useless Feb 20 '18 at 13:03
  • You could also implement the "GUI code" so it detects properties of the host system, and dynamically loads appropriate versions of the libraries at startup. That means only a need for one version of the GUI, but you'll need a scheme of working out how to map detected host properties to particular versions of each library. – Peter Feb 20 '18 at 13:10
  • While it is possible to do all that (with various degrees of code size / developer time overhead), this question is about FMV and the feasibility of gcc doing it automatically. – Alexander Feb 20 '18 at 13:44
  • I'm also very interested in this, for most of the same reasons you are. I asked about it here [link](https://stackoverflow.com/questions/39979926/is-there-or-will-there-be-a-global-version-of-the-target-clones-attribute) but the question didn't gain any traction. It's frustrating, because all the hard parts are done. – bolind May 04 '18 at 12:39
  • Heck, I even made a gcc feature request: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=78464 – bolind May 04 '18 at 12:43
  • It's not automatic within GCC, but [make-fmv-patch](https://www.phoronix.com/scan.php?page=news_item&px=GCC-Clear-make-fmv-patch) still sounds as good as you can get. – mirh Dec 02 '19 at 03:30
  • Since it bloats the code, it needs to be invoked thoughtfully. As more models are supported, the code would need to be recompiled to pick them up, and would grow even bigger. Some of the variations might have minuscule speed improvements in exchange for the bloat. You might try selling it to the gcc team as a for-speed-over-size-unconditionally sort of option! People with real speed needs, like arbitrage stock traders, first pick a CPU-of-the-week and code for that. Note that bad code design slows things more than wrong architecture! The world is moving more toward parallelism! – David G. Pickett Jul 02 '20 at 20:25
  • Multi-versioning also defeats inlining so it's very important to pick good boundaries for where to partition. Some functions simplify *a lot* when inlined into a caller that passes compile-time constants for some of the args, or a known-positive value, or whatever. Things like BMI1 / BMI2 are most useful when used everywhere in your code, not just in functions with big loops. – Peter Cordes Aug 22 '20 at 21:22
  • @Sajal Pushpad - Your edit introduced a problem: "bypassing" is a word, but it has a different meaning (like "going around") than the phrase "by passing". I changed that part back. The rest of your edit is pretty minor, not really worth the reviewers' time. Nothing in the question was actually worded wrong, and lower-case "gcc" is acceptable. It's true that upper-case GCC is more correct when we're talking about the GNU Compiler Collection, that includes `gcc`, `g++`, `gfortran`, etc, but IMO that's so minor it's only worth fixing if you have 2000 rep to bypass the review queue. – Peter Cordes Jan 21 '21 at 18:05
  • @Sajal: Still, thanks for your effort to improve Stack Overflow, and welcome to the site. – Peter Cordes Jan 21 '21 at 18:06
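
The "detect the host, then load the matching library" alternative raised in the comments above could be sketched roughly as follows. This is only an illustration under stated assumptions: the function `pick_math_lib` and the library file names are hypothetical, and the feature checks rely on the GCC/Clang x86 builtins `__builtin_cpu_init` and `__builtin_cpu_supports`.

```c
/*
 * Pick the file name of a target-specific build of a math library.
 * All three names are hypothetical; a real project would define its own
 * naming scheme and its own mapping from CPU features to builds.
 */
const char *pick_math_lib(void)
{
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    __builtin_cpu_init();                 /* harmless if already initialised */
    if (__builtin_cpu_supports("avx2"))
        return "libmath-avx2.so";
    if (__builtin_cpu_supports("avx"))
        return "libmath-avx.so";
#endif
    return "libmath-generic.so";          /* baseline build for everything else */
}
```

The returned name could then be passed to `dlopen`, or used by a launcher to choose a directory for `LD_LIBRARY_PATH`. Compared with FMV, this keeps inlining intact inside each library build, at the cost of shipping several copies of the library.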

1 Answer

No, but it doesn't matter. There's very, very little code that will actually benefit from this. For the most part, doing it globally will just make your system much more memory-constrained and slower due to the huge increase in code size (unless you make a special effort to sort matching versions together in pages). Most real workloads aren't even CPU-bound; they're syscall-overhead-bound, GPU-bound, IO-bound, etc. And many of the modern ones that are CPU-bound aren't running precompiled code but JIT'd code (i.e. everything running in a browser, whether that's your real browser or the outdated and unpatched fork of Chrome in every Electron app).

R.. GitHub STOP HELPING ICE
  • I never said this was about the desktop (Electron, etc.); the benchmarks linked above clearly show vast improvement in many areas. Besides, my desktop (KDE Plasma) is mostly written in C++, with almost no JIT there. This could also be a somewhat suitable replacement for profile-guided optimization, which definitely improves snappiness of some programs, but is unavailable in most Linux packages. – Alexander Dec 02 '20 at 06:47