
With most C/C++ compilers, there's a flag you can pass to the compiler, -march=native, which tells it to tune the generated code for the micro-architecture and ISA extensions of the host CPU. Even if it doesn't go by the same name, there's typically an equivalent option for LLVM-based compilers such as rustc or swiftc.
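
For concreteness, here is a minimal sketch (the file and function names are mine, not from the question) of the kind of loop where the flag tends to matter, with example gcc invocations in the comments; rustc's rough equivalent, for what it's worth, is -C target-cpu=native.

    /* saxpy.c -- a loop the compiler can auto-vectorize when it is allowed
     * to target the host CPU's vector extensions. Example builds:
     *   gcc -O3 saxpy.c -o saxpy                 (generic x86-64 baseline: SSE2)
     *   gcc -O3 -march=native saxpy.c -o saxpy   (use the host's ISA extensions)
     */
    #include <stdio.h>

    #define N 1024

    /* y[i] += a * x[i] -- an element-wise update that vectorizes cleanly */
    static void saxpy(float a, const float *x, float *y, int n) {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }

    int main(void) {
        static float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = (float)i; y[i] = 1.0f; }
        saxpy(2.0f, x, y, N);
        printf("%f\n", y[N - 1]);   /* 1 + 2 * 1023 = 2047 */
        return 0;
    }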

In my own experience, this flag can provide massive speedups for numerically-intensive code, and it sounds like it would be free of compromises for code you're just compiling for your own machine. That said, I don't think I've seen any build system or static compiler that enables it by default:

  • Obviously, any compiler you invoke directly on the command line only uses it if you pass the flag yourself, so it's not on by default.

  • I can't think of any IDE that enables this by default.

  • I can't think of any common build system I've worked with (cmake, automake, cargo, spm, etc.) that enables it by default, even for optimized builds.

I can think of a few reasons for this, but none of them are really satisfactory:

  • Using -march=native is inappropriate for binaries that will be distributed to other machines. That said, I find myself compiling sources for my own machine much more often than for others, and this doesn't explain its lack of use in debug builds, where there's no intention for distribution.

  • At least on Intel x86 CPUs, it's my understanding that using AVX instructions only occasionally can degrade performance or power efficiency: the AVX unit is powered down when not in use, so it has to be powered back up before it can execute anything, and a lot of Intel CPUs downclock to run AVX instructions. Still, this only explains why AVX wouldn't be enabled, not why the code wouldn't be tuned for the particular micro-architecture's handling of regular instructions.

  • Since most x86 CPUs use fancy out-of-order superscalar pipelines with register renaming, tuning code for a particular micro-architecture probably isn't particularly important. Still, if it could help, why not use it?

Stargateur
lcmylin
  • Programmers tend to have nice computers, especially C++ programmers, since building large C++ apps is not much fun. Much nicer than their customers have. – Hans Passant Oct 04 '18 at 18:16
  • People generally like their compiled code to run on machines other than the one it was compiled on. –  Oct 04 '18 at 18:19
  • I think Gentoo users use it all the time. Other than that, it doesn't give you that much most of the time, and the binaries cannot be used on other machines. Don't forget that you are probably dynamically linking with other libs that might not be tuned, so optimizing your app like that might give you nothing. –  Oct 04 '18 at 18:20
  • I find it often makes a massive difference and frequently suggest it in answers when a user is looking for optimal speed: https://stackoverflow.com/a/52610569/2836621 – Mark Setchell Oct 04 '18 at 18:24
  • This “opinion-based” reason was misapplied to this question. As the text notes, that closure reason is for questions whose answers are likely to be “almost entirely based on opinions.” This is not a contentious issue, and the facts about using `-march=native` would be useful to present. It ought to be reopened. – Eric Postpischil Oct 04 '18 at 20:16
  • @EricPostpischil OP is already aware of the reasons not to use `-march=native` and they're included in the question. The only thing remaining to answer is how often programmers distribute compiled binaries to other machines or whether they should do so, which is mostly a matter of opinion. – interjay Oct 04 '18 at 22:06
  • This question would be better suited to a discussion forum – M.M Oct 04 '18 at 22:23
  • @interjay: A purpose of Stack Overflow is to create a repository of questions and answers to provide information for others seeking in the future, not merely to provide information to one person asking a question. – Eric Postpischil Oct 04 '18 at 23:03

4 Answers


Conservative

If you take a closer look at the defaults of gcc, the oldest compiler in your list, you'll realize that they are very conservative:

  • By default, on x86-64, only SSE2 is activated; not even SSE4 (see the sketch after this list).
  • The set of flags in -Wall and -Wextra has not changed for years; there are new useful warnings, but they are NOT added to -Wall or -Wextra.
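
You can check what your own compiler enables by default with a small probe like the hedged sketch below (the file name is mine; __SSE2__, __SSE4_2__, __AVX__, and __AVX2__ are standard GCC/Clang predefined macros). Build it with and without -march=native and compare the output.

    /* features.c -- print which x86 vector extensions the compiler is targeting.
     * Build and run, e.g.:
     *   gcc -O2 features.c && ./a.out
     *   gcc -O2 -march=native features.c && ./a.out
     */
    #include <stdio.h>

    int main(void) {
    #ifdef __SSE2__
        puts("SSE2 enabled");
    #endif
    #ifdef __SSE4_2__
        puts("SSE4.2 enabled");
    #endif
    #ifdef __AVX__
        puts("AVX enabled");
    #endif
    #ifdef __AVX2__
        puts("AVX2 enabled");
    #endif
        return 0;
    }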

Why? Because it would break things!

There are entire development chains relying on those convenience defaults, and any alteration brings the risk of either breaking them, or of producing binaries that will not run on the targets.

The more users, the greater the threat, so developers of gcc are very, very conservative to avoid world-wide breakage. And developers of the next batch of compilers follow in the footsteps of their elders: it's proven to work.

Note: rustc will default to static linking, and boasts that you can just copy the binary and drop it on another machine; obviously -march=native would be an impediment there.

Masses Friendly

And in truth, it probably doesn't matter. You actually recognized it yourself:

In my own experience, this flag can provide massive speedups for numerically-intensive code

Most code is full of virtual calls and branches (typically OO code) and not at all numerically-intensive. Thus, for the majority of code, SSE2 is often sufficient.

The few codebases for which performance really matters will require significant time invested in performance tuning anyway, both at code and compiler level. And if vectorization matters, it won't be left at the whim of the compiler: developers will use the built-in intrinsics and write the vectorized code themselves, as it's cheaper than putting up a monitoring tool to ensure that auto-vectorization did happen.
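
As a rough illustration of what "write the vectorized code themselves" looks like, here is a minimal sketch using Intel's standard SSE intrinsics (the file and function names are mine). It sticks to SSE, which is part of the conservative x86-64 baseline, so it builds without any extra -m flags (x86 only).

    /* add_arrays.c -- element-wise float addition written directly with SSE
     * intrinsics, so vectorization does not depend on the optimizer.
     * For brevity it assumes n is a multiple of 4. Build: gcc -O2 add_arrays.c
     */
    #include <stdio.h>
    #include <xmmintrin.h>   /* __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */

    static void add_arrays(const float *a, const float *b, float *out, int n) {
        for (int i = 0; i < n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(out + i, _mm_add_ps(va, vb));
        }
    }

    int main(void) {
        float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float b[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        float out[8];
        add_arrays(a, b, out, 8);
        for (int i = 0; i < 8; ++i)
            printf("%g ", out[i]);   /* prints 9 eight times */
        printf("\n");
        return 0;
    }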

Also, even for numerically intensive code, the host machine and the target machine might differ slightly. Compilation benefits from lots of cores, even at a lower frequency, while execution benefits from a high frequency and possibly fewer cores, unless the work is easily parallelizable.

Conclusion

Not activating -march=native by default makes it easier for users to get started, and since even performance seekers may not care much for it, there's more to lose than to gain.


In an alternative history where the default had been -march=native from the beginning, users would be used to specifying the target architecture explicitly, and we would not be having this discussion.

Matthieu M.
  • How do I request all those wonderful warnings without calling each one by name? – L29Ah Sep 24 '19 at 20:57
  • @L29Ah: Using Clang, you can use `-Weverything`. Using gcc, you have to enable them one at a time -- there are a few families of 2-3 warnings, but not order of magnitude improvements. – Matthieu M. Sep 25 '19 at 05:55

-march=native is a destructive flag. It makes the binary potentially incompatible with a lot of hardware (basically any CPU that is not a direct descendant of the one used for compilation). It is simply too dangerous to enable this by default.

Another important thing to consider is that -march=native's main end use is optimization. The default optimization flag is -O0 (no optimization), so it wouldn't make sense from this perspective either to enable it by default.
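
As a hedged aside on the compatibility point: the usual alternative to building the whole binary for one specific CPU is to check for the relevant features at run time and dispatch accordingly. GCC and Clang provide __builtin_cpu_supports for this on x86; a minimal sketch (file name is mine):

    /* cpu_check.c -- runtime feature detection, the usual way to ship one
     * binary that still exploits newer instructions where available.
     * Build: gcc -O2 cpu_check.c
     */
    #include <stdio.h>

    int main(void) {
        /* __builtin_cpu_supports is a GCC/Clang builtin on x86 targets. */
        if (__builtin_cpu_supports("avx2"))
            puts("AVX2 available: could dispatch to an AVX2 code path");
        else
            puts("no AVX2: falling back to the baseline (SSE2) code path");
        return 0;
    }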

bolov
  • What makes it dangerous? And about "The default optimization flag is -O0 (no optimization) so it wouldn't make sense from this perspective either to enable it by default": is there a law that says the default optimization flag must be `-O0`? – Stargateur Oct 05 '18 at 04:52
  • The reasoning about `-O0` doesn't make sense; `-march=native` has no effect with `-O0`, so there's no reason not to activate it by default, and it would still benefit any other optimization level. – Matthieu M. Oct 05 '18 at 06:48
  • @MatthieuM. my point is this: with `-O0` you basically don't care about optimizations. Since the default is "I don't care about optimization", why would a flag that is mainly used for optimization be enabled by default? – bolov Oct 05 '18 at 11:16
  • @bolov: It could be enabled by default when you ask for optimizations, no? – Matthieu M. Oct 05 '18 at 12:03
  • @MatthieuM. your great answer clearly shows you have a good grasp on this topic. Why are you being so difficult here? It would be wrong for an optimization flag to also modify the target architecture. And I am sure you know, understand and agree with this. – bolov Oct 05 '18 at 12:37
  • @bolov: I think you misunderstood my comment. My point is that there is ALWAYS an architecture specified, regardless of the optimization level. After all, even at `-O0`, the compiler needs to emit assembly instructions for a specific CPU family. For `-O0`, whether `-march=native` or `-march=` is the default still specifies the same family, so both are perfectly compatible with `-O0`; and whenever another optimization level is specified, `-march=native` is beneficial to performance. So, for me, the fact that `-O0` is the default doesn't matter for `-march`'s default. – Matthieu M. Oct 05 '18 at 14:04
  • Actually, it's wrong to say -O0 means you don't care about optimisation: -O0 is actually optimised for fast compile time. Higher optimisation levels slow down compilation, but developers do a lot of recompilation during the development cycle, so if you want a quick turnaround during development it's often better to compile and test at -O0. – Lie Ryan Oct 05 '18 at 17:14

You are thinking from the perspective of a power user, but the main audience of a compiler toolchain is not power users, but rather developers.

Most developers have a separate development machine and target production systems. In the case of consumer applications, the target system is other people's machines, with all their variance. Building for the lowest common denominator is a safe default because it reduces the chance of bugs that only occur outside the developer's own machine.

Of course there are cases where developers know that they'll be developing an application for a single target machine with a known architecture. But even in this case, most applications are not performance sensitive, so the safe default still works well enough, while developers who are working on performance-sensitive applications are usually more willing to spend time tweaking their build configurations.

Lie Ryan

The question is already answered; I'm just showing the difference between -O3 and -march=native. I am creating 3D videos with a lot of math. The original run, with -O3 not set and -march=native enabled:

100 out of 900 %11.2222 time left: 0:0:7 time since: 1
200 out of 900 %22.3333 time left: 0:0:6 time since: 2
300 out of 900 %33.4444 time left: 0:0:5 time since: 3
400 out of 900 %44.5556 time left: 0:0:6 time since: 5
500 out of 900 %55.6667 time left: 0:0:4 time since: 6
600 out of 900 %66.7778 time left: 0:0:3 time since: 7
700 out of 900 %77.8889 time left: 0:0:2 time since: 8
800 out of 900 %89 time left: 0:0:1 time since: 9
Finished it took 0:0:10

If I add the -O3 optimisation together with -march=native, the output is as follows:

100 out of 900 %11.2222 time left: 0:0:0 time since: 0
200 out of 900 %22.3333 time left: 0:0:3 time since: 1
300 out of 900 %33.4444 time left: 0:0:1 time since: 1
400 out of 900 %44.5556 time left: 0:0:2 time since: 2
500 out of 900 %55.6667 time left: 0:0:1 time since: 2
600 out of 900 %66.7778 time left: 0:0:1 time since: 3
700 out of 900 %77.8889 time left: 0:0:1 time since: 4
800 out of 900 %89 time left: 0:0:0 time since: 4
Finished it took 0:0:5

So the -O3 optimisation really helps.

EDIT: as per a new comment, the program has evolved a bit since yesterday, so times are a bit higher. This is -O3 with -march=native now:

100 out of 900 %11.2222 time left: 0:0:15 time since: 2
200 out of 900 %22.3333 time left: 0:0:10 time since: 3
300 out of 900 %33.4444 time left: 0:0:9 time since: 5
400 out of 900 %44.5556 time left: 0:0:7 time since: 6
500 out of 900 %55.6667 time left: 0:0:6 time since: 8
600 out of 900 %66.7778 time left: 0:0:4 time since: 9
700 out of 900 %77.8889 time left: 0:0:3 time since: 11
800 out of 900 %89 time left: 0:0:1 time since: 12
Finished it took 0:0:14

If I take out -march=native:

100 out of 900 %11.2222 time left: 0:0:15 time since: 2
200 out of 900 %22.3333 time left: 0:0:13 time since: 4
300 out of 900 %33.4444 time left: 0:0:11 time since: 6
400 out of 900 %44.5556 time left: 0:0:9 time since: 8
500 out of 900 %55.6667 time left: 0:0:7 time since: 10
600 out of 900 %66.7778 time left: 0:0:5 time since: 12
700 out of 900 %77.8889 time left: 0:0:3 time since: 14
800 out of 900 %89 time left: 0:0:1 time since: 16
Finished it took 0:0:18
  • Was your first run comparing `gcc -march=native` *without* any `-O` options at all? The default is `-O0`, no optimization, let alone auto-vectorization where `-march=native` is most often useful. (Although also for variable-count shifts and bit-manipulation). `-march=native` has very little effect without `-O2` or `-O3`; you should be comparing `-O3 -march=native` vs. `-O3` with the default generic target architecture and tuning. Possibly also with `-ffast-math` if you want to experiment with allowing some FP rounding differences and stuff like that. – Peter Cordes Mar 15 '23 at 07:18
  • Cheers. Yeah, you are right: -march=native with -O3 is a lot different from -march=native without -O3. Edited the results with this output. – Tangata rereke Mar 16 '23 at 21:42