When will compilers optimize assembly code in C/C++ source?

Question

Most compilers (e.g. VS2015, gcc) do not optimize inline assembly code, which lets us use new instructions that the compiler doesn't support.

But when should a C/C++ compiler optimize inline assembly?

It might do that when you request global optimization. Both g++ and MSVC support global optimization. — Cheers and hth. - Alf, Dec 23 '16 at 03:28
I'd hope never! If you cared enough to do inline assembly then you probably don't want it messed with. How does the compiler know that seemingly useless write to 0xbeefface isn't important to some embedded device? — John3136, Dec 23 '16 at 03:31
When you use inline assembly, you are basically telling the compiler that you know what you are doing and this will be better than what it can do. Why should the compiler attempt to optimize it? If you want the compiler to optimize your code you would write in the actual language the compiler is for, IMO. — Some programmer dude, Dec 23 '16 at 03:32
related: [**Can** compilers optimize assembly code in C/C++ source?](http://stackoverflow.com/questions/41285586/can-compilers-optimize-assembly-code-in-c-c-source). summary of answers: yes in theory they could, but in practice none that I know of do implement such a thing. This is one reason why [inline asm can *hurt* performance](https://gcc.gnu.org/wiki/DontUseInlineAsm), e.g. defeating constant-propagation optimizations when inlining with link-time code-gen. — Peter Cordes, Dec 23 '16 at 06:55
@Tommylee2k: I'm not sure. Maybe the near-duplicate question. I think what's happening here is an X-Y problem, and this OP hasn't realized that intrinsics are what you use if you want the compiler to optimize your code that uses fancy instructions, not inline asm. If that's the case, then it's not really a bad question, and answering it might help other people who are looking for "optimized inline asm" for the same reason. The "when should" part is pretty speculative and opinion-based, though. — Peter Cordes, Dec 23 '16 at 10:46
@John3136 Just to give a counter example, I have code here that moves an immediate to a register and then adds another immediate to that register right after (MOV and ADD) --> that could be seen as an optimization opportunity. Good thing it's not, or my program wouldn't work. (In case anyone wonders, yes, I really need this: I need to have them both separate so that the program can find all occurrences of one of them and replace them with a new value.) — Edw590, Mar 04 '22 at 11:38
@DADi590 That's not a counter example. It's an example of EXACTLY what I said :-). — John3136, Mar 04 '22 at 17:19
Wow. This is awkward. My apologies. I could swear I had seen another message there. But I saw some threads on this, I must have put the comment on the wrong one . Sorry haha. Well, then there's a real-life example of what you said in case anyone would like one xD — Edw590, Mar 05 '22 at 16:07
@user904963 I could, but I just left it to be an example of what John3136 said (why wouldn't someone want to optimize a MOV and an ADD both with immediate values on them? I didn't) — Edw590, Mar 11 '22 at 16:04
@DADi590 I was pointing out you may have responded to someone who deleted his or her comment since you thought you had seen a particular comment. — user904963, Mar 11 '22 at 17:30

Peter Cordes · Answer 1 · 2016-12-24T07:55:49.160

10

Never. That would defeat the purpose of inline assembly, which is to get exactly what you ask for.

If you want to use the full power of the target CPU's instruction set in a way that the compiler can understand and optimize, you should use intrinsic functions, not inline asm.

For example, instead of inline asm for popcnt, use `int count = __builtin_popcount(x);` (in GNU C, compiled with `-mpopcnt`). Inline asm is compiler-specific too, so if anything intrinsics are more portable, especially if you use Intel's x86 intrinsics, which are supported across all the major compilers that can target x86. Use `#include <x86intrin.h>` and you can use `int _popcnt32 (int a)` to reliably get the popcnt x86 instruction. See Intel's intrinsics finder/guide, and other links in the x86 tag wiki.
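As a rough sketch of the intrinsic approach described above (not the answer's own code): the wrapper below uses `_popcnt32` from `<x86intrin.h>` when the compiler reports popcnt support through the `__POPCNT__` macro, and falls back to `__builtin_popcount` otherwise. It assumes a GNU C compiler (gcc or clang). The names `popc_intrin`, `total_set_bits` and `demo_popc_intrin` are made up for illustration.

```c
#include <assert.h>
#if defined(__POPCNT__)
#include <x86intrin.h>   /* only pulled in when targeting x86 with -mpopcnt */
#endif

/* Count set bits via an intrinsic the compiler understands and can optimize. */
static inline int popc_intrin(unsigned x) {
#if defined(__POPCNT__)
    return _popcnt32(x);            /* maps to the popcnt instruction */
#else
    return __builtin_popcount(x);   /* compiler picks a suitable fallback */
#endif
}

/* Hypothetical helper: total number of set bits in an array of words. */
int total_set_bits(const unsigned *words, int n) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += popc_intrin(words[i]);
    return total;
}

/* Small self-check with hand-computed expectations. */
void demo_popc_intrin(void) {
    const unsigned words[] = {0u, 1u, 3u, 0xFFu};   /* 0 + 1 + 2 + 8 bits */
    assert(popc_intrin(0xF0F0u) == 8);
    assert(total_set_bits(words, 4) == 11);
}
```

Because the compiler knows what the intrinsic computes, it is free to fold calls with constant arguments, vectorize the loop, or schedule the instruction as it sees fit.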

    int count(){
      int total = 0;
      for(int i=0 ; i<4 ; ++i)
        total += popc(i);
      return total;
    }

Compiled with `#define popc _popcnt32` by gcc 6.3:

    mov     eax, 4
    ret

clang 3.9 with an inline-asm definition of popc, on the Godbolt compiler explorer:

    xor     eax, eax
    popcnt  eax, eax
    mov     ecx, 1
    popcnt  ecx, ecx
    add     ecx, eax
    mov     edx, 2
    popcnt  edx, edx
    add     edx, ecx
    mov     eax, 3
    popcnt  eax, eax
    add     eax, edx
    ret

This is a classic example of inline asm defeating constant propagation, and why you shouldn't use it for performance if you can avoid it: https://gcc.gnu.org/wiki/DontUseInlineAsm.

This was the inline-asm definition I used for this test:

    int popc_asm(int x) {
      // force use of the same register because popcnt has a false dependency on its output, on Intel hardware
      // this is just a toy example, though, and also demonstrates how non-optimal constraints can lead to worse code
      asm("popcnt %0,%0" : "+r"(x));
      return x;
    }

If you didn't know that popcnt has a false dependency on its output register on Intel hardware, that's another reason you should leave it to the compiler whenever possible.
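If inline asm really can't be avoided, one GNU C idiom that may limit the damage to constant propagation is to test `__builtin_constant_p` and use plain C (or a builtin) whenever the operand is known at compile time. The following is only a sketch under assumptions: it is not part of the original test, the names `popc_hybrid`, `count_hybrid` and `demo_popc_hybrid` are hypothetical, and whether the constant path is actually taken depends on the compiler, inlining, and optimization level.

```c
#include <assert.h>

/* Use the asm only for run-time values; let the compiler fold constants. */
static inline int popc_hybrid(unsigned x) {
#if defined(__POPCNT__) && (defined(__x86_64__) || defined(__i386__))
    if (!__builtin_constant_p(x)) {
        /* Same toy constraint as popc_asm above: one read-write register. */
        __asm__("popcnt %0,%0" : "+r"(x) : : "cc");
        return (int)x;
    }
#endif
    /* Compile-time constant, or a build without popcnt: something the optimizer can see through. */
    return __builtin_popcount(x);
}

/* Same shape as count() above: popcounts of 0, 1, 2, 3 sum to 4. */
int count_hybrid(void) {
    int total = 0;
    for (int i = 0; i < 4; ++i)
        total += popc_hybrid((unsigned)i);
    return total;
}

void demo_popc_hybrid(void) {
    assert(count_hybrid() == 4);
    assert(popc_hybrid(0xFFu) == 8);
}
```

This is still more fragile than simply using the intrinsic, which is the answer's actual recommendation.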

Using special instructions that the compiler doesn't know about is one use-case for inline asm, but if the compiler doesn't know about it, it certainly can't optimize it. Before compilers were good at optimizing intrinsics (e.g. for SIMD instructions), inline asm for this kind of thing was more common. But we're many years beyond that now, and compilers are generally good with intrinsics, even for non-x86 architectures like ARM.

edited Dec 24 '16 at 07:55

answered Dec 23 '16 at 07:31

Peter Cordes


I have actually heard of such a thing as an "optimizing assembler", though I don't know of any for x86 (Google turned up [this](https://github.com/hundt98847/mao), but it appears to be dead). Most are either for embedded systems, or RISC-style architectures, where programming in assembly is exceedingly tedious because of all the registers and the nuances of instruction scheduling. So theoretically, integrating such a thing into a C compiler's inline assembly would be possible. I don't agree that this would defeat the purpose of inline asm, assuming it actually worked well! – Cody Gray - on strike Dec 23 '16 at 08:32
For example, writing in assembly for Itanium is a giant pain in the rear because you have to pay attention to [instruction bundles and slots](https://blogs.msdn.microsoft.com/oldnewthing/20150728-00/?p=90811), and a bunch of weird rules. The ISA was pretty much designed for a C/C++ compiler, and is so complicated that an optimizer is virtually required to have any hope of getting half-decent object code. An optimizing assembler would be rather cool. Although I guess the syntax of asm would make it difficult to implement. How would it know which instructions can be reordered? – Cody Gray - on strike Dec 23 '16 at 09:07
Why would you use *inline*-asm in the first place (instead of intrinsics) if you wanted the compiler to grok it and emit different instructions? The only reason I can think of is that C can't portably express e.g. an arithmetic right shift, and various other deficiencies. Intrinsics are the solution to the problem that I think this OP really has, and they're stuck in an X-Y problem on compiler-optimized inline-asm. – Peter Cordes Dec 23 '16 at 10:37
@CodyGray: Also, I have heard of whole-program asm-to-asm optimizers for x86 that treat asm as a source language ("binary optimizer"). And I did notice the over-generalization of calling it "the purpose of inline asm", when really choosing an exact insn sequence is only one of many purposes. I thought about weaker wording, but decided to go with it since for performance reasons, choosing your own insn sequence is pretty much the only thing it gains you over intrinsics. And intrinsics almost always exist, since they're so much better than inline asm. – Peter Cordes Dec 23 '16 at 10:43
[STOKE](https://github.com/StanfordPL/stoke) is one type of optimizer that works at the assembly level. Of course, there are all kinds of pitfalls with that, such as either needing to be very conservative about assumptions about external state like memory (and ordering) or being less conservative and breaking things. So STOKE uses a specification and/or formal methods to prove equivalence, but you still have to make some assumptions. For inline assembly, `gcc` makes the programmer tell it about the side effects of asm blocks so it can optimize around them. – BeeOnRope Dec 23 '16 at 21:59
Well, if the compiler is able to optimize the inline assembly, then there should be flags not to do it. – kelalaka Feb 08 '20 at 15:28

BeeOnRope · Answer 2 · 2016-12-23T19:40:14.893

In general, compilers will not optimize the content of your inline assembly. That is, they won't remove or change instructions in your assembly block. In particular, gcc simply passes through the body of your inline assembly unchanged to the underlying assembler (gas in this case).

However, good compilers may optimize around your inline assembly, and in some cases may even omit execution of the inline assembly code entirely! Gcc, for example, can do this if it determines that the declared outputs of the assembly are dead. It can also hoist an assembly block out of a loop or combine multiple calls into one. So it never messes with the instructions inside the block, but it is entirely reasonable for it to change the number of times a block is executed. Of course, this behavior can also be disabled (in gcc, with the volatile qualifier) if the block has some other important side effect.
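A small sketch of what "optimizing around" an asm block can look like in GNU C. The template is deliberately empty so the code builds on any gcc or clang target; the operands alone tell the compiler how to treat the statement. The function names are hypothetical, and the comments describe what the compiler is permitted to do, not what a particular version will do.

```c
#include <assert.h>

/* Non-volatile asm with an output operand: treated as a pure function of its
   inputs, so it may be deleted when the result is unused, or hoisted or merged. */
static inline int opaque(int x) {
    __asm__("" : "+r"(x));            /* empty template: value passes through unchanged */
    return x;
}

/* volatile asm: the compiler must keep it even if the result is ignored. */
static inline int opaque_volatile(int x) {
    __asm__ volatile("" : "+r"(x));
    return x;
}

int sum_invariant(int n, int k) {
    int total = 0;
    for (int i = 0; i < n; ++i)
        total += opaque(k);           /* loop-invariant: may be hoisted out of the loop */
    return total;
}

void demo_around_asm(void) {
    (void)opaque(1);                  /* output is dead: statement may be removed */
    (void)opaque_volatile(1);         /* kept because of volatile */
    assert(sum_invariant(4, 3) == 12);
}
```

To see the difference in compiler output, replace the empty template with a real instruction or an assembler comment for your target and compare the generated assembly at `-O2`.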

The gcc docs on extended asm syntax have some good examples of all of this stuff.
