
I've heard a couple of times that the compiler will not optimize inline assembly, or that inline assembly is sort of a black box for it. I was suspicious, and because I hadn't seen any cases where the compiler failed, I didn't care.

But today I found a page on the GCC wiki titled DontUseInlineAsm. It lists the same issues people have told me about before, but gives no details on why the compiler wouldn't understand inline asm and therefore wouldn't optimize it. So, does anyone know the reasons why compilers don't do these optimizations?

Of course, I'm putting away special cases like

```c
asm volatile("" : : "g"(value) : "memory");
```
or
```c
asm volatile("" : : : "memory");
```

when we are explicitly telling the compiler that this code has visible side effects and therefore it shouldn't optimize it away.

zx485
Kostya
  • Optimizing the code itself and "optimizing away" (that is removing) are two different things, not sure which one you are asking about. – Jester Nov 13 '16 at 19:44
  • @jester, yep, I understand the difference. I just wanted to point out that I understand there are cases where the compiler won't be able to do anything with inline assembly – Kostya Nov 13 '16 at 19:47
  • The typical process is to set the compiler to maximum optimization and compile the function. You print out the assembly language listing of the function. Use this as a base for your assembly code; optimize the compiler's generated assembly. – Thomas Matthews Nov 13 '16 at 21:29

2 Answers


Your question appears to be based on the wrong assumption that a compiler first produces assembly and then, if you want optimized output, reads back the assembly it produced, optimizes it, and writes it out again. If that were the case, then it would be no big deal to also read and optimize your inline assembly, right?

The compiler does not optimize your inline assembly because the compiler does not optimize any assembly at all, ever. It has no means of understanding assembly at the level required to perform optimizations on it; that is simply not its business.

The compiler produces optimized machine code by doing special tricks with its internal data structures (parse trees, intermediate representations like p-code, etc.), which are not assembly.

If an assembly-generation step is involved, it is a write-only step, meaning that the compiler will generate this assembly for you but will never attempt to read it back. That's the job of an assembler, and I have never heard of an optimizing assembler.

Therefore, it is pretty safe to assume that no compiler will ever attempt to optimize anyone's inline assembly.

And I do not know about you, but frankly, I would be pretty annoyed if a compiler ever attempted to modify my inline assembly. If I am to use assembly, I will do it precisely because I know (or I think I know) better than the compiler.

Mike Nakis
  • Let's take for example the `rotl` (left cyclic shift) instruction. There is no builtin function for it in GCC, and since I don't want to write inline assembly every now and then, I create a function `CyclicShiftLeft(x, n)`. Then when I write `CyclicShiftLeft(CyclicShiftLeft(x, n), k)` (maybe as a result of inlining), I would expect it to be optimised into `CyclicShiftLeft(x, n + k)`. I think the compiler should be able to do such optimisations (I didn't check), but to do them the compiler would have to be able to optimise inline assembly. – Kostya Nov 13 '16 at 20:04
  • 1
    Well, yes, except that you cannot really dictate what the compiler *should* be able to do, unless you are head of development in a company that produces compilers. A compiler will never optimize inline assembly, therefore you will never achieve this fairly simple optimization that your example requires. On the other hand, if the GCC is smart enough, then it may be able to detect that a sequence of C instructions that emulate a ROTL by doing shift-and-mask is in fact trying to do a ROTL, and replace them with an actual ROTL instruction. Have you tried that? – Mike Nakis Nov 13 '16 at 20:12
  • I've tried it, and it replaced a hand-rolled C implementation of ROTL with a single instruction :) I'm just curious, because from my perspective the compiler could replace assembly with its internal byte code and then apply its usual optimisation pipeline. – Kostya Nov 13 '16 at 20:18
  • 2
    Right, so I predicted correctly. I have a snug feeling about this now. C-:= Well, as I said, the compiler does not do such things as "replace assembly with it's internal byte code". Transforming assembly to anything but machine code is an extremely hard task, and I doubt that there is any tool out there that is even remotely successful at it. So, if you feel you are up to it, then get yourself a Ph.D. and ahead of you lies nothing but fame and glory. – Mike Nakis Nov 13 '16 at 20:24
  • "Transforming assembly into anything but machine code is an extremely hard task". Is this your area of expertise, or are you just being offensive? (I'm a little confused by this "get yourself a Ph.D.".) There are virtual machines that do binary translation, and assembly conversion tools. Why should translating assembly into well-defined byte code be an impossible task? – Kostya Nov 13 '16 at 20:38
  • Sorry, I did not mean to be offensive. But sometimes I come across this way. I suppose I was not clear enough about what I meant. Tools that transform assembly tend to transform it into something of equal or lower level, not higher level. I postulate that you need to go to a higher level in order to be able to detect relationships between instructions so as to perform optimizations on them. But there is no law of nature that requires that, so it is an open field for research. (And that's what Ph.Ds do.) – Mike Nakis Nov 13 '16 at 20:57
  • 2
    @KostyaBazhanov: Mike is correct. Compilers transform C or C++ into an internal representation of what the code does, and optimizes that. binary to binary optimization is a much harder task, because there's a lot less information available. e.g. in C, the compiler knows that the values in temporary registers at the end of a function don't matter, only the return value. An asm to asm optimizer would have to figure out whether leaving a different value in EDX was ok or not when modifying the code for a function. – Peter Cordes Nov 13 '16 at 22:54
  • 1
    @KostyaBazhanov: Binary to binary optimizers exist, but are very rarely used (and aren't how gcc works). [This question](http://stackoverflow.com/questions/4394609/are-there-any-asm-compilers) is about that (but the answers aren't really what I'm thinking of. I haven't found any specific mentions of binary recompilation. IIRC, @ Ira Baxter has posted about that, and even developed such a tool). Basically, your mental model of how how compilers work is wrong, and that's why you thought it would be easy for them to optimize inline asm. So your question makes sense, but this is the answer. – Peter Cordes Nov 13 '16 at 23:02
  • @KostyaBazhanov: Left/right rotations are tricky to code in C without any undefined behaviour for any shift count, while still compiling to just a single instruction. See http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c for a version that works. gcc (and clang and others) recognize that pattern of data movement and can use a ROL or ROR for it. Some compilers even have similar pattern-recognition for instructions like POPCNT, replacing a whole loop with one instruction. This is more than just a "peephole" asm->asm pass. – Peter Cordes Nov 13 '16 at 23:09
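The rotate idiom discussed in these comments can be sketched as follows (a minimal example; `rotl32` is an illustrative name, and single-instruction codegen depends on the compiler and target):

```c
#include <stdint.h>

/* Portable rotate-left with no undefined behaviour for any count:
   masking keeps both shift amounts in [0, 31], and the (-n & 31)
   trick handles n == 0 without ever shifting by 32. GCC and Clang
   recognize this shift-and-or pattern and emit a single rotate
   instruction (e.g. ROL on x86) on targets that have one. */
static uint32_t rotl32(uint32_t x, unsigned n) {
    n &= 31;
    return (x << n) | (x >> (-n & 31));
}
```

Because this is ordinary C rather than inline asm, the compiler can also compose and fold it: `rotl32(rotl32(x, 3), 5)` optimizes the same as `rotl32(x, 8)`, which is exactly the optimization the question's inline-asm version cannot get.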

Compilers don't optimize inline assembly because that would defeat its purpose. Inline assembly is used when the programmer feels they know better than the compiler: either the programmer thinks they can generate better code, or the compiler isn't capable of generating the code they want.

In the former case, the programmer is optimizing the assembly code themselves; if the compiler isn't doing a good enough job of optimizing the equivalent C code, it's not likely to be able to improve the assembly code either. In the latter case there is no equivalent C code: the inline assembly uses instructions or other assembly features that the compiler isn't capable of generating. In that case it's also unlikely that the compiler understands what those instructions actually do well enough to optimize the code.

No compiler is capable of translating inline assembly into its internal "byte code", as you suggested in a comment. GCC treats inline assembly as a string to paste into its assembly output; it has absolutely no understanding of the code inside the string. Clang doesn't normally generate assembly as output, so it has a builtin assembler, but it doesn't really understand the assembly code either; it just translates it into machine code, which it inserts into the object file output. Microsoft's compiler is another that doesn't normally generate assembly output, and it actually has some understanding of the assembly, but only to a limited extent. It only understands things like which registers the code uses, so the compiler can do things like preserve registers used by the inline assembly. It doesn't know what the assembly code actually does.

If you want the compiler to optimize your code, don't use inline assembly. Even if there isn't a language feature that directly corresponds to the assembly code you want, the compiler may be able to generate it anyway, as Mike Nakis suggested in a comment about ROTL. You can also use intrinsics, functions that extend the language and correspond to various assembly instructions, which compilers are capable of optimizing in many cases.

Ross Ridge