First of all, you should usually replace inline asm with intrinsics or pure C instead of porting it: https://gcc.gnu.org/wiki/DontUseInlineAsm
`clang -fasm-blocks` is mostly compatible with MSVC's inefficient inline asm syntax, but it doesn't support returning a value by leaving it in EAX and then falling off the end of a non-void function. So you have to write inline asm that puts the value in a named C variable and `return` that, typically leading to an extra store/reload that makes MSVC syntax even worse. (Pretty bad unless you're writing a whole loop in asm that amortizes the store/reload overhead of getting data into and out of the asm block.) See "What is the difference between 'asm', '__asm' and '__asm__'?" for a comparison of how inefficient MSVC inline asm is when wrapping a single instruction. It's less dumb inside functions with stack args when those functions don't inline, but that only happens if you're already making things inefficient (e.g. using legacy 32-bit calling conventions and not using link-time optimization to inline small functions).
MSVC can substitute `A` with an immediate `1` when inlining into a caller, but clang can't. Both defeat constant-propagation, but MSVC at least avoids bouncing constant inputs through a store/reload (as long as you only use it with instructions that can take an immediate source operand).
Clang accepts `__asm`, `asm`, or `__asm__` to introduce an asm block. MSVC accepts `__asm` (two underscores, like clang) or `_asm` (more commonly used, but clang doesn't accept it). So for existing MSVC code you probably want `#define _asm __asm` so your code can compile with both MSVC and clang, unless you need to make separate versions anyway. Or use `clang -D_asm=asm` to set a CPP macro on the command line.
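For example, a shared header could define the macro only when not compiling with MSVC (a sketch; `_MSC_VER` is MSVC's standard predefined macro):

```c
// Compatibility shim: MSVC treats _asm as a keyword, so only define the
// macro for other compilers, e.g. clang -fasm-blocks.
#ifndef _MSC_VER
#define _asm __asm
#endif
```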
Example: compile with MSVC, or with `clang -fasm-blocks`. (Don't forget to enable optimization: `clang -fasm-blocks -O3 -march=native -flto -Wall`. Omit or modify `-march=native` if you want a binary that can run on earlier or other CPUs than your compile host.)
```c
int a_global;

inline long foo(int A, int B, int *arr) {
    int out;
    // You can't assume A will be in RDI: after inlining it probably won't be
    __asm {
        mov ecx, A                      // comment syntax
        add dword ptr [a_global], 1
        mov out, ecx
    }
    return out;
}
```
Compiling with x86-64 Linux clang 8.0 on Godbolt shows that clang can inline the wrapper function containing the inline-asm, and how much store/reload MSVC syntax entails (vs. GNU C inline asm which can take inputs and outputs in registers).
I'm using clang in Intel-syntax asm output mode, but it also compiles Intel-syntax asm blocks when it's outputting in AT&T syntax mode. (Normally clang compiles straight to machine-code anyway, which it also does correctly.)
```asm
## The x86-64 System V ABI passes args in rdi, rsi, rdx, ...
# clang -O3 -fasm-blocks -Wall
foo(int, int, int*):
        mov     dword ptr [rsp - 4], edi      # compiler-generated store of register arg to the stack
        mov     ecx, dword ptr [rsp - 4]      # start of inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx      # end of inline asm
        movsxd  rax, dword ptr [rsp - 8]      # compiler-generated reload of `out`, with sign-extension to long (64-bit)
        ret
```
Notice how the compiler substituted `[rsp - 4]` and `[rsp - 8]` for the C local variables `A` and `out` in the asm source block, and that a variable in static storage gets RIP-relative addressing. GNU C inline asm doesn't do this; you need to declare `%[name]` operands and tell the compiler where to put them.
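For contrast, a minimal GNU-C-syntax sketch of the register-move part of `foo` (my illustration, not code from the question; AT&T syntax, which clang/gcc expect by default):

```c
// GNU C named operands: the compiler picks the registers, so there's no
// forced store/reload of A or out through the stack.
long foo_gnu(int A, int B, int *arr) {
    int out;
    (void)B; (void)arr;                  // unused here; kept for the same signature
    __asm__ ("mov %[a], %[out]"          // AT&T operand order: source, destination
             : [out] "=r" (out)          // output: any general-purpose register
             : [a]   "r"  (A));          // input: any general-purpose register
    return out;
}
```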
Back to the MSVC-style block: we can even see clang inline that function twice into one caller, and optimize away the sign-extension to 64-bit because the caller only returns `int`.
```c
int caller() {
    return foo(1, 2, nullptr) + foo(1, 2, nullptr);
}
```
```asm
caller():                                 # @caller()
        mov     dword ptr [rsp - 4], 1
        mov     ecx, dword ptr [rsp - 4]  # first inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx
        mov     eax, dword ptr [rsp - 8]  # compiler-generated reload
        mov     dword ptr [rsp - 4], 1    # and store of A=1 again
        mov     ecx, dword ptr [rsp - 4]  # second inline asm
        add     dword ptr [rip + a_global], 1
        mov     dword ptr [rsp - 8], ecx
        add     eax, dword ptr [rsp - 8]  # compiler-generated reload
        ret
```
So we can see that just reading `A` from inline asm creates a missed optimization: the compiler stores a `1` again even though the asm only read that input without modifying it.
I haven't done tests like assigning to or reading `a_global` before/between/after the asm statements to make sure the compiler "knows" that variable is modified by the asm statement.
I also haven't tested passing a pointer into an asm block and looping over the pointed-to array, to see if it's like a `"memory"` clobber in GNU C inline asm. I'd assume it is; a sketch of such a test is below.
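Here's an untested sketch of that experiment (my code, not from the question; I'm assuming asm-block pointer variables work like MSVC's, where `mov rax, arr` loads the pointer value):

```c
// If the asm block acts like a "memory" clobber, the compiler must reload
// arr[0] afterwards and this returns 2; if it constant-propagates the
// earlier store across the block, it would wrongly return 1.
int clobber_test(int *arr) {
    arr[0] = 1;                     // compiler-visible store
    __asm {
        mov rax, arr                // assumed: loads the pointer value (x86-64)
        add dword ptr [rax], 1      // arr[0] = 2, behind the compiler's back
    }
    return arr[0];
}
```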
My Godbolt link also includes an example of falling off the end of a non-void function with a value in EAX. That's supported by MSVC, but is UB as usual for clang, and breaks when inlining into a caller (strangely with no warning, even at `-Wall`). You can see how x86 MSVC compiles it on that same Godbolt link.
Porting MSVC asm to GNU C inline asm is almost certainly the wrong choice. Compiler support for optimizing intrinsics is very good, so you can usually get the compiler to generate good-quality efficient asm for you.
If you're going to do anything to existing hand-written asm, usually replacing it with pure C will be the most efficient, and certainly the most future-proof, path forward. Code that can auto-vectorize to wider vectors in the future is always good. But if you do need to manually vectorize for some tricky shuffling, then intrinsics are the way to go, unless the compiler makes a mess of it somehow.
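A hedged sketch of the intrinsics style (my illustration, not code from the question): adding two `int` arrays four elements at a time with SSE2. The compiler handles register allocation and can inline and unroll this, unlike an asm block.

```c
#include <emmintrin.h>   // SSE2 intrinsics
#include <stddef.h>

void add_arrays(int *dst, const int *a, const int *b, size_t n) {
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {     // 4 ints per 128-bit vector
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; i++)               // scalar cleanup for the tail
        dst[i] = a[i] + b[i];
}
```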
Look at the compiler-generated asm you get from intrinsics to make sure it's as good or better than the original.
If you're using MMX (and its `EMMS` instruction), now is probably a good time to replace your MMX code with SSE2 intrinsics. SSE2 is baseline for x86-64, and few Linux systems are running obsolete 32-bit kernels.
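The translation is usually mechanical (an assumed illustration, not from the question): `__m64` becomes `__m128i`, `_pi`/`_pu` intrinsics become their `_epi`/`_epu` equivalents, and `_mm_empty()` / `EMMS` disappears because SSE2 uses XMM registers instead of aliasing the x87 register file.

```c
#include <emmintrin.h>   // SSE2

// MMX version: __m64 c = _mm_add_pi16(a, b);  then _mm_empty() before x87 FP.
// SSE2 equivalent: twice the width, and no EMMS needed.
__m128i add_words(__m128i a, __m128i b) {
    return _mm_add_epi16(a, b);   // 8x 16-bit adds instead of MMX's 4
}
```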