Don't ret
from inside an asm
block or statement, except with __attribute__((naked))
.
Don't make assumptions about [ebp+x]
holding certain C variables in non-naked inline asm.
@Dean Pucsek's answer (using [ebp+8]
, 12, 16 in a non-naked function) may happen to work with optimization disabled, but it breaks spectacularly1 in a normal -O2
build. (doStuff
inlines into its caller, which may have different args at [ebp + 8]
, 12, and 16). Unless you happen to have it in a separate compilation unit from any caller, and don't use -flto
.
You have three options, other than not using inline asm for such a trivial thing in the first place:
- Remove the
ret
and change it to use named C variables for the args, instead of making assumptions about this function not inlining. (You can also use __attribute__((noinline))
if you want it not to inline for some reason, but there's no benefit to hard-coding the calling convention, and no need if not in a naked
function).
- Move the asm to a separate
.s
file, and just declare a prototype in C or C++.
- Use
__attribute__((naked))
which is now supported for x86 by clang. (And GCC, but mainline GCC itself doesn't support -fasm-blocks
. You're using an Apple version of GCC, or actually Clang installed as gcc
for shell scripts / Makefiles like current MacOS does.)
Converting it to actual inline asm without naked
or noinline
will let it inline (unlike options 2 or 3) so it's a bit less inefficient, but it's still an asm block, not a GNU C asm statement like this (see the Godbolt link below for this in action inside a function)
asm("and %[flags], %[outval]" // AT&T syntax: op src, dst. clang always parses this way, gcc with -masm=intel treats inline asm as Intel-syntax
: [outval]"=r"(*(unsigned char(*)[4])result) // 4-element uchar array. Normally type-pun deref is strict-aliasing UB, but GCC documents this for asm.
: "0"(val) /*pick same register as output 0*/, [flags]"r" (flags) // reg, mem, or immediate source
: // no clobbers
);
It's a bit weird that your current asm stores 4 bytes to an unsigned char*
output (which is why I couldn't just use "=r"(*result)
), but I guess it's a char array, or actually pointing to an unaligned dword somewhere? I preserved that instead of telling the compiler we only want the low byte of the output.
That would let the compiler have both inputs in registers and handle the mov
instructions for you. See https://stackoverflow.com/tags/inline-assembly/info. But of course it's still an opaque inline asm statement that the optimizer can't see through or anything, so https://gcc.gnu.org/wiki/DontUseInlineAsm if you can possibly avoid it.
With an "rmi"
or "r,m,i"
constraint, actual GCC would be able to smartly choose an immediate or memory, but clang is dumb about this and always picks memory. Or for x,y,z
always picks the first option, which is why I put register first.
You wouldn't have this problem if using val &= flags;
memcpy(result, &val, sizeof(val));
- both GCC and clang could optimally do the and
with minimal wasted mov
instructions.
Still using MSVC-style asm-blocks
If you really want inefficient MSVC-style asm-blocks stuff that forces the compiler to have the inputs and outputs in memory, not registers, use
// without __attribute__((naked))
void
doStuff(unsigned long int val, unsigned long int flags, unsigned char *result)
{
__asm // don't push/pop inside the asm block: the compiler still sees all touched registers as clobbered and saves itself if necessary
{
mov eax, [val] // compiler will fill in [esp+x] or [ebp+y] or whatever for C var names
and eax, [flags] // memory-source AND is fine
mov ecx, [result] // load the pointer variable
mov [ecx], eax // deref it, storing 4 bytes to result[0..3]
}
// for non-void functions: beware MSVC supports falling off the end without a return statement, with a value left in EAX by an asm{} block. (Even respecting that EAX result after inlining this function into another).
// clang -fasm-blocks doesn't: it's undefined behaviour in C++, or in C if the caller uses it.
}
Godbolt with clang14.0 -O3 -m32 -fasm-blocks -Wall -fno-pie
- the stand-alone version is about as efficient as it could be, given the clunky stack-args calling convention and the asm{}
block which doesn't let it do any better.
doStuff(unsigned long, unsigned long, unsigned char*):
// start of inline asm
mov eax, dword ptr [esp + 4]
and eax, dword ptr [esp + 8]
mov ecx, dword ptr [esp + 12]
mov dword ptr [ecx], eax
// end of inline asm
ret
A test caller shows that it inlines safely (but inefficiently as you'd expect from an asm{}
block):
// test caller
unsigned char global_charbuf[4];
void foo(unsigned long int flags, unsigned char *result) {
// result unused, unless you edit to pass it instead
doStuff(1234, flags, global_charbuf); // safe after inlining: stores 1234 to the stack
}
foo(unsigned long, unsigned char*):
sub esp, 12 // space for uchar *result local var
mov eax, dword ptr [esp + 16] // foo's flags arg
mov dword ptr [esp + 8], 1234
mov dword ptr [esp + 4], eax // This copy seems unnecessary; asm should be able to simply reference foo's stack arg
mov dword ptr [esp], offset global_charbuf // It's not preserving their relative order.
// start of inline asm
mov eax, dword ptr [esp + 8]
and eax, dword ptr [esp + 4]
mov ecx, dword ptr [esp]
mov dword ptr [ecx], eax
// end of inline asm
add esp, 12
ret
The same test caller passing the same args to a wrapper function using GNU C asm("..." :outputs :inputs :clobbers)
compiles to not perfect but much less horrible asm. The the Godbolt link for source for that.
bar(unsigned int, unsigned char*)
mov eax, dword ptr [esp + 4] # compiler-generated loads of the "r" register inputs
mov ecx, 1234 # clang is incapable of getting inline asm to use an AND ecx, 1234 unless we *only* allow an immediate source. Or maybe use __builtin_constant_p() around multiple separate blocks.
// start of asm
and ecx, eax # =r output picked EAX, "r" input picked ECX
// end of asm
mov dword ptr [global_charbuf], ecx # Compiler-generated store of the "=r" output
ret
You shouldn't manually push
/pop
registers inside an asm{}
block, with either MSVC or clang -fasm-blocks
. They both parse your asm and figure out which registers are actually written, and treat the block as clobbering those registers. (And if necessary, have the containing function save/restore any of those registers that are call-preserved in the calling convention, in whatever function this eventually inlines into.)
Apparently some really early C++ implementations, like I think Borland Turbo C++, did not parse your asm, so you did need to push/pop manually to avoid stepping on the compiler's toes. But that sounds even more clunky and inefficient; fortunately those days are long passed.
Footnote 1: broken asm if you get this wrong
// FIXME: use __attribute__((naked)) and put the RET back in, with [ESP+x] addressing
/*__declspec(naked)*/ void
broken_doStuff(unsigned long int val, unsigned long int flags, unsigned char *result)
{
__asm
{
push ebx // keep one unnecessary push, just the one you'd need in an actual naked function to not violate the calling convention. EAX and ECX are call-clobbered. So is EDX, but not EBX.
mov eax, dword ptr[ebp + 8] //val
mov ebx, dword ptr[ebp + 12] //flags
mov ecx, dword ptr[ebp + 16] //result
and eax, ebx
mov [ecx], eax
pop ebx
}
}
void foo(unsigned long int flags, unsigned char *result) {
doStuff(1234, flags, result);
// broken: inlines the asm that assumes 3 args on the stack
}
Compiled on Godbolt with clang14.0 -m32 -O3 -fasm-blocks
. The stand-alone definition of doStuff
is fine; inefficient but works. The problem is when it inlines into foo
:
foo(unsigned long, unsigned char*):
push ebp // a push or pop in the inline asm makes clang use EBP as a frame pointer
mov ebp, esp
push ebx // generated by clang, since asm writes this call-preserved reg
// no mov dst, 1234 anywhere:
// clang doesn't see doStuff using its "val" arg
// Start of inline asm
push ebx
mov eax, dword ptr [ebp + 8] // wants val, actually loads foo's first arg, flags
mov ebx, dword ptr [ebp + 12]
mov ecx, dword ptr [ebp + 16] // these also access the wrong things
and eax, ebx
mov dword ptr [ecx], eax
pop ebx
// end of inline asm
pop ebx // compiler-generated epilogue
pop ebp // not leave, it assumes inline asm balanced the stack, and knows it didn't allocate any stack space itself.
ret
This will normally crash when it tries to store using a unsigned char *result
that it loaded garbage for, from somewhere in the caller's stack frame above foo
's stack args.
I left in the push/pop to demonstrate that the compiler already sees EBX being written and saves/restores it. If you take that out, it also doesn't set up EBP as a frame pointer, since it knows the stack isn't moving inside the asm statement so things like val
can expand to [esp+y]
instead of [ebp+x]
.
This also doesn't respect -mregparm=3
to use a register-arg calling convention. A regular asm{ mov eax, [val] }
would, although the compiler would still just spill val
to memory because the whole semantics of asm{}
blocks are designed around all inputs being in memory, no registers occupied with inputs from the start.