All of those asm instructions need to be in the same asm statement if you want to be sure they're contiguous (without compiler-generated code between them), and you need to declare input / output / clobber operands or you will step on the compiler's registers.
You can't use lea or mov to/from a C variable name (except for global / static symbols which are actually defined in the compiler's asm output, but even then you usually shouldn't).
Instead of using mov instructions to set up inputs, ask the compiler to do it for you using input operand constraints. If the first or last instruction of a GNU C inline asm statement is a mov, usually that means you're doing it wrong and writing inefficient code.
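As a minimal sketch of letting constraints do the register setup (assuming x86 or x86-64 and GCC/clang with the default AT&T syntax; add_asm and add_asm_demo are hypothetical names, not part of the question's code):

```cpp
#include <cassert>

// Hypothetical helper: a += b on x86.
// "+r"(a) asks the compiler to have a in a register before the asm runs and
// to read the updated value back afterwards, so the template needs no mov.
static inline unsigned add_asm(unsigned a, unsigned b)
{
    asm("addl %[src], %[dst]"
        : [dst] "+r"(a)      // read-write operand, any general register
        : [src] "r"(b)       // input-only operand, any general register
        : "cc");             // add writes the flags
    return a;
}

// Usage sketch.
void add_asm_demo()
{
    assert(add_asm(2u, 3u) == 5u);
}
```

The compiler picks both registers, so after inlining it can often use wherever the values already live instead of copying them first.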
And BTW, GNU C++ allows C99-style variable-length arrays, so howmany is allowed to be non-const and even set in a way that doesn't optimize away to a constant. Any compiler that can compile GNU-style inline asm will also support variable-length arrays.
How to write your loop properly
If this looks over-complicated, then don't use inline asm: see https://gcc.gnu.org/wiki/DontUseInlineAsm. Write a stand-alone function in asm so you can just learn asm instead of also having to learn about gcc and its complex but powerful inline-asm interface. You basically have to know asm and understand compilers to use it correctly (with the right constraints to prevent breakage when optimization is enabled).
Note the use of named operands like %[ptr] instead of %2 or %%ebx. Letting the compiler choose which registers to use is normally a good thing, but for x86 there are letters other than "r" you can use, like "=a" for rax/eax/ax/al specifically. See https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html, and also other links in the inline-assembly tag wiki.
I also used buf_loop%=: to append a unique number to the label, so if the optimizer clones the function or inlines it multiple places, the file will still assemble.
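Here is a small sketch of why the unique label suffix matters, assuming x86 or x86-64 with GCC/clang and AT&T syntax. The names count_iterations and twice are hypothetical. The helper is likely to be inlined twice into twice() when optimizing, and each copy of the asm gets its own label number:

```cpp
#include <cassert>

// Hypothetical helper: runs a do { } while (--n != 0) loop in asm and
// returns how many iterations it took. n must be non-zero.
static inline unsigned count_iterations(unsigned n)
{
    assert(n != 0);               // dec/jnz would wrap around for n == 0
    unsigned iters = 0;
    asm("count_loop%=:        \n\t"   // %= expands to a number unique to this asm instance
        "  inc  %[iters]      \n\t"
        "  dec  %[n]          \n\t"
        "  jnz  count_loop%=  \n\t"   // the jump must use the same %= suffix as the label
        : [iters] "+r"(iters), [n] "+r"(n)
        :
        : "cc");
    return iters;
}

// Two call sites: if both are inlined, a fixed label name would be
// defined twice in the compiler's asm output and fail to assemble.
unsigned twice(unsigned a, unsigned b)
{
    return count_iterations(a) + count_iterations(b);
}
```

Both the label definition and every reference to it need the %= suffix; a bare jnz count_loop would refer to a symbol that is never defined.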
Source + compiler asm output on the Godbolt compiler explorer.
void ext(char *);
int foo(void)
{
    int howmany = 5046; // could be a function arg
    char buffer[howmany];
    //ext(buffer);
    const char *bufptr = buffer; // copy the pointer to a C var we can use as a read-write operand
    unsigned char result;
    asm("buf_loop%=: \n\t" // do {
        " movb (%[ptr]), %%al \n\t" // Copy buffer[x] to al
        " inc %[ptr] \n\t"
        " dec %[count] \n\t"
        " jnz buf_loop%= \n\t" // } while(--count != 0)
        : [res]"=a"(result) // al = write-only output
        , [count] "+r" (howmany) // input/output operand, any register
        , [ptr] "+r" (bufptr)
        : // no input-only operands
        : "memory" // we read memory that isn't an input operand, only pointed to by inputs
    );
    return result;
}
I used %%al as an example of how to write register names explicitly: Extended Asm (with operands) needs a double % to get a literal % in the asm output. You could also use %[res] or %0 and let the compiler substitute %al in its asm output. (And then you'd have no reason to use a specific-register constraint unless you wanted to take advantage of cbw or lodsb or something like that.) result is unsigned char, so the compiler will pick a byte register for it. If you want the low byte of a wider operand, you could use %b[count] for example.
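The %b modifier can be sketched like this, assuming x86 or x86-64 with GCC/clang and AT&T syntax. The function name add_to_low_byte is hypothetical, and the "q" constraint is used so the chosen register has a low-byte name in 32-bit mode too:

```cpp
#include <cassert>

// Hypothetical helper: adds `add` to the low byte of x only.
// %b[x] prints the byte-sized name of whatever register holds x (e.g. %al
// for %eax); [add] is already an unsigned char, so it gets a byte register
// name without a modifier.
static inline unsigned add_to_low_byte(unsigned x, unsigned char add)
{
    asm("addb %[add], %b[x]"
        : [x] "+q"(x)
        : [add] "q"(add)
        : "cc");
    return x;
}

// Usage sketch: the carry out of the low byte is not propagated upwards.
void add_to_low_byte_demo()
{
    assert(add_to_low_byte(0x1234u, 1) == 0x1235u);
    assert(add_to_low_byte(0x12ffu, 1) == 0x1200u);
}
```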
foo() uses a "memory" clobber, which is inefficient. You don't need the compiler to spill everything to memory, only to make sure that the contents of buffer[] in memory match the C abstract machine state. (This is not guaranteed by passing a pointer to it in a register.)
gcc7.2 -O3 output for foo():
    pushq %rbp
    movl $5046, %edx
    movq %rsp, %rbp
    subq $5056, %rsp
    movq %rsp, %rcx # compiler-emitted to satisfy our "+r" constraint for bufptr
    # start of the inline-asm block
buf_loop18:
    movb (%rcx), %al
    inc %rcx
    dec %edx
    jnz buf_loop18
    # end of the inline-asm block
    movzbl %al, %eax
    leave
    ret
Without a memory clobber or input constraint, leave appears before the inline asm block, releasing that stack memory before the inline asm uses the now-stale pointer. A signal-handler running at the wrong time would clobber it.
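A small sketch of what the "memory" clobber buys you, assuming x86 or x86-64 with GCC/clang and AT&T syntax (read_after_store is a hypothetical name). The asm only receives a pointer in a register, so the clobber is what tells the compiler that the pointed-to bytes are read:

```cpp
#include <cassert>

// Hypothetical helper: stores to a local array in C++, then reads the same
// byte back from inside the asm through a pointer passed in a register.
unsigned char read_after_store()
{
    char buf[4];
    buf[2] = 42;                  // the asm below depends on this store
    const char *p = buf;
    unsigned char out;
    asm("movb 2(%[p]), %[out]"
        : [out] "=q"(out)         // byte-addressable register
        : [p] "r"(p)              // only the pointer value is an operand
        : "memory");              // the asm may read memory it was not given as an operand
    return out;
}

// Usage sketch.
void read_after_store_demo()
{
    assert(read_after_store() == 42);
}
```

Without the clobber (or a memory input operand covering buf), the compiler would be free to treat the buf[2] store as dead.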
A more efficient way is to use a dummy memory operand which tells the compiler that the entire array is a read-only memory input to the asm statement. See get string length in inline GNU Assembler for more about this flexible-array-member trick for telling the compiler you read all of an array without specifying the length explicitly.
In C you can define a new type inside a cast, but you can't in C++, hence the using instead of a really complicated input operand.
int bar(unsigned howmany)
{
    //int howmany = 5046;
    char buffer[howmany];
    //ext(buffer);
    buffer[0] = 1;
    buffer[100] = 100; // test whether we got the input constraints right
    //using input_t = const struct {char a[howmany];}; // requires a constant size
    using flexarray_t = const struct {char a; char x[];};
    const char *dummy;
    unsigned char result;
    asm("buf_loop%=: \n\t" // do {
        " movb (%[ptr]), %%al \n\t" // Copy buffer[x] to al
        " inc %[ptr] \n\t"
        " dec %[count] \n\t"
        " jnz buf_loop%= \n\t" // } while(--count != 0)
        : [res]"=a"(result) // al = write-only output
        , [count] "+r" (howmany) // input/output operand, any register
        , "=r" (dummy) // output operand in the same register as buffer input, so we can modify the register
        : [ptr] "2" (buffer) // matching constraint for the dummy output
        , "m" (*(flexarray_t *) buffer) // whole buffer as an input operand
        //, "m" (*buffer) // just the first element: doesn't stop the buffer[100]=100 store from sinking past the inline asm, even if you used asm volatile
        : // no clobbers
    );
    buffer[100] = 101;
    return result;
}
I also used a matching constraint so buffer could be an input directly, and the output operand in the same register means we can modify that register. We got the same effect in foo() by using const char *bufptr = buffer; and then using a read-write constraint to tell the compiler that the new value of that C variable is what we leave in the register. Either way we leave a value in a dead C variable that goes out of scope without being read, but the matching constraint way can be useful for macros where you don't want to modify the value of your input (and don't need the type of your input: int dummy would work fine, too.)
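The matching-constraint pattern on its own might look like the following sketch, assuming x86 or x86-64 with GCC/clang and AT&T syntax. sum_down and scratch are hypothetical names. The input is const, so it can't be a "+r" operand, but the asm still needs to modify the register that holds it:

```cpp
#include <cassert>

// Hypothetical helper: returns n + (n-1) + ... + 1 using a countdown loop.
// n must be non-zero.
static inline unsigned sum_down(const unsigned n)
{
    assert(n != 0);               // dec/jnz would wrap around for n == 0
    unsigned sum = 0;
    unsigned scratch;             // dead output: only exists to own the register
    asm("sum_loop%=:         \n\t"
        "  add  %[i], %[sum] \n\t"
        "  dec  %[i]         \n\t"   // modifies the register that held n
        "  jnz  sum_loop%=   \n\t"
        : [sum] "+r"(sum)         // operand 0
        , "=r"(scratch)           // operand 1: tells the compiler this register is overwritten
        : [i] "1"(n)              // input placed in the same register as operand 1
        : "cc");
    return sum;
}

// Usage sketch.
void sum_down_demo()
{
    assert(sum_down(4u) == 10u);
}
```

The caller's value of n is untouched; only the compiler's temporary copy in the shared register is destroyed.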
The buffer[100] = 100; and buffer[100] = 101; assignments are there to show that they both appear in the asm, instead of being merged across the inline-asm (which does happen if you leave out the "m" input operand). IDK why the buffer[100] = 101; isn't optimized away; it's dead so it should be. Also note that asm volatile doesn't block this reordering, so it's not an alternative to a "memory" clobber or using the right constraints.
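When the buffer size is a compile-time constant, the dummy memory operand described above doesn't need a flexible array member. Here is a minimal sketch, assuming x86 or x86-64 with GCC/clang and AT&T syntax; four_bytes and third_byte are hypothetical names:

```cpp
#include <cassert>

// Hypothetical type whose size covers the whole 4-byte buffer.
struct four_bytes { char a[4]; };

// Reads buf[2] from inside the asm without a "memory" clobber: the unused
// "m" operand tells the compiler that all four bytes are an input.
unsigned char third_byte()
{
    char buf[4] = {10, 20, 30, 40};
    const char *p = buf;
    unsigned char out;
    asm("movb 2(%[p]), %[out]"
        : [out] "=q"(out)
        : [p] "r"(p)
        , "m"(*(const four_bytes *)p));   // dummy operand, never named in the template
    return out;
}

// Usage sketch.
void third_byte_demo()
{
    assert(third_byte() == 30);
}
```

Only the stores to buf have to be completed before the asm; unrelated values can stay in registers, which a "memory" clobber would not allow.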