The loop looks ok, but clunky. If your program is crashing, use a debugger. See the x86 for links and a quick intro to gdb for asm.
I think you're getting argv[1]
loaded correctly. (Note that this is the first command-line arg, though. argv[0]
is the command name.) https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames says ebp+12
is the usual spot for the 2nd arg to a 32bit functions that bother to set up stack frames.
Michael Petch commented on Simon's deleted answer that the asm_io library has print_int, print_string, print_char, and print_nl routines, among a few others. So presumably you a pointer to your buffer to one of those functions and call it a day. Or you could call sys_write(2)
directly with an int 0x80
instruction, since you don't need to do any string formatting and you already have the length.
Instead of incrementing separately for two arrays, you could use the same index for both, with an indexed addressing mode for the load.
;; main (int argc ([esp+4]), char **argv ([esp+8]))
... some code you didn't show that I guess pushes some more stuff on the stack
mov eax, dword [ebp+12] ; load argv
;; eax + 4 is &argv[1], the address of the 1st cmdline arg (not counting the command name)
mov esi, dword [eax + 4] ; esi holds 2nd arg, which is pointer to string
xor ecx, ecx
copystring:
mov al, [esi + ecx]
mov [X + ecx], al
inc ecx
test al, al
jnz copystring
I changed the comments to say "cmdline arg", to distinguish between those and "function arguments".
When it doesn't cost any extra instructions, use esi
for source pointers, edi
for dest pointers, for readability.
Check the ABI for which registers you can use without saving/restoring (eax, ecx, and edx at least. That might be all for 32bit x86.). Other registers have to be saved/restored if you want to use them. At least, if you're making functions that follow the usual ABI. In asm you can do what you like, as long as you don't tell a C compiler to call non-standard functions.
Also note the improvement in the end of the loop. A single jnz
to loop is more efficient than jz break / jmp
.
This should run at one cycle per byte on Intel, because test/jnz
macro-fuse into one uop. The load is one uop, and the store micro-fuses into one uop. inc
is also one uop. Intel CPUs since Core2 are 4-wide: 4 uops issued per clock.
Your original loop runs at half that speed. Since it's 6 uops, it takes 2 clock cycles to issue an iteration.
Another hacky way to do this would be to get the offset between X
and ebx
into another register, so one of the effective addresses could use a one-register addressing mode, even if the dest wasn't a static array.
mov [X + ebx + ecx], al
. (Where ecx = X - start_of_src_buf). But ideally you'd make the store the one that used a one-register addressing mode, unless the load was a memory operand to an ALU instruction that could micro-fuse it. Where the dest is a static buffer, this address-different hack isn't useful at all.
You can't use rep
string instructions (like rep movsb
) to implement strcpy for implicit-length strings (C null-terminated, rather than with a separately-stored length). Well you could, but only scanning the source twice: once for find the length, again to memcpy.
To go faster than one byte clock, you'd have to use vector instructions to test for the null byte at any of 16 positions in parallel. Google up an optimized strcpy implementation for example. Probably using pcmpeqb
against a vector register of all-zeros.