You just inc %rbx
to increment the pointer value. (%rbx)
dereferences that register, using its value as a memory address. On x86, every byte has its own address (this property is called "byte addressable"), and addresses are just integers that fit in a register.
The characters in an ASCII string are all 1 byte wide so incrementing a pointer by 1 moves to the next character in an ASCII string. (This isn't true in the general case of UTF-8 with characters outside the 1..127 range of codepoints, but ASCII is a subset of UTF-8.)
Terminology: ASCII code 0
is called NUL (one L), not NULL. In C, NULL is a pointer concept. C-style implicit-length strings can be described as 0-terminated or NUL-terminated, but "null-terminated" is misusing the terminology.
You should pick a different register (one that's call-clobbered) so you don't need to push/pop it around your function. Your code doesn't make any function calls, so there's no need to keep your induction variable in a call-preserved register.
I didn't find a good simple example in other SO Q&As. They either have 2 branches inside the loop (including one unconditional jmp) like the one I linked in comments, or they waste instructions incrementing a pointer and a counter. Using an indexed addressing mode inside the loop is not terrible, but is less efficient on some CPUs so I'd still recommend doing a pointer increment -> subtract end-start after the loop.
This is how I'd write a minimal strlen that only checks 1 byte at a time (slow and simple). I kept the loop itself small, and this is IMO a reasonable example of a good way of writing loops in general. Often keeping your code compact makes it easier to understand a function in asm. (Give it a different name than strlen
so you can test it without needing gcc -fno-builtin-strlen
or whatever.)
.globl simple_strlen
simple_strlen:
lea -1(%rdi), %rax # p = start-1 to counteract the first inc
.Lloop: # do {
inc %rax # ++p
cmpb $0, (%rax)
jne .Lloop # }while(*p != 0);
# RAX points at the terminating 0 byte = one-past-end of the real data
sub %rdi, %rax # return length = end - start
ret
The return value of strlen
is the array index of the 0
byte = length of the data not including the terminator.
If you were inlining this manually (because it's just a 3-instruction loop), you'd often just want a pointer to the 0 terminator so you wouldn't bother with the sub crap, just use RAX at the end of the loop.
Avoiding the offsetting LEA/INC instructions before the first load (which cost 2 cycles of latency before the first cmp) could be done by peeling the first iteration, or with a jmp
to enter the loop at the cmp/jne, after the inc. Why are loops always compiled into "do...while" style (tail jump)?.
Incrementing the pointer with LEA between cmp/jcc (like cmp
; lea 1(%rax), %rax
; jne
) could be worse because it defeats macro-fusion of cmp/jcc into a single uop. (Actually, macro-fusion of cmp $imm, (%reg)
/ jcc doesn't happen on Intel CPUs like Skylake anyway. cmp
micro-fuses the memory operand, though. Maybe AMD fuses the cmp/jcc.) Also, you'd leave the loop with RAX 1 higher than you want.
So it would be just as efficient (on Intel Sandybridge-family) to movzx
(aka movzbl
) load and zero-extend a byte into %ecx
, and test %ecx, %ecx
/ jnz
as the loop condition. But larger code-size.
Most CPUs will run my loop at 1 iteration per clock cycle. We could maybe get close to 2 bytes per cycle (while still only checking each byte separately) with some loop unrolling.
Checking 1 byte at a time is about 16x slower for large strings than we could go with SSE2. If you aren't aiming for minimal code size and simplicity, see Why is this code 6.5x slower with optimizations enabled? for a simple SSE2 strlen that uses an XMM register. SSE2 is baseline for x86-64 so you should always use it when it gives a speedup, for stuff that's worth writing by hand in asm.
Re: your updated question with a buggy port of the implementation from Why does rax and rdi work the same in this situation?
RDI and RBX both hold pointers. Adding them together doesn't make a valid address! In the code you were trying to port, RCX (the index) is initialized to zero before the loop. But instead of xor %ebx, %ebx
, you did mov %rdi, %rbx
. Use a debugger to examine register values while you single-step your code.