How to traverse a string in assembly until I reach null? (strlen loop)

Question

Right now I'm just figuring out how to even traverse a string. If the code doesn't make sense, it's because I've interpreted some information wrong. At the worst, I don't really know what I'm doing.

strlen:

pushq %rbx
movq %rsi, %rbx


loop:
    cmp $0x00, (%rdi, %rbx)
    je end
    inc %rbx
    jmp loop

end:
    movq %rbx, %rax
    popq %rbx
    ret

PS: There's a reason my title looks like an old man 2nd time on his computer trying to search "how to go to google.com" Superrrr noob here trying to learn a little bit of assembly. I'm trying to implement strlen function for myself.

Check this out it will help you understand https://stackoverflow.com/a/40647017/10927635 — vishal, Mar 02 '20 at 04:59
cmp with an immediate operand requires the immediate to be the first (source) operand. Also, you can just use RDI as the pointer in the loop, or pick some call-clobbered register so you don't have to save/restore RBX. — Peter Cordes, Mar 02 '20 at 05:06
So is using %rbx correct in this case? What about using %ebx? I know it's just a convention in a sense but just for clarity. — Block o Butter, Mar 02 '20 at 05:11
[Why does rax and rdi work the same in this situation?](https://stackoverflow.com/q/54867399) has a working strlen in NASM syntax. It's not "magic", just increment a pointer. And yes, RBX is fine with `inc %rbx`, but using RAX would let you do the same thing without without having to save/restore RBX. And no, you don't want to use 32-bit registers for pointers. — Peter Cordes, Mar 02 '20 at 05:27
When I inc %rbx, am I incrementing by 1 byte? So therefore if I do cmp $0x00, (%rdi, % rbx), it should compare the pointer to null, correct? From what I understand, in att (%rdi, %rbx) is the same thing as (rdi + rbx). — Block o Butter, Mar 02 '20 at 06:01
RDI and RBX both hold pointers. Adding them together doesn't make a valid address! I didn't find any SO Q&As with a strlen implementation that didn't suck in some major way (e.g. 2 branches inside the loop like this one) so I wrote an answer. — Peter Cordes, Mar 02 '20 at 06:13

Peter Cordes · Accepted Answer · 2020-03-02T07:42:44.157

You just inc %rbx to increment the pointer value. (%rbx) dereferences that register, using its value as a memory address. On x86, every byte has its own address (this property is called "byte addressable"), and addresses are just integers that fit in a register.

The characters in an ASCII string are all 1 byte wide so incrementing a pointer by 1 moves to the next character in an ASCII string. (This isn't true in the general case of UTF-8 with characters outside the 1..127 range of codepoints, but ASCII is a subset of UTF-8.)

Terminology: ASCII code 0 is called NUL (one L), not NULL. In C, NULL is a pointer concept. C-style implicit-length strings can be described as 0-terminated or NUL-terminated, but "null-terminated" is misusing the terminology.

You should pick a different register (one that's call-clobbered) so you don't need to push/pop it around your function. Your code doesn't make any function calls, so there's no need to keep your induction variable in a call-preserved register.

I didn't find a good simple example in other SO Q&As. They either have 2 branches inside the loop (including one unconditional jmp) like the one I linked in comments, or they waste instructions incrementing a pointer and a counter. Using an indexed addressing mode inside the loop is not terrible, but is less efficient on some CPUs so I'd still recommend doing a pointer increment -> subtract end-start after the loop.

This is how I'd write a minimal strlen that only checks 1 byte at a time (slow and simple). I kept the loop itself small, and this is IMO a reasonable example of a good way of writing loops in general. Often keeping your code compact makes it easier to understand a function in asm. (Give it a different name than strlen so you can test it without needing gcc -fno-builtin-strlen or whatever.)

.globl simple_strlen
simple_strlen:
    lea     -1(%rdi), %rax     # p = start-1 to counteract the first inc
 .Lloop:                       # do {
    inc     %rax                  # ++p
    cmpb    $0, (%rax)
    jne     .Lloop             # }while(*p != 0);
                           # RAX points at the terminating 0 byte = one-past-end of the real data
    sub     %rdi, %rax     # return length = end - start
    ret

The return value of strlen is the array index of the 0 byte = length of the data not including the terminator.

If you were inlining this manually (because it's just a 3-instruction loop), you'd often just want a pointer to the 0 terminator so you wouldn't bother with the sub crap, just use RAX at the end of the loop.

Avoiding the offsetting LEA/INC instructions before the first load (which cost 2 cycles of latency before the first cmp) could be done by peeling the first iteration, or with a jmp to enter the loop at the cmp/jne, after the inc. Why are loops always compiled into "do...while" style (tail jump)?.

Incrementing the pointer with LEA between cmp/jcc (like cmp ; lea 1(%rax), %rax ; jne) could be worse because it defeats macro-fusion of cmp/jcc into a single uop. (Actually, macro-fusion of cmp $imm, (%reg) / jcc doesn't happen on Intel CPUs like Skylake anyway. cmp micro-fuses the memory operand, though. Maybe AMD fuses the cmp/jcc.) Also, you'd leave the loop with RAX 1 higher than you want.

So it would be just as efficient (on Intel Sandybridge-family) to movzx (aka movzbl) load and zero-extend a byte into %ecx, and test %ecx, %ecx / jnz as the loop condition. But larger code-size.

Most CPUs will run my loop at 1 iteration per clock cycle. We could maybe get close to 2 bytes per cycle (while still only checking each byte separately) with some loop unrolling.

Checking 1 byte at a time is about 16x slower for large strings than we could go with SSE2. If you aren't aiming for minimal code size and simplicity, see Why is this code 6.5x slower with optimizations enabled? for a simple SSE2 strlen that uses an XMM register. SSE2 is baseline for x86-64 so you should always use it when it gives a speedup, for stuff that's worth writing by hand in asm.

Re: your updated question with a buggy port of the implementation from Why does rax and rdi work the same in this situation?

RDI and RBX both hold pointers. Adding them together doesn't make a valid address! In the code you were trying to port, RCX (the index) is initialized to zero before the loop. But instead of xor %ebx, %ebx, you did mov %rdi, %rbx. Use a debugger to examine register values while you single-step your code.

Thank you! Your explanation made sense. – Block o Butter Mar 02 '20 at 17:01 — Block o Butter, Mar 02 '20 at 17:01

How to traverse a string in assembly until I reach null? (strlen loop)

1 Answers1

Linked

Related