The only bug remaining in the current version is loading 8 bytes of char data as the return value instead of just doing pointer math: using `mov` instead of `lea`. (After various edits removed and added different bugs, as reflected in different answers talking about different code.)
But this is over-complicated as well as inefficient (two loads, and indexed addressing modes, and of course extra instructions to set up RCX).
Just increment the pointer since that's what you want to return anyway.
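To illustrate just the `mov` vs. `lea` difference in isolation (a hedged sketch: the register assignments, string pointer in RDI and index in RCX, are inferred from the description above, not copied from the question):

```nasm
    ; bug: loads 8 bytes of string data from p + i into RAX
    mov    rax, [rdi + rcx]
    ; fix: computes the address p + i without dereferencing it
    lea    rax, [rdi + rcx]
```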
If you're going to loop 1 byte at a time instead of using SSE2 to check 16 bytes at once, `strchr` can be as simple as:
```nasm
;; BITS 64 is useless unless you're writing a kernel with a mix of 32 and 64-bit code;
;; otherwise it only lets you shoot yourself in the foot by putting 64-bit machine code
;; in a 32-bit object file by accident.

global mystrchr
mystrchr:
.loop:                            ; do {
    movzx  ecx, byte [rdi]        ;     c = *p;
    cmp    cl, sil                ;     if (c == needle) return p;
    je     .found
    inc    rdi                    ;     p++;
    test   cl, cl
    jnz    .loop                  ; } while (c != 0);
;; fell out of the loop on hitting the 0 terminator without finding a match
    xor    edi, edi               ; p = NULL
    ; optionally an extra ret here, or just fall through
.found:
    mov    rax, rdi               ; return p
    ret
```
I checked for a match before end-of-string so I'd still have the un-incremented pointer, and not have to decrement it in the "found" return path. If I started the loop with `inc`, I could use an `[rdi - 1]` addressing mode, still avoiding a separate counter. That's why I switched up the order of which branch was at the bottom of the loop vs. your code in the question.
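For comparison, a hedged sketch of that other ordering (increment first, look back with a displacement); the found path then has to undo the increment with an LEA. The `mystrchr_v2` name is just for this sketch:

```nasm
global mystrchr_v2
mystrchr_v2:
.loop:                            ; do {
    inc    rdi                    ;     p++;
    movzx  ecx, byte [rdi - 1]    ;     c = p[-1];
    cmp    cl, sil                ;     if (c == needle) goto found;
    je     .found
    test   cl, cl
    jnz    .loop                  ; } while (c != 0);
    xor    eax, eax               ; not found: return NULL
    ret
.found:
    lea    rax, [rdi - 1]         ; return the un-incremented pointer
    ret
```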
Since we want to compare the character twice, against SIL and against zero, I loaded it into a register. This might not run any faster on modern x86-64, which can do 2 loads per clock as well as 2 branches (as long as at most one of them is taken).

Some Intel CPUs can micro-fuse and macro-fuse `cmp reg, mem` / `jcc` into a single load+compare-and-branch uop for the front-end, at least when the memory addressing mode is simple, not indexed. But not `cmp [mem], imm` / `jcc`, so we're not costing any extra uops for the front-end on Intel CPUs by separately loading into a register. (With `movzx` to avoid a false dependency from writing a partial register like `mov cl, [rdi]` would.)
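As a pure illustration of those two forms (not from the answer's code; the `fusion_demo` label and register values are placeholders so the fragment assembles):

```nasm
global fusion_demo
fusion_demo:
    ; cmp reg, mem / jcc: the load can micro-fuse, and the cmp/jcc pair can
    ; macro-fuse (on CPUs that do it, with a simple non-indexed addressing mode)
    cmp    cl, byte [rdi]
    je     .hit

    ; cmp mem, imm / jcc: Intel won't macro-fuse a cmp that has both a memory
    ; operand and an immediate with the following jcc
    cmp    byte [rdi], 0
    je     .hit
.hit:
    ret
```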
Note that if your caller is also written in assembly, it's easy to return multiple values, e.g. a status and a pointer (in the not-found case, perhaps a pointer to the terminating 0 would be useful). Many C standard library string functions are badly designed, notably `strcpy`, in that they don't help the caller avoid redoing length-finding work.
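For example, a hypothetical asm-only convention (a sketch, not any standard ABI) could return a found/not-found status in EDX alongside a pointer in RAX that's always meaningful: the match on success, the terminating 0 on failure, so the caller gets the string length for free in the not-found case.

```nasm
; Hypothetical tail of an asm-only strchr variant returning two values:
; RAX = pointer (the match, or the terminating 0 if there was no match),
; EDX = status (1 = found, 0 = not found).
found_ret:
    mov    rax, rdi               ; pointer to the matching character
    mov    edx, 1                 ; status: found
    ret
notfound_ret:
    mov    rax, rdi               ; pointer to the terminating 0: still useful
    xor    edx, edx               ; status: not found
    ret
```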
Especially on modern CPUs with SIMD, explicit lengths are quite useful to have: a real-world `strchr` implementation would check alignment, or check that the given pointer isn't within 16 bytes of the end of a page. But `memchr` doesn't have to, if the size is >= 16: it could just do a `movdqu` load and `pcmpeqb`.
See *Is it safe to read past the end of a buffer within the same page on x86 and x64?* for details and a link to glibc `strlen`'s hand-written asm. Also *Find the first instance of a character using simd* for real-world implementations like glibc's using `pcmpeqb` / `pmovmskb`. (And maybe `pminub` for the 0-terminator check to unroll over multiple vectors.)
SSE2 can go about 16x faster than the code in this answer for non-tiny strings. For very large strings, you might hit a memory bottleneck and "only" be about 8x faster.
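To make the `pcmpeqb` / `pmovmskb` idea concrete, here's a minimal sketch of just the aligned main loop, with the hard parts assumed away: RDI already 16-byte aligned, XMM1 pre-broadcast with the needle byte, XMM2 zeroed. A real implementation also needs the startup code for unaligned starts and page-end safety that those links discuss.

```nasm
; Sketch only: assumes RDI is 16-byte aligned, XMM1 = needle broadcast to all
; 16 byte positions, XMM2 = all-zero, needle also still in SIL.
sse2_strchr_core:
.vecloop:
    movdqa   xmm0, [rdi]          ; load 16 bytes
    movdqa   xmm3, xmm0
    pcmpeqb  xmm0, xmm1           ; 0xFF in bytes equal to the needle
    pcmpeqb  xmm3, xmm2           ; 0xFF in bytes equal to the 0 terminator
    por      xmm3, xmm0           ; needle or terminator anywhere in the block?
    add      rdi, 16
    pmovmskb eax, xmm3            ; one mask bit per byte
    test     eax, eax
    jz       .vecloop             ; keep scanning while neither was present
    bsf      eax, eax             ; index of the first needle-or-terminator byte
    lea      rax, [rdi + rax - 16]  ; pointer to that byte
    cmp      byte [rax], sil      ; was it the needle? (also handles needle == 0)
    je       .found
    xor      eax, eax             ; terminator came first: return NULL
.found:
    ret
```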