0

i'm trying to learn some assembly (intel syntax, x86_64). I wrote this code to make a simple implementation of strcmp():

section .text
    global ft_strcmp

ft_strcmp:
    cmp byte [rsi], 0
    je  exit
    mov cl, byte [rsi]
    cmp byte [rdi], cl
    jne exit
    inc rsi
    inc rdi
    jmp ft_strcmp

exit:
    xor rax, rax
    mov al, [rdi]
    sub al, byte [rsi]
    ret

But trying it by calling ft_strcmp("Hello", "Hellooooo") return 145, whereas the real strcmp() return -1, and I cant seems to figure why. Am I wrong with my syntax, or is the way I try to do this ?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Naka
  • 3
  • 2
  • Except for the first check for the end of string (reaching a NUL byte), you're operating on qwords (8 bytes). You need to also operate on bytes when comparing and returning the difference. It's probably not returning 0 but you're likely only printing the lower 32-bits of the result. – user786653 Jan 25 '21 at 14:35
  • Thanks @user786653, this got me on a better path. I edited my question accordingly, but the result is not correct yet. I guess it's because one byte can store values from 0 to 255, and a negative value cause it to loop around. How can i preserve the sign for my return value while still working on 1 byte ? – Naka Jan 25 '21 at 15:29
  • 1
    You need to return a signed 32-bit result, so zero-extend both bytes to positive 32-bit integers, and then do a 32-bit subtract. `exit: movzx eax, byte [rdi] ; movzx ecx, byte [rsi] ; sub eax, ecx`. – Nate Eldredge Jan 25 '21 at 15:48
  • Thanks @NateEldredge, that was the solution i was looking for ! movzx was not an operand listed on the doc i used, but looking it up, it all make sens. I'm sorry, i'm new to stackoverflow, I dont know how to flag your comment as the right answer... – Naka Jan 25 '21 at 15:54

4 Answers4

2

strcmp is supposed to return a 32-bit int in eax which is positive or negative according to whether the first string is greater or less. By doing an 8-bit subtract, the upper 24 bits of eax remain zero, so that the result is positive when viewed as a signed integer.

You want to do a 32-bit subtract, so you need both bytes in 32-bit registers with their upper 24 bits zeroed. This is efficient to do with movzx:

exit:
    movzx eax, byte [rdi]
    movzx ecx, byte [rsi]
    sub eax, ecx
    ret

If you didn't know about movzx, you could zero the whole register and then load the low byte:

exit:
    xor eax, eax
    mov al, [rdi] ; 'byte' is unnecessary, operand size inferred from register al
    xor ecx, ecx
    mov cl, [rsi]
    sub eax, ecx
    ret

(As a side comment, zeroing instructions like xor rax, rax can be replaced by xor eax, eax which is smaller and has the same effect: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?)

Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
1

The straightforward implementation - using Compiler Explorer :

strcmp:
        xor     ecx, ecx
.L3:
        movzx   eax, BYTE PTR [rdi+rcx]
        movzx   edx, BYTE PTR [rsi+rcx]
        cmp     al, dl
        jne     .L2
        inc     rcx
        test    al, al
        jne     .L3
.L2:
        sub     eax, edx
        ret

Bytes are correctly zero extended so that the return value is the difference of unsigned characters with results promoted to an int.

As others have noted, for C library implementations where C strings may have arbitrary length, it's possible to use machine words or SSE instructions, provided the setup cost is amortized for some length threshold.

Brett Hale
  • 21,653
  • 2
  • 61
  • 90
  • Thanks. Since I'm just starting, i'll stick to this easier version, but SSE seems really interesting, I need to read more about it – Naka Jan 26 '21 at 16:23
0

You were right, the 2nd string nul check falls out of the compare operation, slick!

Consider the following:

Fewer memory references the better: Your code has 3 data memory references in the loop and 2 on exit (3N+2). This code has 2 data memory references in the loop and 0 on exit (2N).

Fewer instruction bytes the better: Your code has 37 bytes of loop code and 14 bytes of exit code. This code has 20 bytes of loop code and 3 bytes of exit code.

One trick is to use the return value register as one of the working registers. Another trick is to move the operands into registers and then do operations in the registers.

Compilations were done at https://defuse.ca/online-x86-assembler.htm#disassembly

Code was not tested.

ft_strcmp:
    movzx cl,[rsi]      #[2] get byte(1) from 1st string
    movzx al,[rdi]      #[2] get byte(2) from 2nd string
    test cl,cl          #[2] end of first string, is it nul?
    je exit             #[2] 
    cmp cl,al           #[2] compare byte(1) w/ byte(2)
    jne exit            #[2] differ?
    inc rsi             #[3] point to next byte
    inc rdi             #[3] ditto
    jmp ft_strcmp       #[2] test them
                        #[20] bytes of instructions in loop
exit:
    sub al, cl          #[2] generate return value
                        # if neither null; return difference of values
                        # if byte(1) is null and byte(2) is null; ret 0
                        # if byte(1) is null and byte(2) is not null; return positive
                        # if byte(1) is not null and byte(2) is null; return negative 
    ret                 #[1]
                        #[3] bytes of instructions in exit
    
    
ft_strcmp:
    cmp byte [rsi], 0   #[7]
    je  exit            #[2]
    mov cl, byte [rsi]  #[6]
    cmp byte [rdi], cl  #[6]
    jne exit            #[2]
    inc rsi             #[6]
    inc rdi             #[6]
    jmp ft_strcmp       #[2]
                        #[37] bytes of loop code
exit:
    xor rax, rax        #[2]
    mov al, [rdi]       #[5]
    sub al, byte [rsi]  #[6]
    ret                 #[1]
                        #[14] bytes of exit code
jfwfmt
  • 84
  • 7
  • 1
    `movzx cl,[rsi]` doesn't assemble; movzx's destination has to be at least 16 bits wide. You want 32-bit dword in this case `movzx ecx, byte [rsi]`. (You have to specify whether the source is a byte or word, or even dword). Also, you need to `sub eax, ecx` to take advantage of the zero-extending loads you did. – Peter Cordes Jan 25 '21 at 21:16
  • But yes, you want to only load once, and reuse those values for the return value. However, if you're going to talk about optimizing, don't forget [Why are loops always compiled into "do...while" style (tail jump)?](https://stackoverflow.com/q/47783926) - the loop should be "rotated" to put one of the JCCs at the bottom. – Peter Cordes Jan 25 '21 at 21:19
  • Of course it's possible to do *much* better than 1 byte at a time, using SSE2 `pcmpeqb` / `pmovmskb` to test for any mismatches or any zero byte, like glibc's hand-written implementation does. But properly optimizing the simple byte loop is still a useful learning exercise. – Peter Cordes Jan 25 '21 at 21:20
-2

You also need to check [rdi] for null. The second string may be shorter than the first.

jfwfmt
  • 84
  • 7