x64 assembly segfault rbx vs rcx

Question

I am trying to implement my own version of the strcpy function in x64 assembly on macOS. I ran into a SEGV that I don't understand.

Here's my assembly code.

section .text
    global _ft_strcpy

_ft_strcpy:
    mov rax, rdi

loop:
    mov rbx, [rsi]
    mov [rdi], rbx
    inc rdi
    inc rsi
    cmp [rsi] , byte 0
    jne loop

end:
    mov [rdi], byte 0
    ret

Here's my main.c used for testing.

#include <stdio.h>
#include <string.h>

int main(void)
{
    char src [11] = "Hello moto";
    char dest [11];
    ft_strcpy(dest, src);

    printf("|%p|\n", src);
    printf("|%s|\n", src);
    printf("|%p|\n", dest);
    printf("|%s|\n", dest); 
    return (0);
}

The output when compiled with -fsanitize=address:

=================================================================
==53615==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000097 (pc 0x0001087a2c5f bp 0x7ffee745d990 sp 0x7ffee745d880 T0)
==53615==The signal is caused by a READ memory access.
==53615==Hint: address points to the zero page.
    #0 0x1087a2c5e in main main.c:10
    #1 0x7fff796e33d4 in start (libdyld.dylib:x86_64+0x163d4)

==53615==Register values:
rax = 0x00007ffee745d8c0  rbx = 0x000000000000004f  rcx = 0x4f4d204f4c4c4500  rdx = 0x00001fffdce8bb10  
rdi = 0x00000001087a2e60  rsi = 0x00007ffee745d8aa  rbp = 0x00007ffee745d990  rsp = 0x00007ffee745d880  
 r8 = 0x00001fffdce8bb10   r9 = 0x00000001087a2e20  r10 = 0x0000000117f89c30  r11 = 0x00007ffddecbaa80  
r12 = 0x0000000000000000  r13 = 0x0000000000000000  r14 = 0x0000000000000000  r15 = 0x0000000000000000  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV main.c:10 in main

The segfault seems to occur after my call to ft_strcpy, on the first printf call. When I use the rcx register instead of rbx in my assembly code, the program works. I've looked up the difference between rcx and rbx (caller-saved vs. callee-saved), but I don't understand why it causes this problem. What am I missing?

Feel free to point out any bad practices, I'll take any advice here!

Thanks for reading.

_"If the callee wishes to use registers RBX, RBP, and R12–R15, it must restore their original values before returning control to the caller."_ – Michael Feb 10 '20 at 14:39
Oh... thank you! May I ask which resource this comes from? – yorncl Feb 10 '20 at 14:49
Wikipedia: https://en.wikipedia.org/wiki/X86_calling_conventions#System_V_AMD64_ABI – Michael Feb 10 '20 at 14:49
Linux and macOS both use the x86-64 System V ABI / calling convention. – Peter Cordes Feb 10 '20 at 21:17

score 1 · Answer 1 · answered Feb 10 '20 at 15:08

> Feel free to point out any bad practices, I'll take any advice here!

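The copy loop loads and stores 8 bytes at once (`mov rbx, [rsi]` and `mov [rdi], rbx` use a 64-bit register), but steps in 1-byte increments. Each iteration therefore reads and writes 7 bytes beyond the single byte it is meant to copy.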
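
To make the size mismatch concrete, here is a minimal C sketch that models the loop from the question next to a plain byte-wise copy. The function names `bytewise_copy` and `widestep_copy` and the `cap` parameter are hypothetical, introduced only for this illustration; `cap` exists so the overrun stays inside buffers the simulation owns.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte-wise copy: what strcpy is meant to do.
   Returns the number of bytes written, terminator included. */
size_t bytewise_copy(char *dst, const char *src)
{
    size_t i = 0;
    while ((dst[i] = src[i]) != '\0')
        i++;
    return i + 1;
}

/* Model of the question's loop: every step moves 8 bytes, then advances
   by 1. `cap` is the size of both simulation buffers (an assumption of
   this sketch). Returns how many destination bytes were touched. */
size_t widestep_copy(char *dst, const char *src, size_t cap)
{
    size_t i = 0, touched = 0;
    do {
        assert(i + 8 <= cap);          /* keep the simulated overrun in bounds */
        memcpy(dst + i, src + i, 8);   /* mov rbx, [rsi] / mov [rdi], rbx */
        touched = i + 8;
        i++;                           /* inc rdi / inc rsi */
    } while (src[i] != '\0');          /* cmp [rsi], byte 0 / jne loop */
    dst[i] = '\0';                     /* mov [rdi], byte 0 */
    return touched;
}
```

With "Hello moto" stored in oversized 32-byte buffers, `bytewise_copy` writes 11 bytes while `widestep_copy` touches 17, so with the 11-byte arrays from the question the last iterations would run 6 bytes past the end of both `src` and `dest`. The model also shows that the loop copies before it checks for the terminator, so an empty source string is not handled correctly.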

Maxim Egorushkin

Thanks! I was wondering: is accessing the lower bits of a register less costly than accessing the whole register? I'm sorry if this isn't clear, I'm kind of new to this. By the way, if you have any resources on the link between assembly and machine language, I'll take them! – yorncl Feb 10 '20 at 15:23
@yorncl I recommend [Optimizing subroutines in assembly language: An optimization guide for x86 platforms](https://www.agner.org/optimize/optimizing_assembly.pdf) by Agner Fog. It contains the information on encoding of instructions depending on the operand size and much more. – Maxim Egorushkin Feb 10 '20 at 15:29
Copying 8 bytes means you have buffer overflow by 7 bytes both in the source and destination. – Raymond Chen Feb 10 '20 at 15:34
@yorncl Accessing them for reading is no problem, but if you write to the lower 8, middle 8, or lower 16 bits of a register, an extra µop is needed to merge the written value with the remaining bits. This is not the case when writing to the lower 32 bits, as the upper 32 bits are zeroed out in that case. Thus, when loading individual bytes from RAM, consider using `movzx` for better performance. – fuz Feb 10 '20 at 16:55
@yorncl: [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502) and [How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent](https://stackoverflow.com/q/45660139). In your case, copy 8 bytes at a time until you have fewer than 8 left to go. Then do an overlapping final 8 bytes if the total size allows. If you do need to load single bytes, use `movzx` to avoid false dependencies and/or other partial-register-write shenanigans. – Peter Cordes Feb 10 '20 at 21:20
@yorncl: have a look at how glibc memcpy is implemented, with a potentially overlapping pair of loads/stores for small buffers (e.g. dwords for 4..7 bytes). See the comments in https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html – Peter Cordes Feb 10 '20 at 21:21
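
The "8 bytes at a time, then an overlapping final 8 bytes" idea from the comments above can be sketched in C as follows. This is only an illustration under stated assumptions, not how glibc does it: the function name `copy8_overlap` is hypothetical, the length `n` must already be known and be at least 8, and the two buffers must not overlap each other. A strcpy-style routine would have to find the length first.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes (n >= 8, non-overlapping buffers) in 8-byte chunks.
   The last chunk is aligned to the END of the buffer, so it may re-copy
   a few bytes already written, but it never goes past dst + n. */
void copy8_overlap(unsigned char *dst, const unsigned char *src, size_t n)
{
    assert(n >= 8);
    uint64_t chunk;
    size_t i = 0;

    /* Whole 8-byte chunks from the front. */
    for (; i + 8 <= n; i += 8) {
        memcpy(&chunk, src + i, 8);    /* 8-byte load  */
        memcpy(dst + i, &chunk, 8);    /* 8-byte store */
    }

    /* Final chunk ends exactly at byte n and covers any remainder. */
    memcpy(&chunk, src + n - 8, 8);
    memcpy(dst + n - 8, &chunk, 8);
}
```

The fixed-size `memcpy` calls are the portable C way to express an unaligned 8-byte load or store; compilers typically turn each into a single `mov`. The design choice is to accept a small amount of redundant copying in exchange for having no byte-by-byte tail loop.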
