Why my own memcpy written on NASM can not copy more than 340000000 bytes?

Question

I am learning NASM. I have written a simple function that copies memory from the source to the destination. I test it in C.

            section .text
            global _myMemcpy

_myMemcpy:
            mov eax, [esp + 4]
            mov ecx, [esp + 8]
            add [esp + 12], eax
            lp:
                   mov dl, [ecx]
                   mov [eax], dl

                   inc eax
                   inc ecx
                   cmp eax, [esp + 12]
                   jl lp
            endlp:
                    mov eax, [esp + 4]
                    ret

And the C program:

#include <string.h>
#define Times 340000000
extern void* _myMemcpy(void* dest, void* src, size_t size);
char sr[Times];
char ds[Times];
int main(void)
{
    memset(sr, 'a', Times);
    _myMemcpy(ds, sr, Times);
    return 0;
}

I am currently using Ubuntu. When I compile and link the two files with $ nasm -f elf m.asm && gcc -Wall -m32 m.o p.c && ./a.out it works fine when the value of Times is less than 340000000. When it is greater, _myMemcpy copies only the first byte of the source to the destination. I can't figure out where the problem is. Any suggestion will be useful.

You're doing signed compares on pointers; with huge arrays like `size = 0x1443fd00`, one of them will span the 2GiB boundary (signed wraparound) unless the linker took special care to put one in the high half and the other in the low half. But it doesn't, it will make .bss contiguous. — Peter Cordes, Apr 29 '21 at 09:14
@PaulHankin: This is 32-bit code; you can tell by the stack args and by not segfaulting when using 32-bit pointers as addresses. Times is a macro: `340000000` is `0x1443fd00`. As a C literal, it fits in a 32-bit `int` so it has type `int` on any C implementation with int being at least 32 bits. — Peter Cordes, Apr 29 '21 at 09:37
BTW, this calling convention working, and building with `nasm -felf32` (not elfx32), proves that you're not on [x32](//en.wikipedia.org/wiki/X32_ABI) which you've said in earlier questions. x32 (ILP32 in 64-bit mode) would pass the first 3 args in EDI, ESI, and EDX. This is plain old 32-bit x86, with stack args, i386 System V ABI. Please get the name of the architecture right. You can call it x86, x86-32 if you want, or even IA-32 to make it explicit that you mean 32-bit mode. ([The most correct way to refer to 32-bit and 64-bit versions of programs](//stackoverflow.com/q/53364320)) — Peter Cordes, Apr 29 '21 at 13:21

Peter Cordes · Accepted Answer · 2021-04-30T14:11:45.337

You're doing signed compares on pointers; don't do that. Use jne in this case since you will always reach exact equality at the exit point.

Or if you want relational compares with pointers, usually unsigned conditions like jb and jae make the most sense. (It's normal to think of virtual address space as a flat linear 4GiB with the lowest address being 0, so you need increments across the middle of that range to work).
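To make the difference between the three conditions concrete, here is a minimal C sketch that models what `jl`, `jb` and `jne` test after a `cmp` of two 32-bit address values. The function names are hypothetical, and the cast to `int32_t` assumes the usual two's-complement conversion that GCC and Clang provide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Models `cmp a, b` followed by `jl`: signed 32-bit "less than".
 * The casts assume two's-complement conversion (true for GCC and Clang). */
static bool less_signed(uint32_t a, uint32_t b)
{
    return (int32_t)a < (int32_t)b;
}

/* Models `cmp a, b` followed by `jb`: unsigned 32-bit "below". */
static bool below_unsigned(uint32_t a, uint32_t b)
{
    return a < b;
}

/* Models `cmp a, b` followed by `jne`: inequality, independent of signedness. */
static bool not_equal(uint32_t a, uint32_t b)
{
    return a != b;
}
```

For example, with the made-up addresses `0x7ffffff0` (current pointer) and `0x80000010` (end pointer), `less_signed` returns false because the end pointer is negative as a signed integer, while `below_unsigned` and `not_equal` both return true.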

With arrays larger than your ~300MiB size, and the default linker script for PIE executables, apparently one of them will span the 2GiB boundary between signed-positive and signed-negative¹. So the end-pointer you calculate will be "negative" if you treat it as a signed integer. (Unlike on x86-64, where the non-canonical "hole" spanning the middle of virtual address-space means that an array can never span the signed-wraparound boundary: Should pointer comparisons be signed or unsigned in 64-bit x86? - sometimes it does make sense to use signed compares there.)

You should see this with a debugger if you single-step and look at the pointer values, and at the memory value you create with size += dest (add [esp + 12], eax). As a signed operation, that overflows to create a negative end_pointer, while the start pointer is still positive. pos < neg is false on the first iteration, so your loop exits after copying a single byte.
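If you don't have a debugger handy, the same effect can be reproduced with a small C model of the loop. This is only a sketch: `bytes_copied` and the `exit_test` enum are hypothetical names, addresses are plain `uint32_t` values rather than real pointers, and no memory is actually copied.

```c
#include <stdint.h>

/* Which conditional jump closes the loop (hypothetical names). */
enum exit_test { TEST_JL, TEST_JB, TEST_JNE };

/* Counts how many bytes the question's loop would copy for a given
 * destination address and size, under each loop-exit condition. */
static uint32_t bytes_copied(uint32_t dest, uint32_t size, enum exit_test test)
{
    uint32_t cur = dest;
    uint32_t end = dest + size;   /* add [esp + 12], eax (wraps mod 2^32) */
    uint32_t copied = 0;
    int again;

    do {
        copied++;                 /* mov dl, [ecx] / mov [eax], dl */
        cur++;                    /* inc eax */
        switch (test) {
        case TEST_JL:  again = (int32_t)cur < (int32_t)end; break; /* signed   */
        case TEST_JB:  again = cur < end;                   break; /* unsigned */
        default:       again = cur != end;                  break; /* jne      */
        }
    } while (again);

    return copied;
}
```

With a made-up destination of `0x7ffffff0` and a size of 32, the signed variant stops after one byte, while the unsigned and `jne` variants copy all 32. Like the original loop, the model tests the condition after the copy, so it assumes the size is at least 1.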

Footnote 1: On my system, under GDB (which disables ASLR), after running start to get the executable mapped to Linux's default base address for PIEs (2/3 of the way into the low half of the address space, i.e. 0x5555...), I checked the addresses with your test case:

sr at 0x56559040
ds at 0x6a998d40
end of ds at p /x sizeof(ds) + ds = 0x7edd8a40

So if the arrays were much bigger, the end of ds would cross 0x80000000. That's why 340000000 avoids your bug but larger sizes reveal it.
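The arithmetic can be sketched in C. This is a rough estimate only: it assumes sr stays at the address shown above, that ds always starts exactly n bytes after sr (as it does in this run, since 0x6a998d40 - 0x56559040 = 0x1443fd00), and that the linker adds no alignment padding between them. The names `SR_BASE`, `end_of_ds` and `largest_safe_size` are hypothetical.

```c
#include <stdint.h>

#define SIGN_BOUNDARY 0x80000000u  /* first address that is negative as int32 */
#define SR_BASE       0x56559040u  /* address of sr from the GDB session above */

/* One-past-the-end address of ds for arrays of n bytes each,
 * assuming ds starts exactly n bytes after sr (no padding). */
static uint32_t end_of_ds(uint32_t n)
{
    return SR_BASE + 2u * n;
}

/* Largest n for which end_of_ds(n) is still below SIGN_BOUNDARY,
 * i.e. the end pointer is still positive as a signed integer. */
static uint32_t largest_safe_size(void)
{
    return (SIGN_BOUNDARY - 1u - SR_BASE) / 2u;
}
```

Under those assumptions, `end_of_ds(340000000)` reproduces the 0x7edd8a40 seen in GDB, and the end pointer first reaches 0x80000000 at a size of roughly 349.5 million bytes, which is consistent with 340000000 working and somewhat larger sizes failing.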

BTW, under a 32-bit kernel, Linux defaults to a 3:1 split of address space between kernel and user-space, so even there it's possible for this to happen. But under a 64-bit kernel, 32-bit processes can have the entire 4 GiB address space to themselves. (Except for a page or two reserved by the kernel: see also Why can't I mmap(MAP_FIXED) the highest virtual page in a 32-bit Linux process on a 64-bit kernel?. That also means that forming a pointer to one-past-end of any array like you're doing (which ISO C promises is valid to do), won't wrap around and will still compare above a pointer into the object.)

This won't happen in 64-bit mode: there's enough address space to just divide it evenly between user and kernel, as well as there being a giant non-canonical hole between high and low ranges.

An excellent answer. But I'm confused by your claim that `0x1443fd00` is more than half of 2GiB. — TonyK, Apr 30 '21 at 13:51
@TonyK: Oh, yes, 0x14... isn't 1.4GiB. Derp. Also, the OP is saying this size *doesn't* have a problem, but it's about the largest they found that works properly. Not that *this* size is causing a problem. Replaced with correct explanation of why larger sizes push the end of the array past 2GiB. — Peter Cordes, Apr 30 '21 at 14:13
