How to implement memmove, not just memcpy, in assembly?

Question

I was trying to implement memmove from C into x86 assembly so I wrote:

start:
    movb source(%eax), %cl
    movb %cl, destination(%eax)
    inc %eax
    cmp num, %eax
    jne start

end:

But this is wrong, why? according to: http://www.cplusplus.com/reference/cstring/memmove/

Copies the values of num bytes from the location pointed by source to the memory block pointed by destination. Copying takes place as if an intermediate buffer were used, allowing the destination and source to overlap.

which my code doesn't support.

How can I fix this without using stack? Note: we can assume that immediately after source destination comes in memory and that num (number of bytes to copy) is far and can't be touched by wrong.

Well, since you specifically say you can assume destination is after the source, you can simply copy backwards. PS: it's unclear what your rotate is for at the beginning. — Jester, Apr 22 '21 at 00:47
When they overlap, you copy either backwards (starting from the end and decrementing) or forwards, depending on which way they overlap. — Nate Eldredge, Apr 22 '21 at 00:47
There's a sample implementation in C at https://stackoverflow.com/questions/19606399/regarding-implementation-of-memmove?noredirect=1&lq=1 which you could try to adapt. If your system has [enhanced `rep movsb`](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy) then you can use it, with the [direction flag](https://stackoverflow.com/questions/9636691/what-are-cld-and-std-for-in-x86-assembly-language-what-does-df-do?noredirect=1&lq=1) to control forwards vs backwards. — Nate Eldredge, Apr 22 '21 at 00:52
@NateEldredge: `rep movsb` is horribly slow with DF=1, including on CPUs with ERMSB. You can of course still use it for correctness, although it might be about half the speed (~2 bytes per cycle on Skylake) of this naive byte-at-a-time loop that reload `num` from memory every iteration, and loads by merging a new byte into the bottom of ECX (instead of using movzbl). (`cmp num, %eax` / jcc does IIRC still micro- and macro-fuse into a 1-uop cmp/jcc on Haswell/Skylake at least, since it's using an absolute addressing mode). — Peter Cordes, Apr 22 '21 at 01:20
@PeterCordes IIRC, Linus [Torvalds] was lobbying [in `glibc`] for `memcpy` to add a test for reversed copy and do a `jmp memmove` [or equiv, have `memcpy` just be `memmove` as _it_ does the compare]. Is the rationale to _not_ do this just that it saves a test and not-taken branch? — Craig Estey, Apr 22 '21 at 01:54
@CraigEstey: Copying forwards can be somewhat more efficient for the HW prefetchers on some hardware so you generally want to copy forwards whenever the C rules let you. And you want fewer conditional branches before you start doing useful work. Of course, if enough programs misuse `memcpy` and depend on it working like `memmove`, then having memcpy do what they don't actually want just because it's allowed to by the language standard isn't an ideal situation. Being more forgiving of "bugs" can cost performance. (Upside is maybe smaller code footprint without a separate memcpy, though.) — Peter Cordes, Apr 22 '21 at 02:05

Marco Bonelli · Answer 1 · 2021-04-22T12:15:46.690

4

The problem here is about potential overlapping in case destination - source < size (that is, both source and destination point to the same chunk of data). When this happens, you are in a situation like this:

AAAAAAAAAAAAAABBBBBBBBBBBBBB_______________________
^             ^             ^             ^
source        destination   source        destination
                             + num        + num

If you start copying from source, you will overwrite part of what you are trying to copy (in this example you will overwrite the Bs with As), losing the original values, ending up with something like this:

AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA_________
^             ^             ^             ^
source        destination   source        destination
                             + num        + num

When in reality you want this:

AAAAAAAAAAAAAAAAAAAAAAAAAAAABBBBBBBBBBBBBB_________
^             ^             ^             ^
source        destination   source        destination
                             + num        + num

You can solve this issue by checking when destination - source < num, and copying in reverse in such case (starting with eax = num).

The corresponding assembly would be something like this:

    mov  $destination, %eax
    sub  $source, %eax             # dst-src distance in bytes
    cmp  num, %eax
    jb  backwards                  # if (dst-src < copy_size) goto backwards

forward:
    mov  $0, %eax
forward_loop:
    movb source(%eax), %cl
    movb %cl, destination(%eax)
    inc %eax
    cmp num, %eax
    jne forward_loop
    jmp end

backwards:
    movl  num, %eax        # start from i=length
backwards_loop:
    movb  source-1(%eax), %cl
    movb  %cl, destination-1(%eax)
    dec   %eax
    jnz   backwards_loop

end:

edited Apr 22 '21 at 12:15

answered Apr 22 '21 at 00:52

Marco Bonelli

63,369
21
118
128

In `backwards`, I think you want `dec %eax` / `jnz`, without the `cmp`. Or `jge` if you need to run the loop body after EAX becomes `0`. (But actually I think you need `dec` *before* the copy, because array+size is one past the end. Or better, once before the loop so you can use macro-fused `dec / jge` as the loop branch. Or `source-1(%eax)` / `destination-1(%eax)`) – Peter Cordes Apr 22 '21 at 01:28
`test %eax,%eax` before `dec` is not useful: `jge` doesn't read CF, and `dec` writes the other FLAGS. I overwrote that change because I was already in the middle of fixing your AT&T syntax and commenting things. (e.g. `mov 0, %eax` was a load from absolute address `0`, not equivalent to `xor %eax,%eax`) – Peter Cordes Apr 22 '21 at 01:34
@PeterCordes thank you for your edit and comment. I'm from mobile so getting that code formatted correctly was already an achievement. We were editing together with me saving to see the result and conflicting between each other. Did not know you could do `source-1(%eax)`, not so pleasing to the eye, but oh well if it works it works. You can probably guess I'm not used to AT&T :') – Marco Bonelli Apr 22 '21 at 01:37
@PeterCordes Hi, 1) jne is the same as jnz right? 2) can you explain more my we need 2 cases I don't get it... – Apr 22 '21 at 01:57
@Marco: Yeah, the assembler can generate relocations with constant offsets from symbols for the linker to fill in, just like Intel `movzx ecx, byte [source - 1 + eax]`. Constants go outside the parens in AT&T syntax, so that's where you put them. This code is already pretty bad as far as code-review, e.g. just load `num` into a register once, like EDX which is unused. Another thing you could do is put the `dec` inside the loop ahead of the mov load/store, but that defeats macro-fusion on Intel CPUs. – Peter Cordes Apr 22 '21 at 01:58
@lion: yes, `jne` and `jnz` are synonyms for the same opcode, different human / semantic meanings. You need 2 cases so you always avoid writing an area of the source before you've read it. – Peter Cordes Apr 22 '21 at 01:59
@PeterCordes you forgot to do cmp in backwards_loop – Apr 22 '21 at 02:07
2

@lion: No, I didn't forget, I specifically removed `cmp` from Marco's answer because `dec` already sets ZF. See my first comment. `dec reg / jnz` is the idiomatic way to loop a counter down to zero. [What am I comparing to? No CMP instruction in assembly code before JNE](https://stackoverflow.com/q/48758930) – Peter Cordes Apr 22 '21 at 02:14
2

@lion I've added a more thorough explanation on why you need to copy in reverse, hope that helps. – Marco Bonelli Apr 22 '21 at 12:16

How to implement memmove, not just memcpy, in assembly?

1 Answers1