5

I tried to create my memcpy code with rep movsb instruction. It works perfectly with any size when the optimization is disabled. But, when I enable optimization, it does not work as expected.

Questions

  1. How to prevent gcc optimization breaking rep movsb code?
  2. Is there something wrong with my code so it leads to undefined behavior?

Motivation to create my own memcpy:

I read about enhanced movsb for memcpy from Intel® 64 and IA-32 Architectures Optimization Reference Manual section 3.7.6. I came to the libc source code and I saw default memcpy from libc uses SSE instead of movsb.

Hence, I want to compare the performance between SSE instruction and rep movsb for memcpy. But now, I find something wrong with it.

Simple code to reproduce the problem (test.c)

#include <stdio.h>
#include <string.h>

inline static void *my_memcpy(
  register void *dest,
  register const void *src,
  register size_t n
) {
  __asm__ volatile(
    "mov %0, %%rdi;"
    "mov %1, %%rsi;"
    "mov %2, %%rcx;"
    "rep movsb;"
    :
    : "r"(dest), "r"(src), "r"(n)
    : "rdi", "rsi", "rcx"
  );
  return dest;
}

#define to_boolean_str(A) ((A) ? "true" : "false")

int main()
{
  char src[32];
  char dst[32];

  memset(src, 'a', 32);
  memset(dst, 'b', 32);

  my_memcpy(dst, src, 1);
  printf("%s\n", to_boolean_str(!memcmp(dst, src, 1)));

  my_memcpy(dst, src, 2);
  printf("%s\n", to_boolean_str(!memcmp(dst, src, 2)));

  my_memcpy(dst, src, 3);
  printf("%s\n", to_boolean_str(!memcmp(dst, src, 3)));

  return 0;
}

Compile and run

ammarfaizi2@integral:~$ gcc --version
gcc (Ubuntu 9.3.0-10ubuntu2) 9.3.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

ammarfaizi2@integral:~$ gcc -O0 test.c -o test && ./test
true
true
true
ammarfaizi2@integral:~$ gcc -O1 test.c -o test && ./test
false
true
true
ammarfaizi2@integral:~$ gcc -O2 test.c -o test && ./test
false
true
true
ammarfaizi2@integral:~$ gcc -O3 test.c -o test && ./test
false
true
true
ammarfaizi2@integral:~$ 

Summary

my_memcpy(dst, src, 1); results in wrong behavior if optimizations are enabled.

Michael Petch
  • 46,082
  • 8
  • 107
  • 198
Ammar Faizi
  • 1,393
  • 2
  • 11
  • 26
  • 1
    The `register` keyword doesn't really do anything anymore, other than preventing you from using the address-of operator on it, I believe. – Christian Gibbons Sep 30 '20 at 15:59
  • 5
    The main problem is that nothing tells the compiler that src, dest, and n may not be in rdi, rsi, and rcx. To fix this, use `asm volatile("rep movsb" : "+D"(dest), "+S"(src), "+c"(n) :: "memory")` – fuz Sep 30 '20 at 16:01
  • 7
    As a rule of thumb; whenever you have a `mov` instruction in inline assembly, you are very likely doing something wrong. Use constraints to place values into the correct registers before the statement executes. Do not move values around yourself. – fuz Sep 30 '20 at 16:02
  • 3
    @fuz: That's not the case. Their presence in the clobber list prevents them from being allocated for inputs or outputs. The asm as written is inefficient, but not erroneous, except for the missing `"memory"` clobber. – R.. GitHub STOP HELPING ICE Sep 30 '20 at 16:42
  • 2
    @ChristianGibbons : With GCC the `register` keyword is useful (not in this case though) if you need to pass data through a specific register to GCC's extended inline assembly that doesn't have a specific register constraint associated with it (anything outside of RAX, RBX, RCX, RDX, RSI, RDI) – Michael Petch Sep 30 '20 at 18:04
  • 2
    I would have coded it this way: https://godbolt.org/z/54KfdT . It also happens to show how to do it without the `memory` clobber by using dummy memory constraints and doesn't require `volatile`. If using C99 or later `src` and `dest` should really be qualified with `restrict` since the two memory regions aren't allowed to overlap. There is also a reasonable chance your `my_memcmp` and `my_memset` have issues as well. – Michael Petch Sep 30 '20 at 18:08
  • You should probably clear the direction flag too... – Antti Haapala -- Слава Україні Sep 30 '20 at 21:03
  • 3
    @AnttiHaapala: The x86-64 System V ABI requires that DF=0 on function entry and exit. IDK if there's any official guarantee that GCC won't `std` in the middle of a function, but in practice it will never emit `std`. So the de-facto situation is that asm statements can assume DF=0 on entry, and must have it that was on exit. I don't think GCC could change this without breaking some real-world code, so it's probably safe to continue assuming. – Peter Cordes Oct 01 '20 at 01:46
  • This question is not a duplicate of the ones it was marked duplicate of, because the problem at hand is not how REP MOVSB works or why it's not as fast as expected, but why the compiler was able to make transformations that broke OP's version with the constraints as written. As such, I've used my reopen-hammer. – R.. GitHub STOP HELPING ICE Oct 01 '20 at 03:14
  • 1
    @R..GitHubSTOPHELPINGICE come on, it's an exact duplicate of [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259), which explains the problem in detail. (To be honest, my answer there would benefit from more copy/pastable examples) I only moved the ERMSB question first in the duplicate list because the very top of that question has a safe inline-asm implementation the OP could copy/paste. I'd have appreciated if you discussed it in comments before re-opening; I can't now re-close it. (Although you can if you agree.) – Peter Cordes Oct 01 '20 at 03:18
  • 1
    Does this answer your question? [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/questions/56432259/how-can-i-indicate-that-the-memory-pointed-to-by-an-inline-asm-argument-may-be) . I concur with Peter that this is a duplicate. – Michael Petch Oct 01 '20 at 09:10
  • I think my question is not duplicate with that one. But indeed the @Peter Cordes's answer is related to my problem (I can also see the point that can resolve my problem from his answer), even if it does not directly answer the problem I had. Personally, it's related, not duplicate. However, I still appreciate that, upvoted the answer. – Ammar Faizi Oct 02 '20 at 04:08
  • 1
    @Ammar: Glad it helped :) It's not an *exact* duplicate only because you didn't realize why your `asm` constraints allowed those optimization, but the question and answer explain in detail the point you were missing, and the basic concept of the question is the same. That's sufficient to mark it a duplicate on Stack Overflow. I also re-added [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) as a second duplicate that shows a working `asm volatile` wrapper for `rep movsb` you can use. (And if you want to know about performance of `rep movsb`, read the answer there!) – Peter Cordes Oct 02 '20 at 04:42

1 Answers1

6

As written, your asm constraints do not reflect that the asm statement can modify memory, so the compiler can freely reorder it with respect to operations that read or write the memory at dest or src. You need to add "memory" to the clobber list.

As others have noted, you should also edit the constraints to avoid mov. If you do so, you'll need to also represent in the constraints the fact that the asm now modifies its arguments (e.g. make them all dual input/output) and backup the value of dest so you can return it. So you might skip this improvement until you've gotten it working to begin with and until you understand how constraints work.

R.. GitHub STOP HELPING ICE
  • 208,859
  • 35
  • 376
  • 711
  • 2
    [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) shows a working inline-asm statement wrapping just a `rep movsb`. It needs `volatile`, though, unlike [Michael Petch's version](https://stackoverflow.com/questions/64141318/how-to-prevent-gcc-optimization-breaking-rep-movsb-code#comment113427225_64141318) https://godbolt.org/z/54KfdT which uses the cast-to-array syntax shown in [How can I indicate that the memory \*pointed\* to by an inline ASM argument may be used?](https://stackoverflow.com/q/56432259) – Peter Cordes Oct 01 '20 at 01:48