REP MOVSB for overlapped memory

Question

Does the rep movsb instruction behave differently when the memory regions pointed to by rdi and rsi overlap, compared with when they don't?
In other words, is there any difference between implementing memcpy and implementing memmove with rep movsb?

In this documentation https://www.amd.com/system/files/TechDocs/24594.pdf I read the following:

Depending on the hardware implementation, string moves with the direction flag (DF) cleared to 0 (up) may be faster than string moves with DF set to 1 (down). DF = 1 is only needed for certain cases of overlapping REP MOVS, such as when the source and the destination overlap.

It's unclear (to me) what you're asking, but that formulation that you quote isn't particularly lucid either, IMHO. What cases of overlapping REP MOVS are there, other than when the source and destination overlap? — 500 - Internal Server Error, Jan 16 '22 at 21:25
@500-InternalServerError is there a difference in implementation memcpy and memmove via rep movsb? — xperious, Jan 16 '22 at 21:30
Between the two, or between using rep movsb and not? memcpy does not promise support for overlapping buffers, so in that you could dispense with the check for overlapping buffers and not mess with the direction flag. memmove, OTOH, works as if there was an intermediate buffer involved. — 500 - Internal Server Error, Jan 16 '22 at 21:37
@500, the quote means DF=1 is needed for overlapping moves for the case when the destination address is greater than the source address, but not when the source address is greater. But I agree the "such as" part is meaningless. Probably an editing error. — prl, Jan 17 '22 at 00:02

score 3 · Answer 1 · answered Jan 16 '22 at 21:43

rep movsb always behaves exactly as if it ran the loop below. Sometimes it can run fast (wide loads/stores) and still be equivalent; sometimes it has to run slowly to preserve the exact semantics, when dst is close to src in the direction of DF.

char *rdi, *rsi;
size_t rcx;         // incoming register "args"

for( ; rcx != 0 ; rcx--) {       // rep movsb.  Interruptible after a complete iteration
    *rdi = *rsi;
    rdi += (DF == 0 ? 1 : -1);
    rsi += (DF == 0 ? 1 : -1);
}

If run with dst = src+1, DF=0, and count = 16 for example, that loop (and thus rep movsb) would repeat the first byte 16 times. Each load would read the value stored by the previous store.
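The byte-repeating effect described above can be reproduced with a small software model of the loop. This is only a sketch: `rep_movsb_model` is a hypothetical helper name, the direction flag is passed as an ordinary `df` parameter instead of living in RFLAGS, and the pointers are indexed rather than incremented so the model never forms a pointer before the start of a buffer.

```c
#include <stddef.h>

/* Hypothetical software model of `rep movsb` (not a real API).
 * rdi/rsi/rcx mirror the register inputs; df mirrors the direction flag:
 * 0 = addresses go up, 1 = addresses go down. */
static void rep_movsb_model(unsigned char *rdi, const unsigned char *rsi,
                            size_t rcx, int df)
{
    for (size_t i = 0; i < rcx; i++) {
        /* Offset from the starting pointers for iteration i. */
        ptrdiff_t off = df ? -(ptrdiff_t)i : (ptrdiff_t)i;
        rdi[off] = rsi[off];   /* one byte per iteration, load then store */
    }
}
```

Calling `rep_movsb_model(buf + 1, buf, 16, 0)` on a 17-byte buffer fills all 17 bytes with the original `buf[0]`. With `df = 1` and both pointers starting at the last byte of their ranges (`buf + 16` and `buf + 15`), the same overlapping copy shifts the data forward by one byte intact.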

That's a valid implementation of memcpy, because ISO C doesn't define the behaviour in the overlap case.

But it's not a valid implementation of memmove, which is required to copy as if it read all of the source before overwriting any of the destination (see cppreference). In this example, memmove must move every byte forward by one position.

The standard way to achieve that without actually bouncing all the data to a temporary buffer and back is to detect if overlap would be a problem for going forwards, and if so work backwards from the ends of the buffers.

Alternatively, on systems where copying backwards is just as efficient, you can simply branch on an unsigned dst > src compare without bringing the size into it. See "Should pointer comparisons be signed or unsigned in 64-bit x86?" for the details of how one would do a comparison for possible overlap like dst+size > src.
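A minimal sketch of that strategy in portable C, assuming a flat address space where converting pointers to `uintptr_t` gives comparable addresses (true on x86-64, but not guaranteed by ISO C). `memmove_sketch` is a hypothetical name, and the single unsigned subtract-and-compare is one common way to write the overlap check, not necessarily what any particular libc does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical memmove sketch: copy forward unless that would overwrite
 * source bytes that have not been read yet. */
static void *memmove_sketch(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Unsigned wraparound makes this one compare cover both safe cases:
     * dst below src (difference wraps to a huge value) and
     * dst at or beyond src + n. */
    if ((uintptr_t)d - (uintptr_t)s >= n) {
        for (size_t i = 0; i < n; i++)      /* forward, like DF=0 */
            d[i] = s[i];
    } else {
        for (size_t i = n; i != 0; i--)     /* backward, from the ends */
            d[i - 1] = s[i - 1];
    }
    return dst;
}
```

With `dst = src + 1` this takes the backward path and shifts the data by one byte instead of repeating the first byte. In a tuned implementation the byte loops would be replaced by wide copies.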

Performance

And yes, as AMD says, in current CPUs from AMD and Intel, it's much faster for DF=0, with DF=1 falling back to an actual byte-at-a-time microcode loop, instead of using fast-strings / ERMSB microcode that goes 16, 32, or 64 bytes at a time.

For medium-sized copies and larger (a couple of KiB or more), rep movsb on aligned src and dst with DF=0 is similar in speed to an unrolled SIMD loop with the widest vectors the CPU supports, within maybe 10 or 20% on most CPUs. (This also assumes that dst is far enough ahead of src to not cause overlap with wide SIMD loads in the microcode, or that it's below src. You could test what distance produces a fallback to a slow path.)
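For experimenting with the DF=0 case from C, a thin wrapper around the instruction can look like the sketch below. It assumes GNU-style inline asm (GCC or clang) on x86-64 and relies on the calling convention's guarantee that DF is 0 in compiled code; `copy_rep_movsb` is a hypothetical name, and on other targets the sketch falls back to the equivalent byte loop so that it still compiles.

```c
#include <stddef.h>

/* Hypothetical wrapper: forward copy using `rep movsb` where available. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* RDI = dst, RSI = src, RCX = count; all three are modified by the
     * instruction, hence the "+" read-write constraints. */
    __asm__ volatile("rep movsb"
                     : "+D"(dst), "+S"(src), "+c"(n)
                     :
                     : "memory");
#else
    /* Portable fallback with the same byte-at-a-time forward semantics. */
    unsigned char *d = dst;
    const unsigned char *s = src;
    for (size_t i = 0; i < n; i++)
        d[i] = s[i];
#endif
}
```

Timing this wrapper with dst at various small distances above src is one way to look for the slow-path fallback mentioned above. The result for overlapping buffers is the same either way, since the wide-copy fast path has to preserve the byte-at-a-time semantics.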

@xperious: Yes it uses wide load/store, although I wouldn't really call that SSE or AVX because this is just internal microcode, not x86 machine instructions. And "fast strings" was new in Pentium Pro, before SSE or MMX existed, using 8-byte chunks. [How can the rep stosb instruction execute faster than the equivalent loop?](https://stackoverflow.com/q/33480999) / [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) — Peter Cordes, Jan 16 '22 at 23:32
[MOVSD performance depends on arguments](https://stackoverflow.com/q/57137074) confirms that `rep movsd` is *much* slower for DF=1 on Skylake, like 90GiB/s L1d bandwidth vs. 3.77GiB/s, with some perf-counter analysis of the uops executed. So a memmove implementation should manually use SSE2 (baseline for x86-64) instead of considering `rep movsb` with DF=1 for the overlap case. AMD says it's "only needed", but actually they should say "unusably slow, pick a different strategy" unless you care more about code size than performance. — Peter Cordes, Jan 16 '22 at 23:33
Out of curiosity: Is there a reason why DF=1 does not get optimized microcode? Is saving storage in the microcode ROM so important? — Homer512, Jan 17 '22 at 16:56
@Homer512: I don't know of any obvious compelling reason. Microcode has to branch on DF during startup overhead anyway, so IDK why they couldn't provide a decently optimized path for reverse copying instead of going directly to the bad fallback. I doubt space is important these days, so it might just be inertia (nobody uses it because it's dog slow -> no incentive for vendors to make it fast). And/or simply not worth the development effort, including *validation* of testing every corner case for possible bad interactions with anything. More complexity = more validation required. — Peter Cordes, Jan 17 '22 at 17:02
