
This question made me wonder whether modern compilers ever emit the REP MOVSB/W/D instructions.

Based on this discussion, it seems that using REP MOVSB/W/D could be beneficial on current CPUs.

But no matter how I tried, I could not get any of the current compilers (GCC 8, Clang 7, MSVC 2017 and ICC 18) to emit this instruction.

For this simple code, it could be reasonable to emit REP MOVSB:

void fn(char *dst, const char *src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}

But compilers emit a non-optimized simple byte-copy loop, or a huge unrolled loop (basically an inlined memmove). Do any of the compilers use this instruction?

Peter Cordes
geza
  • For that loop (without `__restrict`), or ever? gcc has `-mmemcpy-strategy=rep_4byte` and `-minline-all-stringops` to override the tuning options. https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html. – Peter Cordes Jul 01 '18 at 09:10
  • GCC will surprisingly inline `rep stosq` for zero-init with default tuning, though. (Or even `-march=haswell` where vector stores would be better: [Unexpected x64 assembly for __atomic_fetch_or with gcc 7.3](https://stackoverflow.com/posts/comments/89168898) / https://gcc.gnu.org/bugzilla/show_bug.cgi?id=86352) – Peter Cordes Jul 01 '18 at 09:11
  • @PeterCordes: I mean ever, but not helping the compiler. So not using __builtin_* things, just simple code. So the gist of my question is not: "how to make the compiler emit rep movsb", but rather "why don't compilers use this instruction in general/more often, as it seems a good alternative in a lot of cases?" – geza Jul 01 '18 at 09:23
  • Right of course a builtin / intrinsic is uninteresting, but I meant if you were restricting it to `-mtune=generic` or `-mtune=haswell` or any other of the preset tunings. (i.e. if you don't allow setting the string-op strategy.) – Peter Cordes Jul 01 '18 at 09:27
  • @PeterCordes: I allow every compiler option :) I just wondered, why this instruction isn't used at all, as it seems a good alternative. Instead of a byte-copy loop, it is surely better. And instead of inlining memmove, it is better in a lot of cases too (much smaller code, and on current cpus, it is not much slower), yet it seems that this instruction is not used by default. – geza Jul 01 '18 at 09:31
  • we can directly use [`__movsb`](https://msdn.microsoft.com/en-us/library/hhta9830.aspx) for *cl* – RbMm Jul 01 '18 at 09:31

1 Answer


GCC has x86 tuning options to control string-ops strategy and when to inline vs. library call (see https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html). -mmemcpy-strategy=strategy takes alg:max_size:dest_align triplets, but the brute-force way is -mstringop-strategy=rep_byte.

I had to use __restrict to get gcc to recognize the memcpy pattern, instead of just doing normal auto-vectorization after an overlap check / fallback to a dumb byte loop. (Fun fact: gcc -O3 auto-vectorizes even with -mno-sse, using the full width of an integer register. So you only get a dumb byte loop if you compile with -Os (optimize for size) or -O2 (less than full optimization)).

Note that if src and dst overlap with dst > src, the result is not memmove. Instead, you'll get a repeating pattern with length = dst-src. rep movsb has to correctly implement the exact byte-copy semantics even in case of overlap, so it would still be valid (but slow on current CPUs: I think microcode would just fall back to a byte loop).
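
To make that concrete, here's a tiny standalone example (a hypothetical bytecopy helper, same loop as in the question) showing the repeating pattern you get from exact byte-copy semantics when dst = src + 1; rep movsb would produce the same result:

#include <stdio.h>

/* Same loop as the question: exact byte-by-byte copy, like rep movsb. */
static void bytecopy(char *dst, const char *src, int l) {
    for (int i = 0; i < l; i++) {
        dst[i] = src[i];
    }
}

int main(void) {
    char buf[8] = "Xabcdef";     /* 7 chars + terminating 0 */
    bytecopy(buf + 1, buf, 6);   /* overlap, dst > src, dst - src == 1 */
    printf("%s\n", buf);         /* prints "XXXXXXX": the 1-byte pattern repeats */
    return 0;                    /* (memmove would have given "XXabcde") */
}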

gcc only gets to rep movsb via recognizing a memcpy pattern and then choosing to inline memcpy as rep movsb. It doesn't go directly from byte-copy loop to rep movsb, and that's why possible aliasing defeats the optimization. (It might be interesting for -Os to consider using rep movs directly, though, when alias analysis can't prove it's a memcpy or memmove, on CPUs with fast rep movsb.)

void fn(char *__restrict dst, const char *__restrict src, int l) {
    for (int i=0; i<l; i++) {
        dst[i] = src[i];
    }
}

This probably shouldn't "count", because I wouldn't recommend those tuning options for any use-case other than "make the compiler use rep movs", so it's not that different from an intrinsic. I didn't check all the -mtune=silvermont / -mtune=skylake / -mtune=bdver2 (Bulldozer version 2 = Piledriver) / etc. tuning options, but I doubt any of them enable that. So this is an unrealistic test, because nobody using -march=native would get this code-gen.

But the above C compiles with gcc8.1 -xc -O3 -Wall -mstringop-strategy=rep_byte -minline-all-stringops on the Godbolt compiler explorer to this asm for x86-64 System V:

fn:
        test    edx, edx
        jle     .L1               # rep movs treats the counter as unsigned, but the source uses signed
        sub     edx, 1            # what the heck, gcc?  mov ecx,edx would be too easy?
        lea     ecx, [rdx+1]

        rep movsb                 # dst=rdi and src=rsi
.L1:                              # matching the calling convention
        ret

Fun fact: the x86-64 SysV calling convention being optimized for inlining rep movs is not a coincidence (Why does Windows64 use a different calling convention from all other OSes on x86-64?). I think gcc favoured that when the calling convention was being designed, so it saved instructions.

rep_8byte does a bunch of setup to handle counts that aren't a multiple of 8, and maybe alignment; I didn't look carefully.

I also didn't check other compilers.


Inlining rep movsb would be a poor choice without an alignment guarantee, so it's good that compilers don't do it by default. (As long as they do something better.) Intel's optimization manual has a section on memcpy and memset with SIMD vectors vs. rep movs. See also http://agner.org/optimize/, and other performance links in the x86 tag wiki.

(I doubt that gcc would do anything differently if you did dst=__builtin_assume_aligned(dst, 64); or any other way of communicating alignment to the compiler, though. e.g. alignas(64) on some arrays.)
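
For reference, here's what communicating alignment with that builtin looks like (fn_aligned is a made-up name; __builtin_assume_aligned is the standard GNU C builtin, and as said above I wouldn't expect it to change the string-op decision — it mostly helps the vectorizer skip alignment prologues):

void fn_aligned(char *__restrict dst, const char *__restrict src, int l) {
    /* Promise the compiler both pointers are 64-byte aligned. */
    char *d = __builtin_assume_aligned(dst, 64);
    const char *s = __builtin_assume_aligned(src, 64);
    for (int i = 0; i < l; i++) {
        d[i] = s[i];
    }
}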

Intel's IceLake microarchitecture will have a "short rep" feature that presumably reduces startup overhead for rep movs / rep stos, making them much more useful for small counts. (Currently rep string microcode has significant startup overhead: What setup does REP do?)


memmove / memcpy strategies:

BTW, glibc's memcpy uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example.

Glibc's asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19.
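
A simplified C sketch of that idea for the 4..7 byte bucket (an illustration of the technique only, not glibc's actual code; copy_4_to_7 is a made-up name, and memcpy is just used for safe unaligned 4-byte loads/stores that compile to plain mov instructions):

#include <stdint.h>
#include <string.h>

/* Copy n bytes for 4 <= n <= 7: load the first 4 and the last 4 bytes,
   then store both.  The two chunks overlap when n < 8, and because both
   loads happen before either store, this is also safe if dst and src
   themselves overlap (memmove semantics for these sizes). */
static void copy_4_to_7(char *dst, const char *src, size_t n) {
    uint32_t head, tail;
    memcpy(&head, src, 4);          /* unaligned-safe 4-byte load */
    memcpy(&tail, src + n - 4, 4);
    memcpy(dst, &head, 4);
    memcpy(dst + n - 4, &tail, 4);
}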

For large inputs, it uses SSE XMM registers, AVX YMM registers, or rep movsb (after checking an internal config variable that's set based on CPU-detection when glibc initializes itself). I'm not sure which CPUs it will actually use rep movsb on, if any, but support is there for using it for large copies.


rep movsb might well be a pretty reasonable replacement for a byte loop like this: small code-size, non-terrible scaling with count, and safe handling for the unlikely case of overlap.

Microcode startup overhead is a big problem with using it for copies that are usually small, though, on current CPUs.

It's probably better than a byte loop if the average copy size is maybe 8 to 16 bytes on current CPUs, and/or different counts cause branch mispredicts a lot. It's not good, but it's less bad.

Some kind of last-ditch peephole optimization for turning a byte-loop into a rep movsb might be a good idea, if compiling without auto-vectorization. (Or for compilers like MSVC that make a byte loop even at full optimization.)
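
For reference, a minimal GNU C inline-asm sketch of the kind of copy such a peephole would effectively substitute (a hand-written illustration with a made-up helper name, x86-64 only; not something any of these compilers emit from plain C today):

#include <stddef.h>

/* rep movsb copies rcx bytes from [rsi] to [rdi], ascending
   (DF is 0 per the ABI, so no cld needed). */
static void rep_movsb_copy(void *dst, const void *src, size_t n) {
    __asm__ __volatile__ ("rep movsb"
                          : "+D"(dst), "+S"(src), "+c"(n)  /* rdi, rsi, rcx: read and written */
                          :
                          : "memory");
}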

It would be neat if compilers knew about it more directly, and considered using it for -Os (optimize for code-size more than speed) when tuning for CPUs with the Enhanced Rep Movs/Stos Byte (ERMSB) feature. (See also Enhanced REP MOVSB for memcpy for lots of good stuff about x86 memory bandwidth single threaded vs. all cores, NT stores that avoid RFO, and rep movs using an RFO-avoiding cache protocol...).
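
For completeness, here's roughly how ERMSB can be detected at runtime, the kind of check glibc's CPU detection (mentioned above) does. This assumes GCC/clang's cpuid.h and its __get_cpuid_count helper; the feature bit is CPUID leaf 7, subleaf 0, EBX bit 9:

#include <cpuid.h>   /* GCC/clang header providing __get_cpuid_count */

/* Nonzero if the CPU advertises Enhanced REP MOVSB/STOSB (ERMSB). */
static int has_ermsb(void) {
    unsigned int eax, ebx, ecx, edx;
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx))
        return 0;                 /* CPUID leaf 7 not supported */
    return (ebx >> 9) & 1;        /* CPUID.(EAX=7,ECX=0):EBX bit 9 = ERMS */
}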

On older CPUs, rep movsb wasn't as good for large copies, so the recommended strategy was rep movsd or movsq with special handling for the last few counts. (Assuming you're going to use rep movs at all, e.g. in kernel code where you can't touch SIMD vector registers.)
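
Extending the inline-asm sketch above, roughly what that older strategy looks like (again a hand-written illustration with a made-up helper name, not compiler output; a rep movsb tail is just one simple way to handle the leftover count):

#include <stddef.h>

/* Copy n/8 qwords with rep movsq, then the last n%8 bytes with rep movsb.
   rep movsq leaves rdi/rsi pointing just past the qword part, so the
   byte tail continues from there. */
static void rep_movsq_copy(void *dst, const void *src, size_t n) {
    size_t qwords = n / 8;
    size_t tail   = n % 8;
    __asm__ __volatile__ ("rep movsq"
                          : "+D"(dst), "+S"(src), "+c"(qwords)
                          : : "memory");
    __asm__ __volatile__ ("rep movsb"   /* fine even when tail == 0 */
                          : "+D"(dst), "+S"(src), "+c"(tail)
                          : : "memory");
}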

The -mno-sse auto-vectorization using integer registers is much worse than rep movs for medium sized copies that are hot in L1d or L2 cache, so gcc should definitely use rep movsb or rep movsq after checking for overlap, not a qword copy loop, unless it expects small inputs (like 64 bytes) to be common.


The only advantage of a byte loop is small code size; it's pretty much the bottom of the barrel; a smart strategy like glibc's would be much better for small but unknown copy sizes. But that's too much code to inline, and a function call does have some cost (spilling call-clobbered registers and clobbering the red zone, plus the actual cost of the call / ret instructions and dynamic linking indirection).

Especially in a "cold" function that doesn't run often (so you don't want to spend a lot of code size on it, increasing your program's I-cache footprint, TLB pressure, pages to be loaded from disk, etc). If writing asm by hand, you'd usually know more about the expected size distribution and be able to inline a fast-path with a fallback to something else.

Remember that compilers will make their decisions on potentially many loops in one program, and most code in most programs is outside of hot loops. It shouldn't bloat them all. This is why gcc defaults to -fno-unroll-loops unless profile-guided optimization is enabled. (Auto-vectorization is enabled at -O3, though, and can create a huge amount of code for some small loops like this one. It's quite silly that gcc spends huge amounts of code-size on loop prologues/epilogues, but tiny amounts on the actual loop; for all it knows the loop will run millions of iterations for each one time the code outside runs.)

Unfortunately it's not like gcc's auto-vectorized code is very efficient or compact. It spends a lot of code size on the loop cleanup code for the 16-byte SSE case (fully unrolling 15 byte-copies). With 32-byte AVX vectors, we get a rolled-up byte loop to handle the leftover elements. (For a 17 byte copy, this is pretty terrible vs. 1 XMM vector + 1 byte or glibc style overlapping 16-byte copies). With gcc7 and earlier, it does the same full unrolling until an alignment boundary as a loop prologue so it's twice as bloated.

IDK if profile-guided optimization would optimize gcc's strategy here, e.g. favouring smaller / simpler code when the count is small on every call, so auto-vectorized code wouldn't be reached. Or change strategy if the code is "cold" and only runs once or not at all per run of the whole program. Or if the count is usually 16 or 24 or something, then scalar for the last n % 32 bytes is terrible so ideally PGO would get it to special case smaller counts. (But I'm not too optimistic.)

I might report a GCC missed-optimization bug for this, about detecting memcpy after an overlap check instead of leaving it purely up to the auto-vectorizer. And/or about using rep movs for -Os, maybe with -mtune=icelake if more info becomes available about that uarch.

A lot of software gets compiled with only -O2, so a peephole for rep movs other than the auto-vectorizer could make a difference. (But the question is whether it's a positive or negative difference!)

Peter Cordes
  • Thanks for the explanation why `REP` is not used more frequently, now I understand the reason. But instead of the byte-copy loop, compilers should always use it, shouldn't they? – geza Jul 01 '18 at 09:42
  • @geza: yes, a byte-copy loop is worse unless the loop count is always small (like 1 to maybe 15) and doesn't branch mispredict on the loop-exit. (So the count has to be predictable if not constant.) – Peter Cordes Jul 01 '18 at 09:47
  • @geza: BTW, glibc's `memcpy` uses a pretty nice strategy for small inputs that's insensitive to overlap: Two loads -> two stores that potentially overlap, for copies up to 2 registers wide. This means any input from 4..7 bytes branches the same way, for example. The asm source has a nice comment describing the strategy: https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S.html#19. The only advantage of a byte loop is small code size; a smart strategy like glibc's would be much better for small but unknown copy sizes. – Peter Cordes Jul 01 '18 at 16:47
  • @geza: overlap is often an efficient way to auto-vectorize with aligned pointers for the main loop. e.g. do a potentially-unaligned first vector, then start from the first alignment boundary. At the end, do a vector that ends at the last element. Works great for copy-and-x, or for modify-in-place if the operation is idempotent (like lower-casing ASCII chars), or you're careful to read the old data before overwriting. Obviously doesn't work as easily for reductions like sum of an array; you have to mask the overlap. Or length < 1 vector. But it does work for search loops like memchr. – Peter Cordes Jul 01 '18 at 21:48
  • It's too bad gcc doesn't know how to auto-vectorize this way; gcc8.1 switched to normally just using unaligned loads/stores, while gcc7 and earlier go scalar until an alignment boundary, usually with fully unrolled loops. (Huge code bloat with 32B vectors and byte elements.) – Peter Cordes Jul 01 '18 at 21:49
  • Peter, I really appreciate the work you put into this answer, and into this site in general, thanks for it! It's always worth checking back even after your answer is accepted, because maybe you edited the answer to contain even more good stuff :) – geza Jul 02 '18 at 04:35
  • @geza: oh right, I forgot to ping you after the edit on this, glad you noticed it :) I normally do ping the OP after making a significant edit. Glad you found the new stuff useful / interesting. :) – Peter Cordes Jul 02 '18 at 04:37
  • @geza Seconded, couldn't agree more. The man knows his onions. – Paul Sanders Jul 02 '18 at 08:53