
I've been writing x86 assembly lately (for fun) and was wondering whether rep-prefixed string instructions actually have a performance edge on modern processors, or whether they're just implemented for backward compatibility.

I can understand why Intel would have originally implemented the rep instructions back when processors only ran one instruction at a time, but is there a benefit to using them now?

With a loop that compiles to more instructions, there is more to fill up the pipeline and/or be issued out-of-order. Are modern processors built to optimize for these rep-prefixed instructions, or are rep instructions used so rarely in modern code that they're not important to the manufacturers?
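For concreteness, here is roughly what the two alternatives look like in GNU C on x86-64 - an illustrative sketch, not a benchmark, and `fill_loop` / `fill_rep_stosb` are made-up names:

```c
#include <assert.h>
#include <stddef.h>

/* A plain store loop: several instructions per byte for the CPU
 * to fetch, decode and retire. */
static void fill_loop(unsigned char *dst, unsigned char val, size_t n) {
    for (size_t i = 0; i < n; i++)
        dst[i] = val;
}

/* The same fill as a single rep-prefixed instruction (x86-64,
 * GNU-style inline asm). rep stosb updates RDI and RCX itself,
 * so both are declared as in/out ("+") operands. */
static void fill_rep_stosb(unsigned char *dst, unsigned char val, size_t n) {
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(val)
                     : "memory");
}
```

Both produce the same result; the question is which one the microarchitecture executes faster.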

fatihyildizhan
RyanS
  • I haven't looked into this in, like, 5 years, but back then my personal experience was that at least rep movsd and rep stosd were faster than a simple loop whereas some of the scanning variants were not. That could have changed significantly since, though. – 500 - Internal Server Error Dec 08 '11 at 01:42
  • Conduct a test on different processors and see for yourself. – Alexey Frunze Dec 08 '11 at 02:00
  • Thanks for the input, guys. Alex: I probably will eventually, but I don't have lots of different procs to try it on, so it would just be on a real proc vs. an emulator that wouldn't have a pipeline. Also, I'm lazy and would rather not do that work if someone else might have already done it. :) – RyanS Dec 08 '11 at 04:24
  • Related: [lots of detail about x86 memory bandwidth](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy), NT stores vs. regular stores, and also stuff about how a single core can't always saturate memory bandwidth (see "latency bound platforms" in the answer there). Also some comparison of `rep movs` / `stos` vs. vector loops. – Peter Cordes Oct 19 '17 at 20:28

3 Answers


There is a lot of space given to questions like this in both AMD's and Intel's optimization guides. The validity of advice in this area has a "half-life": different CPU generations behave differently. For example:

The Intel Architecture Optimization Manual gives performance comparison figures for various block copy techniques (including rep stosd) in Table 7-2, Relative Performance of Memory Copy Routines (pg. 7-37f.), for different CPUs - and again, what's fastest on one might not be fastest on others.

For many cases, recent x86 CPUs (those with the SSE4.2 "string" instructions) can do string operations via the SIMD unit; see this investigation.
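As a sketch of what "string operations via the SIMD unit" can look like, here is a strlen-style scan built on SSE4.2's `pcmpistri` (GCC/Clang intrinsics; `sse42_strlen` is a made-up name, and the unaligned loads are a simplification - production code aligns first so a 16-byte load can't cross into an unmapped page):

```c
#include <assert.h>
#include <stddef.h>
#include <nmmintrin.h>  /* SSE4.2 intrinsics */

__attribute__((target("sse4.2")))
static size_t sse42_strlen(const char *s) {
    const __m128i zero = _mm_setzero_si128();
    size_t i = 0;
    for (;;) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        /* pcmpistri treats both operands as implicit-length strings;
         * with an all-zero first operand and EQUAL_EACH, it returns
         * the index of the terminating zero byte in `chunk`, or 16
         * if the chunk contains no zero byte. */
        int idx = _mm_cmpistri(zero, chunk,
                               _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_EACH |
                               _SIDD_LEAST_SIGNIFICANT);
        if (idx < 16)
            return i + idx;
        i += 16;
    }
}
```

One `pcmpistri` examines 16 bytes per iteration, which is the kind of throughput a byte-at-a-time `scasb` loop can't match.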

To follow up on all this (and/or keep yourself updated when things change again, inevitably), read Agner Fog's Optimization guides/blogs.

FrankH.
  • `rep movs` and `rep stos` are usually good (for medium to large aligned buffers); `repe / repne scas / cmps` are usually not. – Peter Cordes Oct 19 '17 at 20:09
  • Re: SSE4.2: they're maybe useful for `strstr` or other cases where you can take advantage of more of their full power, but typically not for `strcmp` or `strchr` because they're slower than `pcmpeqb`. [They're especially bad for `memcmp`](https://stackoverflow.com/questions/46762813/how-much-faster-are-sse4-2-string-instructions-than-sse2-for-memcmp/46763316#46763316) or explicit-length strings. – Peter Cordes Oct 19 '17 at 20:18

In addition to FrankH's excellent answer, I'd like to point out that which method is best also depends on the string's length, its alignment, and whether the length is fixed or variable.

For small strings (maybe up to about 16 bytes), doing it manually with simple instructions is probably faster, as it avoids the setup costs of more complex techniques (and, for fixed-size strings, can easily be unrolled). For medium-sized strings (maybe from 16 bytes to 4 KiB), something like "REP MOVSD" (with some "MOVSB" instructions thrown in if misalignment is possible) is likely to be best.

For anything larger than that, some people would be tempted to go into SSE/AVX, prefetching, etc. A better idea is to fix the caller(s) so that copying (or strlen() or whatever) isn't needed in the first place; if you try hard enough, you'll almost always find a way. Note: also be very wary of "supposedly fast" memcpy() routines - typically they've been tested on massive strings and not on the far more likely tiny/small/medium strings.

Also note that (for the purpose of optimisation rather than convenience) due to all these differences (likely length, alignment, fixed or variable size, CPU type, etc) the idea of having one multi-purpose "memcpy()" for all of the very different cases is near-sighted.
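An illustrative dispatcher following this small/medium/large split might look like the following (x86-64 GNU C; the name and the thresholds are hypothetical, chosen only to show the structure, not measured):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical size-dispatched copy: inline loop for small blocks,
 * rep movsq for aligned medium blocks, library memcpy otherwise. */
static void copy_dispatch(void *dstv, const void *srcv, size_t n) {
    unsigned char *dst = dstv;
    const unsigned char *src = srcv;
    if (n <= 16) {
        /* Small: a simple inlined byte loop, no setup cost. */
        while (n--)
            *dst++ = *src++;
    } else if (n <= 4096 && n % 8 == 0) {
        /* Medium, multiple of 8 bytes: one rep movsq (x86-64).
         * RDI, RSI and RCX are all modified by the instruction,
         * hence the "+" in/out constraints. */
        size_t q = n / 8;
        __asm__ volatile("rep movsq"
                         : "+D"(dst), "+S"(src), "+c"(q)
                         :
                         : "memory");
    } else {
        /* Large or odd-sized: defer to the library memcpy. */
        memcpy(dst, src, n);
    }
}
```

The thresholds (16 bytes, 4 KiB) are exactly the kind of tunable that differs per CPU generation, which is the multi-purpose-memcpy() problem in miniature.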

Brendan
  • Ack. The optimization guides (both Intel's and AMD's, as well as Agner Fog's materials and many others) mention these things as well; in many cases the strategy is: 1. for short strings, inlined primitive instructions; 2. for medium sizes, large-operand-size `rep movs`; 3. for known large blocks, use the SIMD units. And always test on _your_ data, since the "ultra-fast AVX" performance will break down if most of your strings are <8 bytes. – FrankH. Dec 08 '11 at 14:30
  • IIRC `REP MOVSD` is, on modern hardware, often *much slower* than `REP MOVSB`. Probably because modern CPUs have special optimizations only for `REP MOVSB`, because it's used far more often than `REP MOVSD`. – Paul Groke Jan 29 '16 at 21:15
  • @PaulGroke: There are maybe a couple CPUs where `rep movsb` is better than `rep movsd`, but most implement all the ERMSB magic for `rep movsd` / `movsq` as well. And `rep movsb` was usually *worse* on Intel CPUs before IvyBridge's Enhanced Rep MovSB feature. See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy), which has an *excellent* answer with lots of detail about x86 memory bandwidth. – Peter Cordes Oct 19 '17 at 20:24

Since no one has given you any numbers, I'll give you some that I found by benchmarking my garbage collector, which is very memcpy-heavy. About 60% of the objects it copies are 16 bytes long, and the remaining 30% are 500 - 8000 bytes or so.

  • Precondition: dst, src and n are all multiples of 8.
  • Processor: AMD Phenom(tm) II X6 1090T Processor 64bit/linux

Here are my three memcpy variants:

Hand-coded while-loop:

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    ptr *end = dst + n_ptrs;
    while (dst < end) {
        *dst++ = *src++;
    }
}

(ptr is an alias for uintptr_t.) Time: 101.16%

rep movsb

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    asm volatile("cld\n\t"
                 "rep ; movsb"
                 : "=D" (dst), "=S" (src)
                 : "c" (n), "D" (dst), "S" (src)
                 : "memory");
}

Time: 103.22%

rep movsq

if (n == 16) {
    *dst++ = *src++;
    *dst++ = *src++;
} else {
    size_t n_ptrs = n / sizeof(ptr);
    asm volatile("cld\n\t"
                 "rep ; movsq"
                 : "=D" (dst), "=S" (src)
                 : "c" (n_ptrs), "D" (dst), "S" (src)
                 : "memory");
}

Time: 100.00%

rep movsq wins by a tiny margin.

Björn Lindqvist
  • The RCX register is changed by REP MOVS as well. – Ross Ridge Apr 14 '16 at 17:02
  • How do we fix the above code to declare the change to CX? (Declare it sets it to 0?) – Cecil Ward Oct 06 '18 at 05:03
  • @CecilWard: [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) has safe inline asm for `rep movsb`. Another option would be to use `"+c"(n)` as an in/out operand. If you never read that C variable later, the compiler will effectively know the input register was destroyed. – Peter Cordes May 02 '21 at 05:59
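Applying that suggestion, a corrected version of the answer's `rep movsq` fragment could look like this (a sketch; `copy_movsq` is a made-up name). All three registers the instruction modifies become in/out operands, and the `cld` is dropped because the x86-64 System V ABI guarantees the direction flag is already clear on function entry:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uintptr_t ptr;  /* same alias the answer uses */

/* Copy n bytes (n a multiple of 8) with rep movsq. Declaring
 * dst, src and n_ptrs as "+D"/"+S"/"+c" in/out operands tells the
 * compiler all three registers are modified, fixing the missing
 * RCX update noted in the comments above. */
static void copy_movsq(ptr *dst, const ptr *src, size_t n) {
    size_t n_ptrs = n / sizeof(ptr);
    __asm__ volatile("rep movsq"
                     : "+D"(dst), "+S"(src), "+c"(n_ptrs)
                     :
                     : "memory");
}
```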