What C++ code compiles down to the x86 REP instruction?

Question

I'm copying elements from one array to another in C++. I found the rep movs instruction in x86 that seems to copy an array at ESI to an array at EDI of size ECX. However, neither the for nor while loops I tried compiled to a rep movs instruction in VS 2008 (on an Intel Xeon x64 processor). How can I write code that will get compiled to this instruction?

Let me get this straight. You want to use C++ (a medium- to high-level language) to write assembler instructions? What's next? You want to use C++ to attach a diode to your motherboard? — JUST MY correct OPINION, Jan 27 '11 at 05:40
@Michael: Not portably. For example, for MSVC it's not even supported on x64, and it's deprecated (in favor of intrinsics) on x86. — Billy ONeal, Jan 27 '11 at 06:03
No, C++ doesn't have assembly blocks. If you think I'm wrong, please feel free to cite the relevant page of the standard. (Hint: this is not possible.) Some C++ **compilers** may have assembly blocks. This is non-standard and non-portable, however. The question remains meaningless. For example I took a look at my GCC compiler for my SPARC machine. Strangely enough I couldn't find a `rep` instruction.... — JUST MY correct OPINION, Jan 27 '11 at 06:05
@JUST Good heavens. I'm glad no one uses C and C++ to write operating systems, then. — Crashworks, Jan 27 '11 at 06:06
@JUST ...nonetheless, it's not like his question was patently insane. It may be a bad idea, but "You want to use C++ to attach a diode to your motherboard?" is just rude — Michael Mrozek, Jan 27 '11 at 06:09
@Crashworks: They don't. They use C and assembler or C++ and assembler. I'm pretty damned sure that there's exactly ZERO operating systems written only in C or C++. — JUST MY correct OPINION, Jan 27 '11 at 07:16
@Just, why do you have such a problem with this question? He learned about a certain CPU instruction used with loops, so he wrote a loop in C++ and checked what CPU instructions it yielded. They weren't what he was hoping for, so now the question asks what C++ code he *can* write that might use the CPU instruction he's interested in. It has nothing to do with whether the compiler supports asm blocks. If I might paraphrase you: "The question about code-generation is meaningless because my Sparc doesn't have the same instruction set as a Xeon." Come on! — Rob Kennedy, Jan 27 '11 at 07:55
@Just, assembler declarations (not blocks though) are described in section [dcl.asm], in C++03 it corresponds to 7.4. — avakar, Jan 27 '11 at 08:06
@Just, you're exactly right: "ZERO operating systems written only in C or C++" Assembly is used too. Be a professional and answer the question, it is valid you know. — Olof Forshell, Jan 27 '11 at 15:09
This question only has meaning in terms of a specified compiler running on specified hardware. The OP mentions *his* compiler and hardware, but that's no help because we don't know what compiler--hardware pairing was used to create the code in question. Not a Real Question without more details. — dmckee --- ex-moderator kitten, Jan 28 '11 at 23:12

score 11 · Answer 1 · answered Jan 27 '11 at 05:58

Honestly, you shouldn't. REP is sort of an obsolete holdover in the instruction set, and actually pretty slow since it has to call a microcoded subroutine inside the CPU, which has a ROM lookup latency and is nonpipelined as well.

In almost every implementation, you will find that the memcpy() compiler intrinsic both is easier to use and runs faster.
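As a minimal sketch of that advice, the two helpers below copy an array with `std::memcpy` and with `std::copy`. The function names are hypothetical, chosen only for illustration, and the buffers are assumed not to overlap. What the compiler emits for either one (an inlined move sequence, a library call, or possibly a `rep movs`) depends on the compiler, its options and the copy size.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: byte-wise copy of `count` ints, buffers must not overlap.
void copy_ints_memcpy(int* dst, const int* src, std::size_t count)
{
    assert(count == 0 || (dst && src));
    std::memcpy(dst, src, count * sizeof(int));
}

// Hypothetical helper: the same copy expressed with the standard algorithm.
// For trivially copyable types, library implementations commonly lower this
// to a memmove/memcpy-style copy.
void copy_ints_std(int* dst, const int* src, std::size_t count)
{
    std::copy(src, src + count, dst);
}
```

Calling `copy_ints_memcpy(dst, src, n)` with a compile-time constant `n` gives the compiler the most room to pick an instruction pattern for that exact size.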

Crashworks


REP is not an instruction, it's an instruction prefix. It's also far from obsolete (see amd64 instruction set). – Michael Foukarakis Jan 27 '11 at 08:01
@Michael Foukarakis See the "AMD Software Optimization Guide For AMD64 Processors", section 8.3. "Avoid using the REP prefix when performing string operations, especially when copying blocks of memory. In general, using the REP prefix to repeatedly perform string instructions is less optimal than other methods, especially when copying blocks of memory." – Crashworks Jan 27 '11 at 08:03
Interesting. I know this is off-topic, but what would be -- in x86 or amd64 assembler terms -- an optimal way to copy a block of memory? – avakar Jan 27 '11 at 08:07
@avakar: It can vary a little depending on the particular chipset and stepping, but that same document has an optimal algorithm in section 5.13: 32 bytes at a time, by issuing pairs of 64-bit `mov` ops. The `memcpy()` intrinsics in GCC, MSVC, and ICC are all smart enough to issue the optimal instruction pattern for a given block size. – Crashworks Jan 27 '11 at 08:12
@Crashworks: I'm aware it's not optimal, I didn't dispute that. – Michael Foukarakis Jan 27 '11 at 08:23
@Michael Foukarakis: The only reason to keep `rep` in the 64-bit instruction set is backward compatibility. – ruslik Jan 27 '11 at 08:59
@Crashworks: is it always optimal to call a routine that is centered around copying 32 bytes at a time? What if you only need to copy 31? Or 12? – Olof Forshell Jan 27 '11 at 14:53
@Ruslik: "The only reason to keep rep in the 64-bit instruction set is backward compatibility": you know this for a fact? – Olof Forshell Jan 27 '11 at 14:54
@Olof: Obviously one deals with the <32-byte remainder by performing a smaller number of individual moves (e.g. Duff's Device). It's still more efficient to issue the move ops individually than to hit the microcoded REP. I recommend you read AMD's and Intel's source docs on the subject; they are clearer and more authoritative than Some Guy On The Internet. – Crashworks Jan 27 '11 at 21:10
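
The scheme discussed in these comments (32 bytes per iteration as 64-bit moves, then a short tail for the remainder) could be sketched roughly as follows. This is an illustration only, not the code from the AMD guide: the function name `copy_blocks32` is hypothetical, the buffers are assumed not to overlap, and a production `memcpy` also handles alignment, prefetching and size-specific paths that are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical sketch: copy n bytes, 32 at a time, using four 64-bit loads
// followed by four 64-bit stores. The fixed-size 8-byte memcpy calls are the
// portable way to express an unaligned 64-bit load/store; compilers typically
// turn each one into a single mov.
void copy_blocks32(void* dst, const void* src, std::size_t n)
{
    assert(n == 0 || (dst && src));
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);

    while (n >= 32) {
        std::uint64_t q0, q1, q2, q3;
        std::memcpy(&q0, s,      8);
        std::memcpy(&q1, s + 8,  8);
        std::memcpy(&q2, s + 16, 8);
        std::memcpy(&q3, s + 24, 8);
        std::memcpy(d,      &q0, 8);
        std::memcpy(d + 8,  &q1, 8);
        std::memcpy(d + 16, &q2, 8);
        std::memcpy(d + 24, &q3, 8);
        s += 32;
        d += 32;
        n -= 32;
    }

    // Tail: fewer than 32 bytes remain, copied one byte at a time.
    while (n > 0) {
        *d++ = *s++;
        --n;
    }
}
```

Whether this beats `rep movs` for a given size is something to measure on the target CPU rather than assume.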

Necrolis · Answer 2 · 2011-01-27T10:18:10.767

Under MSVC, the __movsxxx and __stosxxx intrinsics (for example `__movsb` and `__stosd`) generate a REP-prefixed instruction.
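
A minimal sketch of using one of these intrinsics, assuming MSVC targeting x86 or x64 with `<intrin.h>` available. The wrapper name `copy_bytes_rep` and the macro `HAVE_MOVSB_INTRINSIC` are hypothetical, and the `memcpy` fallback is there only so the snippet still builds with other compilers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#include <intrin.h>
#define HAVE_MOVSB_INTRINSIC 1   // hypothetical feature macro for this sketch
#endif

// Hypothetical wrapper: copy `count` bytes from src to dst (no overlap).
void copy_bytes_rep(unsigned char* dst, const unsigned char* src, std::size_t count)
{
    assert(count == 0 || (dst && src));
#ifdef HAVE_MOVSB_INTRINSIC
    // MSVC intrinsic: emits REP MOVSB with the destination, source and count
    // placed in the registers the instruction expects.
    __movsb(dst, src, count);
#else
    // Other toolchains: leave the choice of instructions to the library.
    std::memcpy(dst, src, count);
#endif
}
```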

There is also a 'hack' to force an intrinsic memset, i.e. REP STOS, under VC9+, as the intrinsic no longer exists due to the SSE2 branching in the CRT. This is better than __stosxxx because the compiler can optimize it for constants and order it correctly.

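#include <windows.h>   // DWORD, WORD, BYTE

// Overload taking DWORD*, so the macro below selects this version rather than the CRT memset.
__forceinline void memset(DWORD* pStart, unsigned long dwFill, size_t nSize)
{
    //credits to Nepharius for finding this
    DWORD* pLast = pStart + (nSize >> 2);   // fill whole DWORDs first
    while(pStart < pLast)
        *pStart++ = dwFill;

    if((nSize &= 3) == 0)                   // 0-3 trailing bytes remain
        return;

    if(nSize == 3)
    {
        (((WORD*)pStart))[0]   = WORD(dwFill);
        (((BYTE*)pStart))[2]   = BYTE(dwFill);
    }
    else if(nSize == 2)
        (((WORD*)pStart))[0]   = WORD(dwFill);
    else
        (((BYTE*)pStart))[0]   = BYTE(dwFill);
}

// Defined after the function: placed before it, the macro would rewrite the definition itself.
// The cast to BYTE avoids sign extension; the multiply replicates the fill byte into all four bytes.
#define memset(mem,fill,size) memset((DWORD*)(mem),((DWORD)(BYTE)(fill) * 0x01010101u),(size))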

Of course, REP isn't always the best thing to use. IMO you're way better off using memcpy: it'll branch to either SSE2 or REP MOVS based on your system (under MSVC), unless you feel like writing custom assembly for 'hot' areas...

score 3 · Accepted Answer · edited May 23 '17 at 12:13

If you need exactly that instruction, use the compiler's built-in (inline) assembler and write the instruction manually. You can't rely on the compiler to produce any specific machine code: even if it emits it in one compilation, it can decide to emit some other equivalent during the next compilation.
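
As a rough sketch of that approach, assuming GCC or Clang targeting x86 or x86-64 (MSVC uses a different `__asm` syntax, which, as noted in the comments on the question, is not available for x64). The function name `copy_bytes_asm` is hypothetical, and the `memcpy` fallback only keeps the snippet buildable elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical wrapper: copy `count` bytes with an explicit REP MOVSB.
void copy_bytes_asm(void* dst, const void* src, std::size_t count)
{
    assert(count == 0 || (dst && src));
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__i386__) || defined(__x86_64__))
    // "+D", "+S" and "+c" pin the operands to (E/R)DI, (E/R)SI and (E/R)CX and
    // mark them as read and written, since the instruction updates all three.
    // The "memory" clobber tells the compiler that memory is read and written.
    // The direction flag is assumed clear, as the common x86 ABIs require.
    __asm__ __volatile__("rep movsb"
                         : "+D"(dst), "+S"(src), "+c"(count)
                         :
                         : "memory");
#else
    std::memcpy(dst, src, count);
#endif
}
```

The asm statement is opaque to the optimizer, so it cannot be folded or specialized for constant sizes the way a `memcpy` call can.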

answered Jan 27 '11 at 08:04

sharptooth


Writing the instruction manually will often upset the compiler's optimization and in such cases - if speed is important - you're better off calling the library routines. – Olof Forshell Jan 27 '11 at 14:44
@Olof Forshell: Well, yes. But why would someone need specifically this instruction anyway? – sharptooth Jan 27 '11 at 14:56
As I've written in an answer here there are specific situations where an inline rep movsb/movsw/movsd et al will be faster and more compact resulting in less cache work on the instruction side. If I want to copy less than 32 bytes why call a routine somewhere else which is optimized for 32 byte chunks when I can do it faster and less disruptive inline? – Olof Forshell Jan 27 '11 at 19:38
Thank you for your answer. :) I just wonder in what cases the compiler generates a 'rep' instruction for a 'for' or 'while' loop. – securelsh Jan 28 '11 at 04:58
@securelsh : It doesn't, if it's a good compiler. The rep prefix just makes everything slower in modern x86 implementations. – Crashworks Jan 28 '11 at 08:31

Nand Xorsson · Answer 4 · 2017-08-03T08:23:15.570

REP and friends were nice once upon a time, when the x86 CPU was a single-pipeline industrial CISC processor.

But that has changed. Nowadays, when the processor encounters any instruction, the first thing it does is translate it into an easier format (VLIW-like micro-ops) and schedule it for future execution (this is part of out-of-order execution and of scheduling between different logical CPU cores, and it can be used to simplify write-after-write sequences into single writes, etc.). This machinery works well for instructions that translate into a few VLIW-like opcodes, but not for machine code that translates into loops. Loop-translated machine code will probably cause the execution pipeline to stall.

Rather than spending hundreds of thousands of transistors on CPU circuitry for handling looping portions of the micro-ops in the execution pipeline, they just handle it in some sort of crappy legacy mode that stutters and stalls the pipeline, and ask modern programmers to write their own damn loops!

Therefore it is seldom used when machines write code. If you encounter REP in a binary executable, it's probably the work of a human assembly-muppet who didn't know better, or of a cracker who really needed the few bytes it saves over an actual loop.

(However. Take everything I just wrote with a grain of salt. Maybe this is not true anymore. I am not 100% up to date with the internals of x86 CPUs anymore, I got into other hobbies..)

Olof Forshell · Answer 5 · 2011-01-27T14:49:55.020

I use the rep* prefix variants with the cmps*, movs*, scas* and stos* instruction variants to generate inline code, which minimizes the code size, avoids unnecessary calls/jumps and thereby keeps down the work done by the caches. The alternative is to set up parameters and call a memset or memcpy somewhere else. That may be faster overall if I want to copy a hundred bytes or more, but if it's just a matter of 10-20 bytes, using rep is faster (or at least was the last time I measured).

Since my compiler allows the specification and use of inline assembly functions, and includes their register usage/modification in its optimization, it is possible for me to use them when the circumstances are right.

score 0 · Answer 6 · edited Feb 28 '11 at 23:25

On a historic note - not having any insight into the manufacturer's strategies - there was a time when the "rep movs*" (etc) instructions were very slow. I think it was around the time of the Pentium/Pentium MMX. A colleague of mine (who had more insight than I) said that the manufacturers had decreased the chip area (<=> fewer transistors/more microcode) allocated to the rep handling and used it to make other, more used instructions faster.

In the fifteen years or so since then, rep has become, relatively speaking, faster again, which would suggest more transistors/less microcode.
