
I'm checking the Release build of my project, done with the latest version of the VS 2017 C++ compiler, and I'm curious why the compiler chose to compile the following code snippet:

//ncbSzBuffDataUsed of type INT32

UINT8* pDst = (UINT8*)(pMXB + 1);
UINT8* pSrc = (UINT8*)pDPE;
for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
{
    pDst[i] = pSrc[i];
}

as such:


        UINT8* pDst = (UINT8*)(pMXB + 1);
        UINT8* pSrc = (UINT8*)pDPE;
        for(size_t i = 0; i < (size_t)ncbSzBuffDataUsed; i++)
00007FF66441251E 4C 63 C2             movsxd      r8,edx  
00007FF664412521 4C 2B D1             sub         r10,rcx  
00007FF664412524 0F 1F 40 00          nop         dword ptr [rax]  
00007FF664412528 0F 1F 84 00 00 00 00 00 nop         dword ptr [rax+rax]  

00007FF664412530 41 0F B6 04 0A       movzx       eax,byte ptr [r10+rcx]  
        {
            pDst[i] = pSrc[i];
00007FF664412535 88 01                mov         byte ptr [rcx],al  
00007FF664412537 48 8D 49 01          lea         rcx,[rcx+1]  
00007FF66441253B 49 83 E8 01          sub         r8,1  
00007FF66441253F 75 EF                jne         _logDebugPrint_in_MainXchgBuffer+0A0h (07FF664412530h)  
        }

versus just using a single REP MOVSB instruction? Wouldn't the latter be more efficient?
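
(For comparison, here is a minimal sketch of the same copy expressed as a single library call, as the comments below end up discussing; the wrapper function and parameter names are hypothetical, and `memmove` is used rather than `memcpy` because the source can overlap the destination in this case.)

#include <cstring>   // std::memmove
#include <cstdint>   // std::int32_t

// Hypothetical standalone equivalent of the byte loop above.
void copy_buffer(void* pDst /* = pMXB + 1 */, const void* pSrc /* = pDPE */,
                 std::int32_t ncbSzBuffDataUsed)
{
    // memmove handles overlapping source/destination buffers.
    std::memmove(pDst, pSrc, (std::size_t)ncbSzBuffDataUsed);
}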

Peter Cordes
c00000fd
  • Hmmm. Why is curiosity so often punished with a downvote? – Paul Sanders Jul 01 '18 at 05:44
  • What's the surrounding code? Can the compiler prove that the src and dst don't overlap? Declaring function args or globals with `__restrict`, like `uint8_t *__restrict pDPE`, will promise the compiler that the pointed-to memory isn't also accessed any other way. Failed aliasing analysis defeats auto-vectorization in general. And BTW, a vector copy loop is usually slightly better than `rep movsb` ([Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231)), but for large copies maybe only with runtime CPU dispatching for AVX, because `rep movsb` can use 256b loads/stores. – Peter Cordes Jul 01 '18 at 06:01
  • Can you post a [mcve] of this code-gen on http://godbolt.org/, with one of the MSVC versions it has? (Use `/Ox` for full optimization.) – Peter Cordes Jul 01 '18 at 06:05
  • @PeterCordes _Can the compiler prove that the `src` and `dst` don't overlap?_ Does it have to, with the code as posted? – Paul Sanders Jul 01 '18 at 06:10
  • @PaulSanders: If it for some reason doesn't recognize it as `memmove` or have a specific `rep movs` peephole for this pattern, then yes, it does have to if it wants to auto-vectorize with 16-byte loads/stores. That's basically the other main option for doing something better than a byte loop. If the compiler does detect memcpy but not memmove patterns, it would also need alias analysis. Or whatever other internal implementation detail. Compilers are complex pieces of machinery, and that's def. the first place I'd start poking at it to see if I could change the asm output. – Peter Cordes Jul 01 '18 at 06:15
  • @PeterCordes (just for the record) OK, so that probably explains why it didn't vectorise the loop, but not why it avoided using `rep movsb`, so we still have something of a mystery there. – Paul Sanders Jul 01 '18 at 06:52
  • @PaulSanders: Yes, we do have a mystery. Maybe it doesn't look for `memcpy`-like patterns, and instead relied on auto-vectorization to create vector copy loops? gcc's memset pattern recognizer only works on 1-byte patterns; it leaves 4-byte patterns for the auto-vectorizer to handle. But I think gcc recognizes memcpy-like patterns with any width of object. But still it's plausible that MSVC assumes vectorization will take care of copies in general that are safe to speed up. If I wasn't lazy, I'd cook up test cases with/without aliasing. – Peter Cordes Jul 01 '18 at 07:03
  • I'm curious why you don't just use `memcpy(pMXB + 1, pDPE, ncbSzBuffDataUsed);`? The compiler implements exactly what you write in the source code. Want a more efficient binary? Write more efficient source. – RbMm Jul 01 '18 at 08:01
  • @PeterCordes: Your intuition was right: if I add `__restrict`, then MSVC uses `memcpy`: https://godbolt.org/g/PUfErC. – geza Jul 01 '18 at 08:45
  • Everyone, I apologize for the delay. I was running my tests and didn't see this thread. Wow, this seemingly simple subject of moving a byte array is surprisingly complex! Thanks for all the links. I'll have to go through them. @RbMm you might be right. I will probably have to stick with `memmove` (can't use `memcpy` as the source overlaps with dest. Although source always preceded destination and `rep movsb` works fine, still have to follow the documentation for `memcpy`.) I also just stepped into `memcpy` and wow, it's a very complex branch of loops that it implements internally. – c00000fd Jul 01 '18 at 09:36
  • @c00000fd I guess I should have picked up on this before, but I imagine Microsoft have gone to much greater lengths to optimise `memmove` (as they have for `memcpy`, whether the compiler decides to inline it or the library function itself is called) than they probably ever have or would for an explicit loop like you coded here. Maybe that's all there is to this. Like you, I have looked inside `memcpy` (but not `memmove`) and they have indeed gone the extra mile there. – Paul Sanders Jul 01 '18 at 16:20

2 Answers


Edit: First up, there's an intrinsic for rep movsb which Peter Cordes tells us would be much faster here and I believe him (I guess I already did). If you want to force the compiler to do things this way, see: __movsb(): https://learn.microsoft.com/en-us/cpp/intrinsics/movsb.
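
A minimal sketch of what using the intrinsic looks like (the wrapper function and its names are just for illustration; `__movsb` comes from `<intrin.h>`):

#include <intrin.h>   // __movsb
#include <cstddef>    // std::size_t

// Emits a single REP MOVSB for 'count' bytes, bypassing the compiler's
// choice of copy loop entirely.
void copy_with_rep_movsb(unsigned char* pDst, const unsigned char* pSrc, std::size_t count)
{
    __movsb(pDst, pSrc, count);
}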

As to why the compiler didn't do this for you, in the absence of any other ideas, the answer might be register pressure. To use rep movsb, the compiler would have to:

  • set up rsi (= source address)
  • set up rdi (= destination address)
  • set up rcx (= count)
  • issue the rep movsb

So it would have used up the three registers mandated by the rep movsb instruction, and it may prefer not to do that. Specifically, rsi and rdi are expected to be preserved across a function call in the Windows x64 calling convention, so if the compiler can get away with using them in the body of any particular function it will, and (on initial entry to the method, at least) rcx holds the this pointer.

Also, with the code that we see the compiler has generated there, the r10 and rcx registers might already contain the requisite source and destination addresses (we can't see that from your example), which would be handy for the compiler if so.

In practice, you will probably see the compiler make different choices in different situations. The type of optimisation requested (/O1 - optimise for size, vs /O2 - optimise for speed) will likely also affect this.

More on the x64 register passing convention here, and on the x64 ABI generally here.


Edit 2 (again inspired by Peter's comments):

The compiler probably decided not to vectorise the loop because it doesn't know if the pointers are aligned or might overlap. Without seeing more of the code, we can't be sure. But that's not strictly relevant to my answer, given what the OP actually asked about.
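
A minimal sketch of how the aliasing question changes the code-gen, assuming a standalone version of the OP's copy (per geza's comment under the question, MSVC turns this into a `memcpy` call once the pointers are `__restrict`-qualified):

#include <cstddef>   // std::size_t

// Hypothetical free function with the same shape as the OP's loop.
// The __restrict qualifiers promise the compiler that the two buffers don't
// alias -- the information it lacked in the original snippet.
void copy_bytes(unsigned char* __restrict pDst,
                const unsigned char* __restrict pSrc,
                int ncbSzBuffDataUsed)
{
    for (std::size_t i = 0; i < (std::size_t)ncbSzBuffDataUsed; i++)
        pDst[i] = pSrc[i];
}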

Paul Sanders
  • Also `rep movsb` works byte by byte (at least conceptually), and moving data word by word in a loop might be faster – Basile Starynkevitch Jul 01 '18 at 05:50
  • Hah, thanks. I didn't know that they had an intrinsic for that. Cool! Let me play with it. Although it's not exactly the `rep _movsb` combo, is it? I guess I need to compile it and see what happens. :) Now I'm also wondering if I should unwind this loop into two: one using `rep movsq` and then `rep movsb` for the remainder. – c00000fd Jul 01 '18 at 05:50
  • @c00000fd: in general, leave such micro-optimizations to your compiler (but don't forget to enable optimizations when compiling). Compilers are better than humans on these. – Basile Starynkevitch Jul 01 '18 at 05:52
  • @Basile Not here, it seems. Unrolling the loop manually might prove a big win here, speed-wise (but choosing the right compiler switches might do it for you). – Paul Sanders Jul 01 '18 at 05:53
  • Then, it could be time to switch to a better compiler (or just fine-tune the many optimization flags your compiler could provide). Did you try [GCC](http://gcc.gnu.org/) or [Clang](http://clang.llvm.org/) or [icc](https://software.intel.com/en-us/c-compilers)? – Basile Starynkevitch Jul 01 '18 at 05:53
  • @Basile Oh don't start :) – Paul Sanders Jul 01 '18 at 05:54
  • Register pressure is not really plausible. Any decent compiler knows that it's worth saving/restoring a register or two if it enables a big speedup inside a loop. Compiling loops efficiently is job #1 for compilers. (And yes I'm including MSVC here; not gimping loops to save a couple prologue/epilogue instructions is a pretty low bar for compilers.) Almost certainly something stopped MSVC from auto-vectorizing or recognizing this as a memcpy / memmove. (Inserting a call to memmove would be a good optimization here if the count is normally large). – Peter Cordes Jul 01 '18 at 06:07
  • @c00000fd: Intel's optimization manual has a whole section on implementing `memcpy` with either a vector loop vs. `rep movsb`, and alignment considerations. If you only care about CPUs since IvyBridge, then just use `rep movsb` if you insist on using `rep movs` at all, because the ERMSB feature makes it at least as good as `rep movsq`. But make sure your pointers are both aligned, or else avoid `rep movs`. **See performance links in https://stackoverflow.com/tags/x86/info.** – Peter Cordes Jul 01 '18 at 06:11
  • @BasileStarynkevitch: Well if not actual `rep movsb`, then recognizing memcpy patterns and doing something less terrible than a byte loop is a useful compiler feature. Inlining `rep movsb` would be better than a byte loop for most CPUs except if the average copy size was under maybe 15 bytes. I'd assume that MSVC knows that (i.e. that its cost model reflects that), but of course `rep movsb` has performance potholes for misalignment, and isn't always the *best* choice. So it may not have an optimization that actually inlines any kind of `rep movs`. – Peter Cordes Jul 01 '18 at 06:20
  • @PeterCordes It's entirely plausible, in the context of the OP's question. Why the compiler decided not to vectorise the loop is a different issue, please don't downvote on just that basis. – Paul Sanders Jul 01 '18 at 06:21
  • I look at quite a lot of compiler output. (mostly from gcc/clang, but some MSVC). I don't think it's plausible that register pressure could prevent optimization of this loop into `rep movsb`, if the compiler *ever* inlines `rep movsb`. Saving/restoring call-preserved RSI/RDI is very cheap, and many functions already do that. The loop uses 3 regs already, and even if total register pressure was an issue, MSVC is smart enough to spill / reload other stuff to make room for optimizing. – Peter Cordes Jul 01 '18 at 06:29
  • Register pressure is a very weak feedback signal for the optimizer; you don't tend to see different code-gen in functions that already needed to save/restore more regs for something else. It's more like the optimizer chooses what's optimal locally, and then register allocation for the whole function figures out how to make that happen for the whole function as cheaply as possible. This can involve spill/reload, but typically not re-running optimization passes to make different choices based on the register allocator trying to save work. – Peter Cordes Jul 01 '18 at 06:31
  • @PeterCordes Question is, I guess, is that loop significantly more expensive than `rep movsb`, and I imagine the answer is probably yes so I'm with you there. So, why _did_ MSVC do this? And yes, I'm sure the compiler knows how to generate `rep movsb` :) – Paul Sanders Jul 01 '18 at 06:31
  • IDK, that's why I upvoted the question and downvoted this answer. :/ I'd be happy to remove my downvote if you move your register-allocation guess to an "implausible guess" section of your answer and leave the useful link to the intrinsic for `rep movs`. – Peter Cordes Jul 01 '18 at 06:33
  • @PeterCordes OK, how about if I change 'probably' to 'may be', and add something along the lines of 'in the absence of any other plausible explanation'? And I'll review the edit at the end, that might need a bit of touching up as well. – Paul Sanders Jul 01 '18 at 06:35
  • BTW yes, for medium to large copies `rep movsb` will be on the order of 8 to 32 times faster than a byte loop, depending on (mis)alignment, which microarchitecture, and which level of cache you bottleneck on. The loop should run comfortably at 1 byte per cycle on mainstream Intel CPUs, and Ryzen. But even slower than that on Bulldozer because sub/jne can't macro-fuse; they should have used `cmp`/`jne` against an end-pointer. – Peter Cordes Jul 01 '18 at 06:38
  • I made an edit to show what *I* think would be a reasonable edit. You'll want to make your own edit on top of that, I assume, but this version of the answer doesn't need a downvote so I removed mine. – Peter Cordes Jul 01 '18 at 06:44
  • OK, thanks, I tidied up a bit, that's all. Good discussion, enjoyed it. – Paul Sanders Jul 01 '18 at 06:49
  • BTW, I had been thinking that potential overlap made it a `memmove`. But that's not true: with `dst = src+1`, it turns into `memset(dst, src[0])`. Anyway, see my updated answer on [Does any of current C++ compilers ever emit "rep movsb/w/d"?](https://stackoverflow.com/q/51121856). It might be interesting for a compiler to consider inlining `rep movsb` in case of possible overlap, especially if compiling with `-Os` (optimize for size), on a CPU with ERMSB. – Peter Cordes Jul 02 '18 at 01:31
  • @PeterCordes Very interesting, thank you. I can see you have put a lot of work into this. – Paul Sanders Jul 02 '18 at 07:04
  • I could talk all day about what compilers could / should do. One of my favourite things. :) I might file a missed-optimization gcc bug report about this if I get around to it, but I doubt it would be implemented. IDK, maybe if IceLake's expected "fast short-rep" feature makes `rep movsb` not suck with very small counts it would be a lot more appealing. – Peter Cordes Jul 02 '18 at 07:38
  • @Peter I doubt the gcc guys will appreciate you submitting a bug against MSVC :) I looked up ERMSB briefly, well worth knowing about (new to me), thanks for bringing it up. Be seeing ya, we make a decent team. – Paul Sanders Jul 02 '18 at 08:58
  • `gcc -Os` also emits a dumb byte loop (https://godbolt.org/g/hUqgX8). So does `gcc -O2` (without auto-vectorization). But `gcc -O3` does auto-vectorize after an overlap check. (Even with `-mno-sse`, it vectorizes using a full integer register to copy 8 bytes at once. But there is a fallback byte loop...) – Peter Cordes Jul 02 '18 at 09:17

This is not really an answer, and I can't jam it all into a comment. I just want to share my additional findings. (This is probably relevant to the Visual Studio compilers only.)

What also makes a difference is how you structure your loops. For instance:

Assuming the following struct definitions:

#define PCALLBACK ULONG64

#pragma pack(push)
#pragma pack(1)
typedef struct {
    ULONG64 ui0;

    USHORT w0;
    USHORT w1;

    //Followed by:
    //  PCALLBACK[] 'array' - variable size array
}DPE;
#pragma pack(pop)

(1) The regular way to structure a for loop. The following code chunk is called somewhere in the middle of a larger serialization function:

PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
for(size_t i = 0; i <  (size_t)info.wNumCallbackFuncs; i++)
{
    pDstClbks[i] = info.callbackFuncs[i];
}

As was mentioned in the other answer on this page, it is clear that the compiler was starved of registers, producing the following monstrosity (see how it reuses rax for the loop end limit, and how the `movzx eax,word ptr [r13]` instruction could clearly have been hoisted out of the loop):

    PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF7029327CF 48 83 C1 30          add         rcx,30h  
    for(size_t i = 0; i <  (size_t)info.wNumCallbackFuncs; i++)
00007FF7029327D3 66 41 3B 5D 00       cmp         bx,word ptr [r13]  
00007FF7029327D8 73 1F                jae         07FF7029327F9h
00007FF7029327DA 4C 8B C1             mov         r8,rcx  
00007FF7029327DD 4C 2B F1             sub         r14,rcx  
    {
        pDstClbks[i] = info.callbackFuncs[i];
00007FF7029327E0 4B 8B 44 06 08       mov         rax,qword ptr [r14+r8+8]  
00007FF7029327E5 48 FF C3             inc         rbx  
00007FF7029327E8 49 89 00             mov         qword ptr [r8],rax  
00007FF7029327EB 4D 8D 40 08          lea         r8,[r8+8]  
00007FF7029327EF 41 0F B7 45 00       movzx       eax,word ptr [r13]  
00007FF7029327F4 48 3B D8             cmp         rbx,rax  
00007FF7029327F7 72 E7                jb          07FF7029327E0h
    }
00007FF7029327F9 45 0F B7 C7          movzx       r8d,r15w  

(2) So if I rewrite it into a less familiar C pattern:

PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
for(PCALLBACK* pScrClbks = info.callbackFuncs; 
    pDstClbks < pEndDstClbks; 
    pScrClbks++, pDstClbks++)
{
    *pDstClbks = *pScrClbks;
}

this produces more sensible machine code (on the same compiler, in the same function, in the same project):

    PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
00007FF71D7E27C2 48 83 C1 30          add         rcx,30h  
    PCALLBACK* pEndDstClbks = pDstClbks + (size_t)info.wNumCallbackFuncs;
00007FF71D7E27C6 0F B7 86 88 00 00 00 movzx       eax,word ptr [rsi+88h]  
00007FF71D7E27CD 48 8D 14 C1          lea         rdx,[rcx+rax*8]  
    for(PCALLBACK* pScrClbks = info.callbackFuncs; pDstClbks < pEndDstClbks; pScrClbks++, pDstClbks++)
00007FF71D7E27D1 48 3B CA             cmp         rcx,rdx  
00007FF71D7E27D4 76 14                jbe         07FF71D7E27EAh
00007FF71D7E27D6 48 2B F1             sub         rsi,rcx  
    {
        *pDstClbks = *pScrClbks;
00007FF71D7E27D9 48 8B 44 0E 08       mov         rax,qword ptr [rsi+rcx+8]  
00007FF71D7E27DE 48 89 01             mov         qword ptr [rcx],rax  
00007FF71D7E27E1 48 83 C1 08          add         rcx,8  
00007FF71D7E27E5 48 3B CA             cmp         rcx,rdx  
00007FF71D7E27E8 77 EF                jb          07FF71D7E27D9h
    }

00007FF71D7E27EA 45 0F B7 C6          movzx       r8d,r14w  
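
(For completeness, a sketch of the same copy written as a single library call, which sidesteps the loop-shape question entirely. It assumes the `DPE`/`PCALLBACK` definitions above plus the usual Windows typedefs, and mirrors the `info` fields used in the loops; the wrapper function itself is hypothetical.)

#include <cstring>   // std::memcpy

// Hypothetical reduced version of the serialization step above: a single call,
// letting the compiler/CRT pick the copy strategy. No overlap here, since the
// destination starts past the end of the DPE header.
void copy_callbacks(DPE* pDPE, const PCALLBACK* callbackFuncs, USHORT wNumCallbackFuncs)
{
    PCALLBACK* pDstClbks = (PCALLBACK*)(pDPE + 1);
    std::memcpy(pDstClbks, callbackFuncs,
                (std::size_t)wNumCallbackFuncs * sizeof(PCALLBACK));
}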
c00000fd