
I just noticed that a piece of my code exhibits different performance when copying memory. A test showed that memory-copy performance degrades if the address of the destination buffer is greater than the address of the source. It sounds ridiculous, but the following code shows the difference (Delphi):

  const
    MEM_CHUNK = 50 * 1024 * 1024;
    ROUNDS_COUNT = 100;
  var                                  // declarations assumed for completeness
    LpSrc, LpDest: Pointer;
    LTick1, LTick2: Int64;
    i: Integer;

  // allocate two 50 MB buffers
  LpSrc  := VirtualAlloc(nil, MEM_CHUNK, MEM_COMMIT, PAGE_READWRITE);
  LpDest := VirtualAlloc(nil, MEM_CHUNK, MEM_COMMIT, PAGE_READWRITE);

  // src -> dest
  QueryPerformanceCounter(LTick1);
  for i := 0 to ROUNDS_COUNT - 1 do
    CopyMemory(LpDest, LpSrc, MEM_CHUNK);
  QueryPerformanceCounter(LTick2);
  // show timings

  // dest -> src
  QueryPerformanceCounter(LTick1);
  for i := 0 to ROUNDS_COUNT - 1 do
    CopyMemory(LpSrc, LpDest, MEM_CHUNK);
  QueryPerformanceCounter(LTick2);
  // show timings

Here CopyMemory is based on MOVSD. The results:

  Starting Memory Bandwidth Test...
  LpSrc  0x06FC0000
  LpDest 0x0A1C0000
  src->dest Transfer: 5242880000 bytes in 1,188 sec @4,110 GB/s.
  dest->src Transfer: 5242880000 bytes in 0,805 sec @6,066 GB/s.
  src->dest Transfer: 5242880000 bytes in 1,142 sec @4,275 GB/s.
  dest->src Transfer: 5242880000 bytes in 0,832 sec @5,871 GB/s.

Tried on two systems; the results are consistent no matter how many times the test is repeated.

I have never seen anything like that and was unable to find anything about it by searching. Is this a known behavior? Is it just another cache-related peculiarity?

Update:

Here are the final results with page-aligned buffers and forward direction of MOVSD (DF=0):

  Starting Memory Bandwidth Test...
  LpSrc  0x06F70000
  LpDest 0x0A170000
  src->dest Transfer: 5242880000 bytes in 0,781 sec @6,250 GB/s.
  dest->src Transfer: 5242880000 bytes in 0,731 sec @6,676 GB/s.
  src->dest Transfer: 5242880000 bytes in 0,750 sec @6,510 GB/s.
  dest->src Transfer: 5242880000 bytes in 0,735 sec @6,640 GB/s.
  src->dest Transfer: 5242880000 bytes in 0,742 sec @6,585 GB/s.
  dest->src Transfer: 5242880000 bytes in 0,750 sec @6,515 GB/s.

... and so on.

Here the transfer rates are consistent across runs and directions.
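
For illustration, a forward-only (DF=0) MOVSD-based copy in 32-bit Delphi might look roughly like the sketch below. This is not the actual CopyMemory body used in the test (which is not shown), just an example of the approach:

  procedure CopyForwardMovsd(Dst, Src: Pointer; CountBytes: Cardinal);
  asm
    // 32-bit "register" convention: Dst in EAX, Src in EDX, CountBytes in ECX
    push  esi
    push  edi
    mov   edi, eax        // EDI = destination
    mov   esi, edx        // ESI = source
    mov   eax, ecx        // remember byte count for the tail
    shr   ecx, 2          // dword count
    cld                   // DF = 0: copy with ascending addresses (the fast case)
    rep   movsd
    mov   ecx, eax
    and   ecx, 3          // copy the 0..3 leftover bytes
    rep   movsb
    pop   edi
    pop   esi
  end;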

  • Do both buffers have the same alignment? Could 4k aliasing be a problem? Maybe in one direction the dst is at a slightly lower offset within a page than the src, so memory disambiguation can see that the loads couldn't be reloading the store. But the other way, it might falsely detect aliasing and reduce bandwidth. Have your code print the addresses. Also, what CPU hardware did you test on? Haswell? Skylake? Atom? Ryzen? K10? – Peter Cordes Jul 21 '19 at 21:53
  • What happens if you reverse them? Or add a Sleep between them? – David Wohlferd Jul 21 '19 at 22:06
  • Thank you for your suggestions. Changed allocation to VirtualAlloc for alignment. The output: – user4859735 Jul 21 '19 at 22:20
  • CPUs tested are SandyBridge and Clovertown – user4859735 Jul 21 '19 at 22:53
  • @DavidWohlferd: in the first version of this question, the first time was fast and the 2nd was slow, which would tend to rule out warm-up effects and page-faults after allocation. But here yeah that's a possibility. A warm-up loop to pay the cost of the page-faults first would be a good idea. (Preferably by writing both buffers, although read-only warm-up of the src could potentially give you more L1d or L3 hits if Windows maps all src virtual pages to the same physical zero-page. Like in [this recent answer about iterating C++ std::array](https://stackoverflow.com/a/57130924/224132).) – Peter Cordes Jul 21 '19 at 22:54
  • ERMSB was new in IvyBridge, so your SnB doesn't have it. Still, [Enhanced REP MOVSB for memcpy](//stackoverflow.com/q/43343231) has some relevant info about details of `rep movsd` copying. It can still use an RFO-avoiding protocol for large copies before ERMSB, it's just not weakly-ordered. I think there are some links in there to Intel manuals and others about alignment effects on `rep movsd` performance. But you've ruled that out by using 2 page-aligned buffers. – Peter Cordes Jul 21 '19 at 22:59
  • Thank you for your info, Peter. I will study that. Actually the code is part of an in-memory DB application, and the real testing code does warm-ups. I have noticed other cache-related artifacts on my systems earlier. Could it be due to the system switching cores while running the code and getting into some mess with the core caches? – user4859735 Jul 21 '19 at 23:15
  • @BeeOnRope: `rep movsd` is only fast with `DF=0` (ascending addresses). I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes with `rep movsb` runs in 174M cycles with `cld`, vs. 4161M cycles with `std`, for page-aligned inputs or page-1 inputs (I tried both for downward, both were terrible). uops executed also confirms that it's spending many more uops when copying backwards. Your suggestion to copy backward is only viable if `rep movsd` is replaced with a SIMD loop. – Peter Cordes Jul 22 '19 at 01:06
  • @user4859735: um yes, a thread migrating to another CPU core will mean lots of cache misses in private L1i/L1d and L2 caches. You're going to have to be way more specific. Also, if you do actually warm up both arrays before your timed loops, that's a *very* important missing piece of your question; update it to make it a proper [mcve]. Especially since we don't know your `ROUNDS_COUNT` vs. `MEM_SIZE`, large buffers that you only traverse a couple time could pay a huge price in page faults. It matters a huge amount that those are outside the timed loops. – Peter Cordes Jul 22 '19 at 01:09
  • [`VirtualAlloc`](https://learn.microsoft.com/en-us/windows/win32/api/memoryapi/nf-memoryapi-virtualalloc) does not allocate physical pages until they are accessed. What happens if you swap the order of the two loops (i.e., do the src->dest copy first)? Or walk thru the two allocated areas (touching every page) before the first loop? – 1201ProgramAlarm Jul 22 '19 at 03:51
  • I think I'm going to accept the answer. The praise goes to Peter Cordes - for best performance with MOVSD one should use 0x1000 page-aligned buffers AND DF=0 (cleared). I must apologize - I did not disclose the CopyMemory body - and it indeed used copying backwards (df=1) when avoiding overlaps. That is because I previously toggled it to use only forward copying (df=0), and it seems like copying forward with unaligned buffers incurs the same penalty as copying backwards. Though I haven't tried it in production code yet, the said method of copying forward with aligned buffers works very consistently. – user4859735 Jul 22 '19 at 07:06
  • There is no overlap here: two distinct buffers. My guess is that @1201ProgramAlarm got it right: RAM pages are actually allocated when first accessed, so the first `CopyMemory` is slightly slower than the second - whatever the order of copy. BTW copying 5GB buffers doesn't make any sense IMHO: I would never like to see such a pattern in any production code. – Arnaud Bouchez Jul 22 '19 at 09:25

1 Answer


Normally fast-strings or ERMSB microcode makes rep movsb/w/d/q and rep stosb/w/d/q fast for large counts, copying in 16-, 32-, or maybe even 64-byte chunks, and possibly with an RFO-avoiding protocol for the stores. (The other string instructions, repe/repne scas/cmps, are always slow.)

Some conditions of the inputs can interfere with that best-case, notably having DF=1 (backward) instead of the normal DF=0.

rep movsd performance can depend on the alignment of src and dst, including their relative misalignment. Apparently having both pointers equal to 32*n + the same offset is not too bad, so most of the copy can be done after reaching an alignment boundary. (Absolute misalignment, but with the pointers aligned relative to each other, i.e. dst - src is a multiple of 32 or 64 bytes.)
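
For example, relative alignment can be checked like this (an illustrative Delphi snippet, not code from the question; 64 bytes is assumed as the interesting granularity, with LpSrc/LpDest as in the question's test):

  // The buffers are "relatively aligned" if their distance is a multiple of
  // the assumed 64-byte cache-line size, even if each pointer is misaligned.
  if (NativeUInt(LpDest) - NativeUInt(LpSrc)) mod 64 = 0 then
    Writeln('src and dst are relatively aligned (dst - src is a multiple of 64)')
  else
    Writeln('relative misalignment: ',
            (NativeUInt(LpDest) - NativeUInt(LpSrc)) mod 64, ' bytes');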

Performance does not depend on src > dst or src < dst per se. If the pointers are within 16 or 32 bytes of overlapping, that can also force a fallback to copying 1 element at a time.

Intel's optimization manual has a section about memcpy implementations and comparing rep movs with well-optimized SIMD loops. Startup overhead is one of the biggest downsides for rep movs, but so are misalignments that it doesn't handle well. (IceLake's "fast short rep" feature presumably addresses that.)

"I did not disclose the CopyMemory body - and it indeed used copying backwards (df=1) when avoiding overlaps."

Yup, there's your problem. Only copy backwards if there would be actual overlap you need to avoid, not just based on which address is higher. And then do it with SIMD vectors, not rep movsd.
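
A minimal sketch of that decision in Delphi (illustrative only; CopyBackwardSIMD and CopyForwardMovsd are hypothetical placeholder routines, not part of the RTL or the original post):

  procedure SmartCopy(Dst, Src: Pointer; Count: NativeUInt);
  begin
    // A forward copy only corrupts data when the destination starts strictly
    // inside the source region, i.e. Src < Dst < Src + Count.
    if (NativeUInt(Dst) > NativeUInt(Src)) and
       (NativeUInt(Dst) < NativeUInt(Src) + Count) then
      CopyBackwardSIMD(Dst, Src, Count)    // placeholder: descending SIMD loop
    else
      CopyForwardMovsd(Dst, Src, Count);   // placeholder: forward REP MOVSD, DF=0
  end;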


rep movsd is only fast with DF=0 (ascending addresses), at least on Intel CPUs. I just checked on Skylake: 1000000 reps of copying 4096 non-overlapping bytes from page-aligned buffers with rep movsb runs in:

  • 174M cycles with cld (DF=0, forwards). About 42 ms at about 4.1 GHz, or about 90 GiB/s L1d read+write bandwidth achieved. About 23 bytes per cycle, so the startup overhead of each rep movsb seems to be hurting us. An AVX copy loop should achieve close to 32 bytes per cycle in this easy case of pure L1d cache hits, even with a branch mispredict on loop exit from an inner loop.
  • 4161M cycles with std (DF=1, backwards). About 1010 ms at about 4.1 GHz, or about 3.77 GiB/s read+write. About 0.98 bytes per cycle, consistent with rep movsb being totally un-optimized. (1 count per cycle, so rep movsd would be about 4x that bandwidth with cache hits. See the arithmetic below.)
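
For reference, the arithmetic behind those figures (taking the ~4.1 GHz clock at face value):

  forward  (cld): 4096 B x 1,000,000 reps = 4.096e9 B copied
                  4.096e9 B / 174e6 cycles  ~ 23.5 B/cycle
                  23.5 B/cycle x 4.1 GHz    ~ 96 GB/s ~ 90 GiB/s copied (each byte read + written)
  backward (std): 4.096e9 B / 4161e6 cycles ~ 0.98 B/cycle ~ 4.0 GB/s ~ 3.77 GiB/s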

The uops_executed perf counter also confirms that it's spending many more uops when copying backwards. (This was inside a dec ebp / jnz loop in long mode under Linux, the same test loop as in Can x86's MOV really be "free"? Why can't I reproduce this at all?, built with NASM, with the buffers in the BSS. The loop did cld or std / 2x lea / mov ecx, 4096 / rep movsb. Hoisting cld out of the loop didn't make much difference.)

You were using rep movsd, which copies 4 bytes at a time, so for backwards copying we can expect 4 bytes per cycle if the data hits in cache. And you were probably using large buffers, so cache misses bottleneck the forward direction to not much faster than backwards. But the extra uops from the backward copy hurt memory parallelism: fewer cache lines are touched by the load uops that fit in the out-of-order window. Also, some prefetchers in Intel CPUs work less well going backwards. The L2 streamer works in either direction, but I think L1d prefetch only goes forward.

Related: Enhanced REP MOVSB for memcpy. Your Sandybridge is too old for ERMSB, but Fast Strings for rep movs/rep stos has existed since the original P6. Your Clovertown Xeon from ~2006 is pretty much ancient by today's standards (Conroe/Merom microarchitecture). Those CPUs might be so old that a single core of a Xeon can saturate the meagre memory bandwidth, unlike today's many-core Xeons.


My buffers were page-aligned. For the downward direction, I tried having the initial RSI/RDI point to the last byte of a page, so the initial pointers were not aligned but the total region to be copied was. I also tried lea rdi, [buf+4096], so the starting pointers were page-aligned and [buf+0] didn't get written. Neither made the backwards copy any faster; rep movs is just garbage with DF=1; use SIMD vectors if you need to copy backwards.

Usually a SIMD vector loop can be at least as fast as rep movs, if you can use vectors as wide as the machine supports. That means having SSE, AVX, and AVX512 versions... In portable code without runtime dispatching to a memcpy implementation tuned for the specific CPU, rep movsd is often pretty good, and should be even better on future CPUs like IceLake.
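
In Delphi, that kind of runtime dispatch is often done once at startup through a procedural variable. A minimal sketch, where HasAVX, CopyAVX and CopySSE are hypothetical placeholders (not RTL routines):

  var
    CopyProc: procedure(Dst, Src: Pointer; Count: NativeUInt);

  procedure InitCopyProc;
  begin
    // Pick the widest implementation the CPU supports, once, at startup.
    if HasAVX then               // hypothetical CPU-feature check
      CopyProc := CopyAVX        // hypothetical 32-byte-vector loop
    else
      CopyProc := CopySSE;       // hypothetical 16-byte-vector loop
  end;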


You don't actually need page alignment for rep movs to be fast. IIRC, a 32-byte-aligned source and destination is sufficient. But 4k aliasing could also be a problem: if dst & 4095 is slightly higher than src & 4095, the load uops might internally have to wait some extra cycles for the store uops, because the fast-path mechanism for detecting when a load is reloading a recent store only looks at page-offset bits.
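
A quick way to spot that situation (an illustrative Delphi check, not code from the post; the 64-byte "slightly higher" window is an assumption, not a documented threshold):

  // Compare the page offsets (low 12 bits) of the two pointers; those are
  // the bits the load/store fast-path aliasing check looks at.
  if ((NativeUInt(LpDest) and 4095) > (NativeUInt(LpSrc) and 4095)) and
     ((NativeUInt(LpDest) and 4095) - (NativeUInt(LpSrc) and 4095) < 64) then
    Writeln('dst page offset slightly above src: possible 4k-aliasing stalls');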

Page alignment is one way to make sure you get the optimal case for rep movs, though.

Normally you get the best performance from a SIMD loop, but only if you use SIMD vectors as wide as the machine supports (like AVX, or maybe even AVX512). And you should choose NT stores vs. normal stores depending on the hardware and the surrounding code.

Peter Cordes
  • Other notes: I tried rep movsd between page-aligned and just 32-bit aligned buffers in various combinations. On my SandyBridge al->al is the best, un->un is second best, al->un and un->al are the worst(!). On my 'pretty old' Xeons al->al, un->un, un->al show no difference and are best, and al->un is twice as bad. And the SIMD implementation that I have is much worse than any rep movsd, even backwards. – user4859735 Jul 23 '19 at 04:42
  • @user4859735: When you do `u->u`, is it the same relative misalignment for both src and dst? So after an unaligned startup, it could reach an alignment boundary and get the al->al case. Also, notice that I said 32 *byte* alignment (AVX width), not 32 *bit*. Sandybridge probably only cares about 16-byte, unlike Haswell and later. – Peter Cordes Jul 23 '19 at 13:20
  • @user4859735: If your SIMD implementation is slower, you're probably doing it wrong. e.g. `movups` is slow on Core 2 even if the address is aligned at runtime. Core 2 is a challenge, but Sandybridge should be efficient with appropriate loop unrolling and handling of relative misalignments. (I think the usual advice is to prefer an aligned destination, rather than aligned source, if you can't have both because of different relative misalignment.) – Peter Cordes Jul 23 '19 at 13:23
  • Thank you Peter. Yes, I noticed '32 byte'). I think for the purposes of the original question it was good to know about the DF flag influence and (yet another time) about 'align as much as you can', i.e. something that affects any architecture. I personally love to optimize to the full extent, but tuning esoteric opcodes for a particular Bridge/Core/Merom... It is good to know though. – user4859735 Jul 23 '19 at 14:41
  • @user4859735: you don't always want to align as much as you can; sometimes having everything page-aligned leads to more conflict-misses in L1d cache because `a[i]`, `b[i]`, `c[i]`, etc. all map to the same set in a set-associative cache. Also it wastes more space with padding, so less useful memory is covered by a given number of TLB entries. – Peter Cordes Jul 23 '19 at 14:48
  • By the way, I guess the answer to the topic should be 'Yes, MOVSD performance depends on arguments... in a certain way at least.') – user4859735 Jul 23 '19 at 14:51
  • Copying backwards is also slower at least since Ivy Bridge. I think the reason for the extra uops in the backwards case is because the memory access uops used are of the same width as that specified by the string instruction, in contrast to the forwards direction where the widest uops are used. This implementation improves performance of the common case (DF=0) and reduces the microcode storage overhead of the rare case (DF=1) by using uops with the native size of the instruction. – Hadi Brais Jul 23 '19 at 17:21
  • Right, the alignment of the destination buffer matters much more than that of the source. BTW, the L1 IP prefetcher can detect an access pattern with negative strides and prefetch backwards accordingly. But the DCU prefetcher can't. – Hadi Brais Jul 23 '19 at 17:21
  • @HadiBrais: yes, that's what I mean by "fast strings" giving up and doing 1 element at a time. The basic mechanism of getting the count from the back-end and streaming only load+store uops still applies, AFAIK. The same thing happens when `dst` is ahead of `src` by less than 16 or 32 bytes, for forward copies. (Otherwise fast strings / ERMSB still works, even when the separation is only 1 vector width, I think. I tested a while ago and it was definitely fast even with true overlap of the total region, including pretty close separation.) – Peter Cordes Jul 23 '19 at 17:29