
I am investigating performance hotspots in an application which spends 50% of its time in memmove(3). The application inserts millions of 4-byte integers into sorted arrays, and uses memmove to shift the data "to the right" in order to make space for the inserted value.
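For illustration, the insertion pattern looks roughly like this (a minimal sketch; `sorted_insert`, `arr` and `len` are made-up names, not the actual application code):

    #include <stdint.h>
    #include <string.h>

    /* Insert `value` into the sorted array `arr` of `len` elements; the
     * caller guarantees capacity for at least len + 1 elements. memmove
     * shifts the tail one slot "to the right" to make room. */
    static void sorted_insert(uint32_t *arr, size_t len, uint32_t value)
    {
        size_t pos = 0;
        while (pos < len && arr[pos] < value)
            pos++;
        memmove(&arr[pos + 1], &arr[pos], (len - pos) * sizeof(arr[0]));
        arr[pos] = value;
    }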

My expectation was that copying memory is extremely fast, and I was surprised that so much time is spent in memmove. But then I had the idea that memmove is slow because it has to handle overlapping regions, which forces it into a tight loop instead of copying large blocks of memory. I wrote a small microbenchmark to find out whether there was a performance difference between memcpy and memmove, expecting memcpy to win hands down.

I ran my benchmark on two machines (Core i5, Core i7) and saw that memmove is actually faster than memcpy; on the older Core i7 it was even nearly twice as fast! Now I am looking for explanations.

Here is my benchmark. It copies 100 MB with memcpy, and then moves about 100 MB with memmove; source and destination overlap. Various "distances" between source and destination are tried. Each test is run 10 times, and the average time is printed.

https://gist.github.com/cruppstahl/78a57cdf937bca3d062c
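In case the link rots, the core of the benchmark looks roughly like this (a condensed sketch based on the description above, not the exact gist code; `BUFFERSIZE` and `LOOPS` are assumed names):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUFFERSIZE (100 * 1024 * 1024)  /* ~100 MB */
    #define LOOPS 10

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        char *b1 = malloc(BUFFERSIZE);
        char *b2 = malloc(BUFFERSIZE);
        memset(b1, 1, BUFFERSIZE);  /* source is written; b2 is never touched */

        double t = seconds();
        for (int i = 0; i < LOOPS; i++)
            memcpy(b2, b1, BUFFERSIZE);
        printf("memcpy        %g\n", (seconds() - t) / LOOPS);

        for (int gap = 2; gap <= 128; gap *= 2) {
            t = seconds();
            for (int i = 0; i < LOOPS; i++)
                memmove(b1 + gap, b1, BUFFERSIZE - gap);  /* overlapping move */
            printf("memmove (%03d) %g\n", gap, (seconds() - t) / LOOPS);
        }
        free(b1);
        free(b2);
        return 0;
    }

(On older glibc you may need to link with -lrt for clock_gettime.)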

Here are the results on the Core i5 (Linux 3.5.0-54-generic #81~precise1-Ubuntu SMP x86_64 GNU/Linux, gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)). The number in brackets is the distance (gap size) between source and destination:

memcpy        0.0140074
memmove (002) 0.0106168
memmove (004) 0.01065
memmove (008) 0.0107917
memmove (016) 0.0107319
memmove (032) 0.0106724
memmove (064) 0.0106821
memmove (128) 0.0110633

memmove is implemented as SSE-optimized assembly code, copying from back to front. It uses software prefetch (prefetchnta) to pull the data into the cache, and copies 128 bytes at a time to XMM registers, then stores them at the destination.

(memcpy-ssse3-back.S, lines 1650 ff)

L(gobble_ll_loop):
    prefetchnta -0x1c0(%rsi)    # software prefetch of source and destination,
    prefetchnta -0x280(%rsi)    # with a non-temporal hint to reduce
    prefetchnta -0x1c0(%rdi)    # cache pollution
    prefetchnta -0x280(%rdi)
    sub $0x80, %rdx             # 128 bytes per iteration
    movdqu  -0x10(%rsi), %xmm1  # load 8 x 16 bytes (unaligned loads)
    movdqu  -0x20(%rsi), %xmm2
    movdqu  -0x30(%rsi), %xmm3
    movdqu  -0x40(%rsi), %xmm4
    movdqu  -0x50(%rsi), %xmm5
    movdqu  -0x60(%rsi), %xmm6
    movdqu  -0x70(%rsi), %xmm7
    movdqu  -0x80(%rsi), %xmm8
    movdqa  %xmm1, -0x10(%rdi)  # store them with 16-byte-aligned stores
    movdqa  %xmm2, -0x20(%rdi)
    movdqa  %xmm3, -0x30(%rdi)
    movdqa  %xmm4, -0x40(%rdi)
    movdqa  %xmm5, -0x50(%rdi)
    movdqa  %xmm6, -0x60(%rdi)
    movdqa  %xmm7, -0x70(%rdi)
    movdqa  %xmm8, -0x80(%rdi)
    lea -0x80(%rsi), %rsi       # walk both pointers backwards
    lea -0x80(%rdi), %rdi
    jae L(gobble_ll_loop)       # loop while bytes remain

Why is memmove faster than memcpy? I would expect memcpy to copy memory pages, which should be much faster than looping. In the worst case I would expect memcpy to be as fast as memmove.

PS: I know that I cannot replace memmove with memcpy in my code. I know that the code sample mixes C and C++. This question is really just for academic purposes.

UPDATE 1

I ran some variations of the tests, based on the various answers.

  1. When running memcpy twice, the second run is faster than the first one.
  2. When "touching" the destination buffer of memcpy (memset(b2, 0, BUFFERSIZE...)), the first run of memcpy is also faster.
  3. memcpy is still a little bit slower than memmove.

Here are the results:

memcpy        0.0118526
memcpy        0.0119105
memmove (002) 0.0108151
memmove (004) 0.0107122
memmove (008) 0.0107262
memmove (016) 0.0108555
memmove (032) 0.0107171
memmove (064) 0.0106437
memmove (128) 0.0106648

My conclusion: based on a comment from @Oliver Charlesworth, the operating system has to commit physical memory as soon as the memcpy destination buffer is accessed for the very first time (if someone knows how to prove this then please add an answer!). In addition, as @Mats Petersson said, memmove is more cache-friendly than memcpy.
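For anyone who wants to make the first-touch effect visible, here is a sketch (my own suggestion of an experiment, not a rigorous proof; buffer size follows the benchmark above): time a copy into a freshly malloc'd buffer against a copy into a buffer whose pages were already faulted in by memset. The page-fault counts can also be compared with perf stat -e page-faults.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUFFERSIZE (100 * 1024 * 1024)

    static double seconds(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        char *src  = malloc(BUFFERSIZE);
        char *cold = malloc(BUFFERSIZE);  /* pages not yet backed by physical memory */
        char *warm = malloc(BUFFERSIZE);
        memset(src, 1, BUFFERSIZE);
        memset(warm, 0, BUFFERSIZE);      /* first touch: kernel commits the pages now */

        double t = seconds();
        memcpy(cold, src, BUFFERSIZE);    /* pays the page faults during the copy */
        printf("memcpy to cold buffer %g\n", seconds() - t);

        t = seconds();
        memcpy(warm, src, BUFFERSIZE);    /* pages already committed */
        printf("memcpy to warm buffer %g\n", seconds() - t);

        free(src); free(cold); free(warm);
        return 0;
    }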

Thanks for all the great answers and comments!

cruppstahl
  • You looked at the memmove code; did you also look at the memcpy code? – Oliver Charlesworth Feb 20 '15 at 07:56
  • Have you tried moving the memcpy test to last? That makes it slightly faster in Visual Studio; otherwise memmove is slightly faster. – Retired Ninja Feb 20 '15 at 07:56
  • Not a duplicate; that one was looking at 10k of memory, which would easily fit in cache. – Oliver Charlesworth Feb 20 '15 at 08:02
  • _My expectation was that copying memory is extremely fast_ - only when the memory is in the L1 cache. When the data does not fit in the caches, your copying performance dwindles. – Maxim Egorushkin Feb 20 '15 at 08:19
  • BTW, you only copied one branch of `memmove`. This branch can't handle the move when the source overlaps the destination and the destination is at a lower address. – Maxim Egorushkin Feb 20 '15 at 08:23
  • I haven't had time to access a Linux machine, so I can't test this theory yet. But another possible explanation is *overcommitting*; your `memcpy` loop is the first time that the contents of `b2` are accessed, so the OS has to commit physical memory for it as it goes. – Oliver Charlesworth Feb 20 '15 at 08:26
  • @Maxim Egorushkin - the destination is at a higher address. I stepped through memmove with gdb and always ended up at the branch that I showed. – cruppstahl Feb 20 '15 at 08:26
  • @Oliver Charlesworth - that seems to be the case. I am running the memcpy test twice. The first time it's slow, the second time it's on par with memmove. Do you want to create a separate answer for this? – cruppstahl Feb 20 '15 at 08:30
  • PS: If this is a bottleneck I would reconsider the approach. How about putting the values into a list or tree structure (e.g. a binary tree) and then reading them into an array at the end? The nodes in such an approach would be an excellent candidate for pool allocation: they are only ever added, until the end when they're released en masse. That's particularly true if you know how many you will need at the start. The Boost libraries have a pool allocator. – Persixty Feb 20 '15 at 08:48
  • I'm wondering whether plain arrays are the right data structure if you find yourself shifting millions of 32-bit values all the time. Are you trying to remove elements from the beginning of the array or something? – Frerich Raabe Feb 20 '15 at 08:56
  • @Dan Allen, Frerich Raabe - the memmove is called when inserting values in a btree leaf. One solution to improve performance is organizing the leaf as a sequence of small blocks (which can be resized or moved around). But my question was not about improving performance, only about getting an explanation for the benchmark result. – cruppstahl Feb 20 '15 at 09:14
  • I'm curious why a question so heavily depending on the library implementation doesn't even mention *which* library it's referring to (and takes quite some time before it mentions hardware, OS, and compiler involved). The source link points to Embedded GLIBC v2.13 -- is that what you're actually using? – DevSolar Feb 20 '15 at 10:29
  • @cruppstahl: Interesting. I did 'PS' because I realized I wasn't answering your question - which is itself interesting - and it can be annoying when you ask a reasonable question and someone tells you to redesign the application. Do you mean organizing the leaf or the node (I realize B-tree terminology is not used consistently)? One meaning says you don't need to reorganize the leaves (if they are the data pages), only the nodes (if they are the index). However, there certainly are tuning options to reduce the amount of memory churn. If your code is short enough, post it on Code Review. – Persixty Feb 20 '15 at 11:57
  • @Dan Allen - 50% of my application's time is spent when the btree leaf calls memmove to insert a key. My btree is parameterized, and leaf nodes really are just a uint32_t[] array under the hood. I use memmove to insert the key in this array. Splitting this array into smaller chunks and organizing them in an index (one index per btree leaf) might avoid large memmoves. Feel free to send me a mail (chris @ crupp.de) if you want to see code or need more information. The database is hamsterdb (hamsterdb.com). – cruppstahl Feb 20 '15 at 12:44

4 Answers


Your memmove calls are shuffling memory along by 2 to 128 bytes, while your memcpy source and destination are completely different regions. That difference is presumably what accounts for the performance gap: if you copy to the same place, you'll see that memcpy may end up a smidge faster, e.g. on ideone.com:

memmove (002) 0.0610362
memmove (004) 0.0554264
memmove (008) 0.0575859
memmove (016) 0.057326
memmove (032) 0.0583542
memmove (064) 0.0561934
memmove (128) 0.0549391
memcpy 0.0537919

Hardly anything in it though - no evidence that writing back to an already-faulted-in memory page has much impact, and we're certainly not seeing a halving of time... but it does show that there's nothing making memcpy unnecessarily slower when it is compared apples for apples.

Tony Delroy
  • I would have expected that the CPU caches are not causing the difference because my buffers are much larger than the caches. – cruppstahl Feb 20 '15 at 07:56
  • But each requires the same total number of main-memory accesses, right? (I.e. 100 MB of read, and 100 MB of write.) The cache pattern doesn't get round that. So the only way that one could be slower than the other is if some stuff has to be read/written from/to memory more than once. – Oliver Charlesworth Feb 20 '15 at 07:57
  • @Tony D - My conclusion was to ask people who are smarter than me ;) – cruppstahl Feb 20 '15 at 08:03
  • It feels pretty relevant; a cache only stalls because it's waiting for stuff to be read from main memory (or the next-level cache); in both cases we're reading the same amount. – Oliver Charlesworth Feb 20 '15 at 08:03
  • Is the problem actually that the source/destination regions are aliasing in the (original) memcpy case? – Oliver Charlesworth Feb 20 '15 at 08:08
  • @OliverCharlesworth: maybe you're right... we're certainly not seeing significantly better performance for `memcpy`, so if there is any benefit from having read the same pages into cache that are being written to, it's subtle. I'll edit my answer to leave this an open question and leave it to someone with more CPU architecture knowledge to address.... Thanks for challenging my intuitions. Cheers. – Tony Delroy Feb 20 '15 at 08:23
  • Also, what happens if you copy to the same place, but do `memcpy` first again? – Oliver Charlesworth Feb 20 '15 at 08:27
  • @OliverCharlesworth: the first test run always takes a significant hit, but doing two memcpy tests: memcpy 0.0688002 0.0583162 | memmove 0.0577443 0.05862 0.0601029... see http://ideone.com/8EEAcA – Tony Delroy Feb 20 '15 at 08:33

When you are using memcpy, the writes need to go into the cache. When you use memmove, where you are copying a small step forward, the memory you are copying over will already be in the cache (because it was read 2, 4, 16 or 128 bytes "back"). Try doing a memmove where the destination is several megabytes away (> 4 * the cache size), and I suspect (but can't be bothered to test) that you'll get similar results.
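For instance, a sketch of that experiment (the 32 MB gap is an arbitrary choice for "several megabytes", and the buffer size follows the question's benchmark):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define BUFFERSIZE (100 * 1024 * 1024)
    #define FAR_GAP    (32 * 1024 * 1024)  /* far larger than any cache level */

    int main(void)
    {
        char *b = malloc(BUFFERSIZE);
        memset(b, 1, BUFFERSIZE);          /* fault all pages in up front */

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        /* The destination lines were last read 32 MB "ago" and have long
         * been evicted, so the writes miss the cache, just as with memcpy
         * between two distinct buffers. */
        memmove(b + FAR_GAP, b, BUFFERSIZE - FAR_GAP);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("far memmove %g\n",
               (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
        free(b);
        return 0;
    }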

I guarantee that it is ALL about cache maintenance when you do large memory operations.

Mats Petersson
  • +1 I think for the reasons you mentioned, a backwards-looping memmove is more cache-friendly than memcpy. However, I discovered that when running the memcpy test twice, the second run is as fast as memmove. Why? The buffers are so large that a second run of memcpy should be as inefficient (cache-wise) as the first run. So it seems there are additional factors here which cause the performance penalty. – cruppstahl Feb 20 '15 at 09:49
  • Given the right circumstances, a second `memcpy` will be notably faster simply because the TLB is prefilled. Also, a second `memcpy` won't have to empty the cache of stuff you may need (dirty cache-lines are "bad" for performance in so many ways). To say for sure, however, you'd need to run something like `perf` and sample things like cache misses, TLB misses and so on. – Mats Petersson Feb 20 '15 at 20:15
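For example (a command-line sketch; exact event names vary by CPU and perf version, and `./bench` stands for the compiled benchmark):

    perf stat -e page-faults,cache-misses,dTLB-load-misses ./bench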

Historically, memmove and memcpy were the same function. They worked in the same way and had the same implementation. It was then realised that memcpy doesn't need to be (and frequently wasn't) defined to handle overlapping areas in any particular way.

The end result is that memmove was defined to handle overlapping regions in a particular way even if this impacts performance. memcpy is supposed to use the best algorithm available for non-overlapping regions. The implementations are normally almost identical.

The problem you have run into is that there are so many variations of the x86 hardware that it is impossible to tell which method of shifting memory around will be the fastest. And even if you think you have a result in one circumstance, something as simple as having a different "stride" in the memory layout can cause vastly different cache performance.

You can either benchmark what you're actually doing or ignore the problem and rely on the benchmarks done for the C library.

Edit: Oh, and one last thing: shifting lots of memory contents around is VERY slow. I would guess your application would run faster with something like a simple B-tree implementation to handle your integers. (Oh, you are, okay.)

Edit2: To summarise my expansion in the comments: the microbenchmark is the issue here; it isn't measuring what you think it is. The tasks given to memcpy and memmove differ significantly from each other. If the task given to memcpy is repeated several times, with either memmove or memcpy, the end result will not depend on which memory-shifting function you use UNLESS the regions overlap.
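Concretely (a sketch reconstructing the memcpy/memmove/memcpy/memmove experiment from the comments below; buffer size assumed):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define N (100 * 1024 * 1024)

    /* Time one call of a memcpy-compatible function over the whole buffer. */
    static double timed(void *(*f)(void *, const void *, size_t),
                        void *dst, const void *src)
    {
        struct timespec a, b;
        clock_gettime(CLOCK_MONOTONIC, &a);
        f(dst, src, N);
        clock_gettime(CLOCK_MONOTONIC, &b);
        return (b.tv_sec - a.tv_sec) + (b.tv_nsec - a.tv_nsec) / 1e9;
    }

    int main(void)
    {
        char *src = malloc(N), *dst = malloc(N);
        memset(src, 1, N);
        /* Expectation: the first call is slow (page faults on dst); the
         * 2nd, 3rd and 4th are all the same speed, regardless of whether
         * they are memcpy or memmove. */
        printf("memcpy  %g\n", timed(memcpy,  dst, src));
        printf("memmove %g\n", timed(memmove, dst, src));
        printf("memcpy  %g\n", timed(memcpy,  dst, src));
        printf("memmove %g\n", timed(memmove, dst, src));
        free(src); free(dst);
        return 0;
    }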

user3710044
  • But that's what it's about - I am benchmarking what I'm actually doing. This question is about interpreting the results of the benchmark, which contradict what you are claiming - that memcpy is faster for non-overlapping regions. – cruppstahl Feb 20 '15 at 08:20
  • My application *is* a b-tree! Whenever integers are inserted in a leaf node, memmove is called to make space. I am working on a database engine. – cruppstahl Feb 20 '15 at 08:21
  • You're using a microbenchmark, and you're not even having memcpy and memmove shift the same data. The exact locations in memory where the data you're copying resides make a difference to the caching and to how many round trips to memory the CPU has to make. – user3710044 Feb 20 '15 at 08:24
  • While this answer is correct, it doesn't actually explain *why* it's slower in this case, it's essentially saying "it's slower because in some cases it might be slower". – Oliver Charlesworth Feb 20 '15 at 08:24
  • I'm saying that for the same circumstances, including the same layout of memory to copy/move, the benchmarks WILL be the same, because the implementations are the same. The problem is in the microbenchmark. – user3710044 Feb 20 '15 at 08:26
  • The implementation is not the same. They are in different assembler files with completely different code. – cruppstahl Feb 20 '15 at 08:28
  • @cruppstahl, Ah, okay then. How about this ... forward copies (i.e. memcpy's usual implementation) are sometimes faster. It MAY be faster to copy the B-tree node to new space; beware, however, that this might increase the size of the cache footprint, which would spoil it. But you probably can't tell for sure with a microbenchmark. – user3710044 Feb 20 '15 at 08:33
  • @cruppstahl, Try this: replace the memcpy in your microbenchmark with memmove. Are you surprised at the result? – user3710044 Feb 20 '15 at 08:34
  • @user3710044 - another solution would be to split the node into small blocks (e.g. 512 bytes each) to avoid large memmoves in favour of smaller ones. I have similar structures already in place for other data types, and this works well. – cruppstahl Feb 20 '15 at 08:42
  • @user3710044 - no, I am not... I just discovered the same effect when running the memcpy test twice - the first time is slow, the second time is faster. – cruppstahl Feb 20 '15 at 08:43
  • @cruppstahl, darn, that's there too; looks like you'd have to do a memcpy/memmove/memcpy/memmove list to see what I mean. The first one is slower; the 2nd, 3rd and later are all the same speed, be they memmove or memcpy. That's what I see if I repeat the first memcpy with memmove or memcpy. The performance is the same when they are doing exactly the same job. (PS: Stack Overflow thinks this is getting too long ... chat?) – user3710044 Feb 20 '15 at 08:49
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/71318/discussion-between-cruppstahl-and-user3710044). – cruppstahl Feb 20 '15 at 09:23
  • Seems our proxy at work blocks the chat. Here's what I wrote: @user3710044 - sorry, maybe my comment was misleading. It actually proves your point (and those made by others). In the sequence memcpy/memmove..., the first memcpy is very slow. When running memcpy/memcpy/memmove..., the second memcpy is fast. Which means that it's not the fault of the memcpy implementation, but of the operating system/architecture. Oliver Charlesworth said that "another possible explanation is overcommitting"; another explanation is caching/memory roundtrips, as you wrote. – cruppstahl Feb 20 '15 at 09:34
  • False; memmove and memcpy are different. One takes overlapping into account while the other does not. If the functions are the same, then the library creator was lazy. memmove is for when you expect overlap; memcpy is for when you know there is no overlap. Thus, if properly done, memcpy should be faster, as there is no check for overlap, and it could be faster at the hardware level as well (rep movsd is optimized in the CPU on x86/x64, for example). – rxantos Mar 18 '19 at 09:55

"memcpy is more efficient than memmove." In your case, you most probably are not doing the exact same thing while you run the two functions.

In general, USE memmove only if you have to. USE it when there is a very reasonable chance that the source and destination regions are overlapping.
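A minimal illustration of the difference (sketch):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char a[] = "abcdef";
        /* Overlapping regions: memmove is specified to handle this. */
        memmove(a + 1, a, 5);      /* a becomes "aabcde" */
        printf("%s\n", a);

        char b[] = "abcdef";
        /* memcpy(b + 1, b, 5) would be undefined behavior because the
         * regions overlap; memcpy is only for disjoint regions: */
        memcpy(b, "xyzdef", 6);
        printf("%s\n", b);
        return 0;
    }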

Reference: https://www.youtube.com/watch?v=Yr1YnOVG-4g Dr. Jerry Cain, (Stanford Intro Systems Lecture - 7) Time: 36:00

Ehsan
  • Actually, if there is even a tiny, tiny chance that the regions may overlap, you should always use memmove, because using memcpy on overlapping buffers gives you undefined behavior. – green lantern Jun 01 '23 at 14:02