
With gcc 5.3, both functions in the following example generate a call to memmove. Would it be inappropriate to generate a memcpy?

#include <vector>

int blackhole(const std::vector<int>&);

int copy_vec1(const std::vector<int>& v1) {
    const std::vector<int> v2{v1.begin(), v1.end()};
    return blackhole(v2);
}

int copy_vec2(const std::vector<int>& v1) {
    const auto v2 = v1;
    return blackhole(v2);
}

Example on godbolt.

Praxeolitic
  • I assume you meant `v1.begin(), v1.end()` ? – milleniumbug May 12 '16 at 00:12
  • @milleniumbug d'oh, thanks (question still stands) – Praxeolitic May 12 '16 at 00:12
  • What is the goal of this question? Say you get an answer, what do you plan on doing with it? – xaxxon May 12 '16 at 00:13
  • @xaxxon If there is a reason then I will add it to my collection of knowledge. – Praxeolitic May 12 '16 at 00:14
  • In the comments [here](https://blogs.msdn.microsoft.com/vcblog/2016/04/14/stl-fixes-in-vs-2015-update-2/) STL explains that the code generated from `memmove` and `memcpy` is basically identical for MSVC, so it's easier to always use the former. This might also be the case with gcc. – Praetorian May 12 '16 at 00:29
  • @Praetorian: gcc doesn't have its own implementation of `memcpy` and `memmove`; it uses whatever is provided by the system's library. glibc is the library implementation used on Linux; looking at the source code, it seems that `memcpy` may be more efficient than `memmove`. – Keith Thompson May 12 '16 at 00:33
  • [This answer](http://stackoverflow.com/a/28624936/1505939) suggests a cache-related explanation – M.M May 12 '16 at 00:57
  • @M.M: that answer you linked is to a question where the OP tested `memmove` with overlapping src and dst (small offset), but `memcpy` with a separate dst buffer. So the answer applies because the OP was testing memcpy in a different situation than memmove. In the no-overlap case, they should both have the same memory access pattern. – Peter Cordes May 12 '16 at 02:14
  • @KeithThompson: gcc does have a `__builtin_memcpy` which it can inline when it wants to, if it isn't overridden by an inline definition in a header. And yes, this could happen for source code that calls `memcpy`, not `__builtin_memcpy`. Same for `sqrt`. See [`-mmemcpy-strategy=rep_byte`](https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html#index-mstringop-strategy_003d_0040var_007balg_007d-2934) for example, which would be appropriate when tuning for Intel IvyBridge and later (with fast `rep movsb`). Different settings of `-mtune` produce different behaviours wrt. inlining of block copies. – Peter Cordes May 12 '16 at 02:20
  • -1 for tagging c++14. A question about a specific implementation decision is anything but determined by the standard. – dhein May 12 '16 at 08:22
  • @Zaibis The question is about a specific decision in an implementation of C++14. The example isn't even valid C++03. Besides syntax, the version matters because gcc's libstdc++ contains macros that depend on the version. – Praxeolitic May 12 '16 at 18:53
  • Gcc will optimize memmove to memcpy when it determines that it is safe. In this case, some minor missed optimization obscures from the rest of the optimization passes that the memory you are writing to comes from operator new and thus cannot overlap the source. – Marc Glisse May 13 '16 at 05:27
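Illustrating the point in Marc Glisse's comment above (and in Peter Cordes's godbolt example under the first answer below): once GCC can prove the source and destination cannot overlap, it is free to treat a memmove as a memcpy. A minimal sketch, not from the question, using the __restrict extension to supply that guarantee; the function names are illustrative:

#include <cstring>
#include <cstddef>

// GCC must assume these ranges might overlap, so the memmove call stays.
void copy_maybe_overlapping(char* dst, const char* src, std::size_t n) {
    std::memmove(dst, src, n);
}

// With __restrict (a GCC/Clang extension) the compiler is told the ranges
// cannot overlap, so it is free to emit memcpy or an inline copy instead,
// as the godbolt link in Peter Cordes's comment below demonstrates.
void copy_disjoint(char* __restrict dst, const char* __restrict src, std::size_t n) {
    std::memmove(dst, src, n);
}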

3 Answers


I tried compiling this code using g++ 6.1.0. I'm not at all certain of the details, but I think that the memmove call is not directly generated by the compiler; rather, it's in the code that implements <vector>.

When I preprocess the code using

/o/apps/gcc-6.1.0/bin/g++ -E -std=c++14 c.cpp

I see two calls to __builtin_memmove, both coming from .../include/c++/6.1.0/bits/stl_algobase.h. Looking at that header file, I see this comment:

// All of these auxiliary structs serve two purposes.  (1) Replace
// calls to copy with memmove whenever possible.  (Memmove, not memcpy,
// because the input and output ranges are permitted to overlap.)
// (2) If we're using random access iterators, then write the loop as
// a for loop with an explicit count.

I think what's happening is that the code invoked for copying the vector is more generally applicable to copies where the ranges can overlap (such as a call to the std::move algorithm, perhaps).
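For instance, this overlapping copy is legal for std::copy (the destination does not start inside the source range) and is exactly the case where memmove semantics matter; the example is an added illustration, not from the library:

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3, 4, 5};
    // Shift everything left by one: the source [begin()+1, end()) and the
    // destination starting at begin() overlap. memmove handles this;
    // memcpy on these ranges would be undefined behaviour.
    std::copy(v.begin() + 1, v.end(), v.begin());
    // v is now {2, 3, 4, 5, 5}
}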

(I have not confirmed that the memmove calls that appear in the assembly listing correspond to the __builtin_memmove calls in stl_algobase.h. I invite anyone else to follow up on that point.)

Depending on the implementation, memmove() can have some overhead relative to memcpy(), but the difference is minor. It probably just wasn't worth creating special-case code for copies that can't overlap.
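As a rough sketch of where that overhead comes from, memmove essentially has to pick a safe copy direction, which costs one extra comparison; this is an illustration, not glibc's actual code:

#include <cstddef>

// Simplified picture of what memmove does beyond memcpy: compare the
// pointers and copy in whichever direction is safe for overlapping ranges.
// (Real implementations compare pointers to unrelated objects, which is
// fine inside the library even though it is not portable user code.)
void* memmove_sketch(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    if (d < s) {
        for (std::size_t i = 0; i < n; ++i) d[i] = s[i];            // forward
    } else if (d > s) {
        for (std::size_t i = n; i != 0; --i) d[i - 1] = s[i - 1];   // backward
    }
    return dst;
}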

Keith Thompson
  • How could the input/output ranges possibly overlap in this case? We're copy-constructing! – Barry May 12 '16 at 01:55
  • @Barry: It can't. I'm suggesting that the code being invoked is also used for cases where the range *can* overlap. – Keith Thompson May 12 '16 at 01:56
  • Fun fact: x86 gcc will compile `memmove` to an inline `rep movsd` *if* [it knows that the src and dst don't overlap (`restrict`)](https://godbolt.org/g/DNia0o). Only certain `-mtune` settings ever inline memcpy/memmove in the first place, e.g. `-m32 -mtune=intel` (but not `-mtune=haswell`). – Peter Cordes May 12 '16 at 03:39
  • I guess you can have cases like this? `vector v{1, 2}; v.reserve(v.size() + 2); v.insert(v.begin() + 1, v.begin(), v.end())`. Or is this forbidden? – Johannes Schaub - litb May 12 '16 at 07:04
  • @JohannesSchaub-litb That's banned. – T.C. May 12 '16 at 08:24
  • But the existence of both `memcpy()` and `memmove()` hints that there *must* be a non-negligible difference, otherwise only `memmove()` would exist, right? – dats May 12 '16 at 09:04
  • @dats: There's likely to be *some* difference (though a conforming implementation could use exactly the same code for both). The question is whether it's worth the effort to figure out which one to use in every case. I haven't studied the generated code, or the headers, in any great depth, but I can imagine that the advantages of sharing code for different cases exceed the minor speed benefit of using `memcpy()`. By all means use `memcpy()` if you're making an explicit call and you happen to know that the source and destination don't overlap. – Keith Thompson May 12 '16 at 15:41

TL;DR: GCC doesn't optimize the call to memmove inside std::copy when the destination is a vector, but with two C-style arrays it does. Replacing &v2[0] with *v2.data() allows it to optimize into a memcpy.


Your example is pretty noisy, so let's strip it down:

#include <vector>
#include <algorithm>

int a[5];
int b[5];
std::vector<int> v2;

I deliberately put the variables at file scope to prevent optimizing them away without having to deal with volatile semantics.

First let's try:

std::copy(&a[0], &a[5], &b[0]);

With -O3 -fdump-tree-optimized this becomes:

__builtin_memcpy (&b[0], &a[0], 20);

Stepping through GDB shows us:

Breakpoint 1, main () at test.cpp:9
9       std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::copy<int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9       std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=0x601080 <b>, __last=0x6010b4, __first=0x6010a0 <a>) at test.cpp:9
9       std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:9
9       std::copy(&a[0], &a[0] + 5, &b[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>, 
    __first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382         __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
main () at test.cpp:10
10  }

Wait, it used memmove?! OK, let's keep going.

What about:

std::copy(&a[0], &a[5], v2.begin());

OK that gets us memmove:

int * _2;

<bb 2>:
_2 = MEM[(int * const &)&v2];
__builtin_memmove (_2, &a[0], 20);

This is reflected in the assembly if we compile with -S. Stepping through GDB shows us the process:

(gdb) 
Breakpoint 1, main () at test.cpp:9
9   {
(gdb) s
10      std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::copy<int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10      std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a2<false, int*, int*> (__result=<optimized out>, __last=0x6010d4, __first=0x6010c0 <a>) at test.cpp:10
10      std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move_a<false, int*, int*> (__result=<optimized out>, __last=<optimized out>, __first=<optimized out>) at test.cpp:10
10      std::copy(&a[0], &a[5], &v2[0]);
(gdb) s
std::__copy_move<false, true, std::random_access_iterator_tag>::__copy_m<int> (__result=<optimized out>, __last=<optimized out>, 
    __first=<optimized out>) at /usr/include/c++/5.3.1/bits/stl_algobase.h:382
382         __builtin_memmove(__result, __first, sizeof(_Tp) * _Num);
(gdb) s
__memmove_ssse3 () at ../sysdeps/x86_64/multiarch/memcpy-ssse3.S:55

Ah, I see. It's using an optimized memcpy routine provided by the C library. But wait a minute, that doesn't make sense. memmove and memcpy are two different things!

Looking at the source code for this routine, we see little checks sprinkled throughout:

#ifndef USE_AS_MEMMOVE
        cmp     %dil, %sil
        jle     L(copy_backward)
#endif

GDB confirms that it is treating it as a memmove:

55      mov %rdi, %rax
(gdb) s
61      cmp %rsi, %rdi
(gdb) s
62      jb  L(copy_forward)
(gdb) s
63      je  L(write_0bytes)

But if we replace &v2[0] with *v2.data() it doesn't call glibc's memmove. So what's going on?

Well, &v2[0] and v2.begin() return iterators while v2.data() returns a direct pointer to the memory. I think this, for some reason, prevents GCC from optimizing the memmove into a memcpy. [citation needed]
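To make the comparison easy to reproduce, here are the three spellings of the destination side by side. This is an illustrative arrangement, not the answer's original test: v2 is given a size here so the copies are valid at runtime, unlike the stripped-down globals above. Note that the comment below reports identical generated code for all three:

#include <algorithm>
#include <vector>

int a[5];
std::vector<int> v2(5);   // sized so that copying five ints into it is valid

void via_subscript() { std::copy(&a[0], &a[5], &v2[0]); }      // pointer from operator[]
void via_iterator()  { std::copy(&a[0], &a[5], v2.begin()); }  // vector::iterator
void via_data()      { std::copy(&a[0], &a[5], v2.data()); }   // pointer from data()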

user6323422
  • `*v2.data()` I assume the `*` is a typo. But with `v2.data()` I see the exact same generated code as with `v2.begin()` or `&v2[0]`... – Marc Glisse May 13 '16 at 05:06

The implementor's rationale for using memmove over memcpy might be flawed in this case.

memmove differs from memcpy in that the memory regions in memmove may overlap (and it's therefore conceptually slightly less efficient).

memcpy has the constraint that the two memory regions must not overlap.

In the case of the vector's copy constructor, the memory regions will never overlap, so it could be argued that memcpy would be the better choice.
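A small added example of that contract difference:

#include <cstring>

int buf[8] = {1, 2, 3, 4, 5, 6, 7, 8};
int other[8];

int main() {
    // Overlapping regions: memmove is defined, memcpy would be undefined behaviour.
    std::memmove(buf + 1, buf, 6 * sizeof(int));
    // std::memcpy(buf + 1, buf, 6 * sizeof(int));   // UB: the regions overlap

    // Disjoint regions: both are fine; memcpy can skip the overlap handling.
    std::memcpy(other, buf, sizeof buf);
}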

Richard Hodges
  • Maybe -- but take a look at my answer. – Keith Thompson May 12 '16 at 00:59
  • @KeithThompson Yes I think "not worth the trouble" sums it up. Vector copy constructors are unlikely to be called in tight loops, and if they are, the memory allocation cost outweighs the possible prediction-related stall in the (impossible) case of overlapping regions. So at worst the cost of invoking memmove is one redundant compare, which will be correctly predicted. – Richard Hodges May 12 '16 at 09:21