Is copying in a loop less efficient than memcpy()?

Question

I started to study IT and I am discussing with a friend right now whether this code is inefficient or not.

// const char *pName
// char *m_pName = nullptr;

for (int i = 0; i < strlen(pName); i++)
    m_pName[i] = pName[i];

He is claiming that for example memcopy would do the same like the for loop above. I wonder if that's true, I don't believe.

If there are more efficient ways or if this is inefficient, please tell me why!

Thanks in advance!

Possible duplicate of [memcpy vs for loop - What's the proper way to copy an array from a pointer?](http://stackoverflow.com/questions/4729046/memcpy-vs-for-loop-whats-the-proper-way-to-copy-an-array-from-a-pointer) — Inisheer, Feb 13 '16 at 20:52
why don't you just check a memcpy implementation? as an example: https://fossies.org/dox/glibc-2.22/string_2memcpy_8c_source.html https://fossies.org/dox/glibc-2.22/generic_2memcopy_8h.html — murphy, Feb 13 '16 at 20:56
The point about using `memcpy()` is, that it's usually heavily optimized. `memcpy()` will, for instance, not read/write single bytes like your loop does. But in your case, you should be looking at `strncpy()` instead, because that will avoid the second pass through the string to compute its length. — cmaster - reinstate monica, Feb 13 '16 at 20:57

Peter Cordes · Accepted Answer · 2016-02-14T21:12:25.873

I took a look at actual g++ -O3 output for your code, to see just how bad it was.

char* can alias anything, so even the __restrict__ GNU C++ extension can't help the compiler hoist the strlen out of the loop.

I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName even has to be re-loaded every time, because the aliasing rules allow m_pName[i] to alias this->m_pName. The compiler can't assume that storing to m_pName[i] won't change class member variables, or the src string, or anything else.

#include <string.h>
class foo {
   char *__restrict__ m_pName = nullptr;
   void set_name(const char *__restrict__ pName);
   void alloc_name(size_t sz) { m_pName = new char[sz]; }
};

// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
    // char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
    //char *__restrict__ dst = m_pName;  // a local avoids the reload of m_pName, but still can't hoist strlen
    #define dst m_pName
    for (unsigned int i = 0; i < strlen(pName); i++)
        dst[i] = pName[i];
}

Compiles to this asm (g++ -O3 for x86-64, SysV ABI):

...
.L7:
        movzx   edx, BYTE PTR [rbp+0+rbx]      ; byte load from src.  clang uses mov al, byte ..., instead of movzx.  The difference is debatable.
        mov     rax, QWORD PTR [r12]           ; reload this->m_pName    
        mov     BYTE PTR [rax+rbx], dl         ; byte store
        add     rbx, 1
.L3:                                 ; first iteration entry point
        mov     rdi, rbp                       ; function arg for strlen
        call    strlen
        cmp     rbx, rax
        jb      .L7               ; compare-and-branch (unsigned)

Using an unsigned int loop counter introduces an extra mov ebx, ebp copy of the loop counter, which you don't get with either int i or size_t i, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i could produce an infinite loop.

So obviously this is horrible:

a strlen call for every byte copied
copying one byte at a time
reloading m_pName every time through the loop (can be avoided by loading it into a local).

Using strcpy avoids all these problems, because strlen is allowed to assume that it's src and dst don't overlap. Don't use strlen + memcpy unless you want to know strlen yourself. If the most efficient implementation of strcpy is to strlen + memcpy, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy for x86-64. (There is a SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but pollutes the instruction cache, uop-cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64bit, 32bit, and smaller, chunks in the startup/cleanup sections.

Using strcpy of course also avoids bugs like forgetting to store a trailing '\0' character in the destination. If your input strings are potentially gigantic, using int for the loop counter (instead of size_t) is also a bug. Using strncpy is generally better, since you often know the size of the dest buffer, but not the size of the src.

memcpy can be more efficient than strcpy, since rep movs is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy when you already know the length of your data.

Jerry Coffin · Answer 2 · 2016-02-13T21:34:58.767

At best it's somewhat inefficient. At worst, it's quite inefficient.

In the good case, the compiler recognizes that it can hoist the call to strlen out of the loop. In this case, you end up traversing the input string once to compute the length, and then again to copy to the destination.

In the bad case, the compiler calls strlen every iteration of the loop, in which case the complexity becomes quadratic instead of linear.

As far as how to do it efficiently, I'd tend to so something like this:

char *dest = m_pName;

for (char const *in = pName; *in; ++in)
    *dest++ = *in;
*dest++ = '\0';

This traverses the input only once, so it's potentially about twice as fast as the first, even in the better case (and in the quadratic case, it can be many times faster, depending on the length of the string).

Of course, this is doing pretty much the same thing as strcpy would. That may or may not be more efficient still--I've certainly seen cases where it was. Since you'd normally assume strcpy is going to be used quite a lot, it can be worthwhile to spend more time optimizing it than some random guy on the internet typing in an answer in a couple minutes.

g++ doesn't hoist `strlen` out of the loop. I'm pretty sure this is because aliasing rules allow `char*` to alias anything. See my answer. Also, calling `strcpy` will get you an optimized library implementation that will perform much better for large strings (e.g. copying 16B at a time with SSE). Writing out byte-copy loops "manually" should be avoided when there's a library function that might be optimized. — Peter Cordes, Feb 14 '16 at 17:45

πάντα ῥεῖ · Answer 3 · 2016-02-13T21:09:37.420

Depends on interpretation of efficiency. I'd claim using memcpy() or strcpy() more efficient, because you don't write such loops every time you need a copy.

He is claiming that for example memcopy would do the same like the for loop above.

Well, not exactly the same. Probably, because memcpy() takes the size once, while strlen(pName) might be called with every loop iteration potentially. Thus from potential performance efficiency considerations memcpy() would be better.

BTW from your commented code:

// char *m_pName = nullptr;

Initializing like that would lead to undefined behavior without allocating memory for m_pName:

char *m_pName = new char[strlen(pName) + 1];

Why the +1? Because you have to consider putting a '\0' indicating the end of the c-style string.

score 1 · Answer 4 · answered Feb 13 '16 at 20:57

Yes, your code is inefficient. Your code takes what is called "O(n^2)" time. Why? You have the strlen() call in your loop, so your code is recalculating the length of the string every single loop. You can make it faster by doing this:

unsigned int len = strlen(pName);
for (int i = 0; i < len; i++)
    m_pName[i] = pName[i];

Now, you calculate the string length only once, so this code takes "O(n)" time, which is much faster than O(n^2). This is now about as efficient as you can get. However, A memcpy call would still be 4-8 times faster, because this code copies 1 byte at a time, whereas memcpy will use your system's word length.

score 1 · Answer 5 · edited May 23 '17 at 12:25

Yes, it's inefficient, not because you're using a loop instead of memcpy but because you're calling strlen on each iteration. strlen loops over the entire array until it finds the terminating zero byte.

Also, it's very unlikely that the strlen will be optimized out of the loop condition, see In C++, should I bother to cache variables, or let the compiler do the optimization? (Aliasing).

So memcpy(m_pName, pName, strlen(pName)) would indeed be faster.

Even faster would be strcpy, because it avoids the strlen loop:

strcpy(m_pName, pName);

strcpy does the same as the loop in @JerryCoffin's answer.

sh1 · Answer 6 · 2016-02-14T18:40:48.737

For simple operations like that you should almost always say what you mean and nothing more.

In this instance if you had meant strcpy() then you should have said that, because strcpy() will copy the terminating NUL character, whereas that loop will not.

Neither one of you can win the debate. A modern compiler has seen a thousand different memcpy() implementations and there's a good chance it's just going to recognise yours and replace your code either with a call to memcpy() or with its own inlined implementation of the same.

It knows which one is best for your situation. Or at least it probably knows better than you do. When you second-guess that you run the risk of the compiler failing to recognise it and your version being worse than the collected clever tricks the compiler and/or library knows.

Here are a few considerations that you have to get right if you want to run your own code instead of the library code:

What's the largest read/write chunk size that is efficient (it's rarely bytes).
For what range of loop lengths is it worth the trouble of pre-aligning reads and writes so that larger chunks can be copied?
Is it better to align reads, align writes, do nothing, or to align both and perform permutations in arithmetic to compensate?
What about using SIMD registers? Are they faster?
How many reads should be performed before the first write? How much register file needs to be used for the most efficient burst accesses?
Should a prefetch instruction be included?
- How far ahead?
- How often?
- Does the loop need extra complexity to avoid preloading over the end?
How many of these decisions can be resolved at run-time without causing too much overhead? Will the tests cause branch prediction failures?
Would inlining help, or is that just wasting icache?
Does the loop code benefit from cache line alignment? Does it need to be packed tightly into a single cache line? Are there constraints on other instructions within the same cache line?
Does the target CPU have dedicated instructions like rep movsb which perform better? Does it have them but they perform worse?

Going further; because memcpy() is such a fundamental operation it's possible that even the hardware will recognise what the compiler's trying to do and implement its own shortcuts that even the compiler doesn't know about.

~~Don't worry about the superfluous calls to strlen(). Compiler probably knows about that, too.~~ (Compiler should know in some instances, but it doesn't seem to care) Compiler sees all. Compiler knows all. Compiler watches over you while you sleep. ~~Trust the compiler.~~

Oh, except the compiler might not catch that null pointer reference. Stupid compiler!

+1 for mentioning copying in bigger chunks. However, it turns out that `strlen` can't be hoisted out, because `char*` can alias anything. Also, no, CPUs don't generally recognize block-copy loops and peephole-optimize them. I've never heard of any architecture doing that. Aliasing rules mean this loop has different semantics from `strcpy`, so the compiler can't replace it with something optimal. — Peter Cordes, Feb 14 '16 at 17:41
@PeterCordes, for `strlen()` you're quite right. The compiler _does_ detect non-aliasing conditions and do the hoist, but those conditions are rarer than I'd thought (mostly through failure to terminate the string before the end of the object, but then the outcome should be undefined anyway so hoisting should be benign... but gcc doesn't see it that way). I don't expect a CPU to peephole anything (although soft ones might), but to implement pipeline optimisations for known idioms. They often go undocumented, for various reasons, though. — sh1, Feb 14 '16 at 18:03
What would you call a "pipeline optimization"? http://agner.org/optimize/ documents the pipelines of modern x86 CPUs in pretty fair detail. There's nothing special about a copy loop that the pipeline and OOO machinery handles differently. It's just a bunch of byte loads and stores. It will have extra slowdowns if there is false aliasing between the load and store addresses. (e.g. offset by 4k) — Peter Cordes, Feb 14 '16 at 18:17
@PeterCordes proprietary CPUs and DSPs, mostly. Typically in-order pipelines that would otherwise stumble on tight memcpy loops because of the round trip through the register file. — sh1, Feb 14 '16 at 18:22
Oh, and also where the icache or instruction buffer have funny constraints and you have to ensure that the loop is fully within a single cache line and that there are no other branches in the same cache line or buffer. Not _exactly_ a pipeline optimisation, but constraints that the compiler won't impose automatically on _every_ branch or label because it would burn up so much icache. — sh1, Feb 14 '16 at 18:39

score 1 · Answer 7 · edited Feb 18 '17 at 17:13

This code is confused in various ways.

Just do m_pName = pName; because you're not actually copying the string. You're just pointing to the one you've already got.
If you want to copy the string m_pName = strdup(pName); would do it.
If you already have storage, strcpy or memcpy would do it.
In any case, get strlen out of the loop.
This is the wrong time to worry about performance. First get it right.
If you insist on worrying about performance, it's hard to beat strcpy. What's more, you don't have to worry about it being right.

score 0 · Answer 8 · answered Feb 13 '16 at 20:57

0

As a matter of fact, why do you need to copy at all ??? (either with the loop or memcpy)

if you want to duplicate a memory block, thats a different question, but since its a pointer all you need is &pName[0] (which is the address of the first location of the array) and sizeof pName ... thats it ... you can reference any object in the array by incrementing the address of first byte and you know the limit using the size value ... why have all these pointers ???(let me know if there is more to that than theoretical debate)

answered Feb 13 '16 at 20:57

a.atlam

742
5
17

Sometimes you do need to copy, e.g. before reusing the src buffer for something else. Minimizing copies is a good suggestion, but not a complete answer to this question. – Peter Cordes Feb 14 '16 at 16:36
yes very true ... but he is starting out the code with a pointer to the beginning of a string (end of string defined by the null termination) which is a very efficient system where you can fully describe a memory block using only starting address and the null termination, and trying to make a pointer to each and every byte of the string till the null termination. (why on earth ??) if you want to make a copy, then copy the block not the pointers ... if you want to have an extra pointer, then copy only first address, ie new pointer ... but an array of pointers ??? that was beyond me !! – a.atlam Feb 14 '16 at 17:45
No, his loop is just a badly-written `strcpy`. He's not storing a pointer to each char. – Peter Cordes Feb 14 '16 at 17:47
ohhhh ... I see now ... seems I have been working here for too long !! – a.atlam Feb 14 '16 at 17:58

Is copying in a loop less efficient than memcpy()?

8 Answers8