I took a look at actual g++ -O3
output for your code, to see just how bad it was.
char*
can alias anything, so even the __restrict__
GNU C++ extension can't help the compiler hoist the strlen
out of the loop.
I was thinking it would be hoisted, and expecting that the major inefficiency here was just the byte-at-a-time copy loop. But no, it's really as bad as the other answers suggest. m_pName
even has to be re-loaded every time, because the aliasing rules allow m_pName[i]
to alias this->m_pName
. The compiler can't assume that storing to m_pName[i]
won't change class member variables, or the src string, or anything else.
#include <string.h>
class foo {
char *__restrict__ m_pName = nullptr;
void set_name(const char *__restrict__ pName);
void alloc_name(size_t sz) { m_pName = new char[sz]; }
};
// g++ will only emit a non-inline copy of the function if there's a non-inline definition.
void foo::set_name(const char * __restrict__ pName)
{
// char* can alias anything, including &m_pName, so the loop has to reload the pointer every time
//char *__restrict__ dst = m_pName; // a local avoids the reload of m_pName, but still can't hoist strlen
#define dst m_pName
for (unsigned int i = 0; i < strlen(pName); i++)
dst[i] = pName[i];
}
Compiles to this asm (g++ -O3 for x86-64, SysV ABI):
...
.L7:
movzx edx, BYTE PTR [rbp+0+rbx] ; byte load from src. clang uses mov al, byte ..., instead of movzx. The difference is debatable.
mov rax, QWORD PTR [r12] ; reload this->m_pName
mov BYTE PTR [rax+rbx], dl ; byte store
add rbx, 1
.L3: ; first iteration entry point
mov rdi, rbp ; function arg for strlen
call strlen
cmp rbx, rax
jb .L7 ; compare-and-branch (unsigned)
Using an unsigned int
loop counter introduces an extra mov ebx, ebp
copy of the loop counter, which you don't get with either int i
or size_t i
, in both clang and gcc. Presumably they have a harder time accounting for the fact that unsigned i
could produce an infinite loop.
So obviously this is horrible:
- a
strlen
call for every byte copied
- copying one byte at a time
- reloading
m_pName
every time through the loop (can be avoided by loading it into a local).
Using strcpy
avoids all these problems, because strlen
is allowed to assume that it's src and dst don't overlap. Don't use strlen
+ memcpy
unless you want to know strlen
yourself. If the most efficient implementation of strcpy
is to strlen
+ memcpy
, the library function will internally do that. Otherwise, it will do something even more efficient, like glibc's hand-written SSE2 strcpy
for x86-64. (There is a SSSE3 version, but it's actually slower on Intel SnB, and glibc is smart enough not to use it.) Even the SSE2 version may be unrolled more than it should be (great on microbenchmarks, but pollutes the instruction cache, uop-cache, and branch-predictor caches when used as a small part of real code). The bulk of the copying is done in 16B chunks, with 64bit, 32bit, and smaller, chunks in the startup/cleanup sections.
Using strcpy
of course also avoids bugs like forgetting to store a trailing '\0'
character in the destination. If your input strings are potentially gigantic, using int
for the loop counter (instead of size_t
) is also a bug. Using strncpy
is generally better, since you often know the size of the dest buffer, but not the size of the src.
memcpy
can be more efficient than strcpy
, since rep movs
is highly optimized on Intel CPUs, esp. IvB and later. However, scanning the string to find the right length first will always cost more than the difference. Use memcpy
when you already know the length of your data.