why std::vector and std::string use "for" loop to copy or move elements?

Question

why std::vector and std::string use "for" loop to copy or move elements?Wouldn't this be bad for performance?

Recently, I am reading the stl source code of libcxx in llvm-libcxx.But I found that the std::vector and std::string use "for" loop to copy or move elements in the container. Wouldn't this be bad for performance compared to processing with some faster functions while the element is basic type, for example memcpy and memmove. Here is the code in std::vector

LIBCPP_INLINE_VISIBILITY static _LIBCPP_CONSTEXPR_SINCE_CXX20 char_type*       copy(char_type* __s1, const char_type* __s2, size_t __n) {
    if (!__libcpp_is_constant_evaluated()) {
        _LIBCPP_ASSERT(__s2 < __s1 || __s2 >= __s1+__n, "char_traits::copy overlapped range");
    }
    char_type* __r = __s1;
    for (; __n; --__n, ++__s1, ++__s2)
        assign(*__s1, *__s2);
    return __r;
}

I want to know why not implement a specialized version vector or string to accelerate the processing.Initially, I thought the compiler would optimize this part of code, but it doesn't. I have test the performance between for-loop and memcpy(I known which is accelerated by SSE or AVX instructions).

Here is the test code:

void func(char *p1, char *p2, int len) {
    memcpy(p1, p2, len);
}

void func(char *p1, char *p2, int len, int f) {
    for (int i = 0; i < len; i++) {
        p1[i] = p2[i];
    }
}

The Result is as follows

//(memcpy) loop for 100000000000 times, cost time 305 us
//(for-loop) loop for 100000000000 times, cost time 1967 us

I use compiler explorer to see the disassembly code. Obviously the compiler doesn't optimize. The compile flag "-Os" is common in project. enter image description here

Edit: I have retest using char_traits::copy.The performance is similar to the memcpy version. And the char_traits does have a specific version char_traits using the std::copy to copy elements.Sorry for not reading very carefully. I got it, Thank you guys.

Have you checked whether the compiler already performs such optimization when the code is compiled? I would think that rather likely. — Jesper Juhl, Apr 01 '23 at 12:26
memcpy can only be used if a class is trivially copyable. For almost all classes it will be necessary to call constructors too. — Pepijn Kramer, Apr 01 '23 at 12:30
@Jesper Juhl I have considerred about this.It seems that the compiler(MSVC, default setting) doesn't do the optimization. — grayondream, Apr 01 '23 at 12:46
Also note that the test comparion is between memcpy and array indexing. The original library code doesn't do indexing. (Apples and oranges :-) — BoP, Apr 01 '23 at 12:47
@Pepijn Kramer I known that. So I said "why not implement a specialized version vector template or string to accelerate the processing". I think it's worth. After all, vector often appears in my code. — grayondream, Apr 01 '23 at 12:48
*"(MSVC, default setting)*" The default setting doesn't do *any* optimizations. Also, libcxx is specifically written for clang, not for MSVC. — BoP, Apr 01 '23 at 12:49
Wouldn't it be bad for performance? Hard to say. With modern hardware and modern compilers - even more so modern compilers when building with optimisation - programmer intuition about what is good or bad for performance is pretty inaccurate. Even more so for small bits of code, such as a loop copying data. The only way to know for sure would be to examine what the compiler emits and (another thing needing skill to get right) to measure/profile program performance. I wouldn't accept your assertion of "bad for performance" without evidence. — Peter, Apr 01 '23 at 13:42

Vincent Saulue-Laborde · Answer 1 · 2023-04-01T21:28:42.420

About the test code/benchmark:

Replacing the copy-loop with memcpy is incorrect

Considering your test code:

void func(char *p1, char *p2, int len, int) {
    for (int i = 0; i < len; i++) {
        p1[i] = p2[i];
    }
}

A c++ compiler replacing this loop with a simple memcpy would be non-standard compliant, because:

The compiler must emit code that works as intended even if p1 and p2 are overlapping buffers (in more pedantic terms: p1[i] and p2[j] can alias the same object, even under the strict aliasing rule).
Using memcpy on 2 overlapping buffers is undefined behaviour anyways.
Given the c++ aliasing rules, func is in fact extremely different from std::memcpy, or even std::memmove (which explicitly allows overlapping buffers) in the overlapping case.

Consider the following program:

#include <iostream>
#include <string>

void func2(char * p1, char * p2, int len) {
    for (int i = 0; i < len; i++) {
        p1[i] = p2[i];
    }
}

int main() {
    std::string foobar = "0123456789";
    func2(foobar.data() + 3, foobar.data(), foobar.size() - 3);
    std::cout << foobar;
}

Live demo(godbolt)

Any standard-compliant c++ compiler must print 0120120120.

Bypassing the aliasing rule

The C language has the restrict keyword to indicate that the compiler can safely assume that an argument/variable doesn't alias anything else. restrict is not standardized in c++, but __restrict is an extension available on modern versions of gcc, clang, and MSVC:

void func2(char * __restrict p1, char * __restrict p2, int len) {
    for (int i = 0; i < len; i++) {
        p1[i] = p2[i];
    }
}

Live demo(godbolt)

gcc & clang -O2, and MSVC /O2 currently all replace this function by a memcpy.

Edit:

About the std::copy implementation:

To answer the primary question, gcc/clang/MSVC compilers are able to optimize a copy-loop into a memcpy if they can safely assume there is no aliasing between the source and the destination.

For copy between containers (std::vector / std::string / other) containing trivially copyable objects, your standard library probably provides this guarantee in some way: either in a standard way using the strict aliasing rules, or by using some compiler built-ins. For example, _LIBCPP_ASSERT in the original question internally emits a __builtin_assume for the compiler.

@n.m. To be fair, the OP has 2 quite independent questions. 1) Why is std::copy implemented that way 2) Why the compiler doesn't seem to optimize copy loop in `func`. I'll edit for point 1 later. — Vincent Saulue-Laborde, Apr 01 '23 at 16:35
The answers are "because it can" and "it does", in that order. Note the function *in the question* asserts if the ranges overlap. Your function doesn't, bit the question is not about your function. — n. m. could be an AI, Apr 01 '23 at 16:36
I\m sorry, the question does ask why a function like `func` is not optimised, I missed it. — n. m. could be an AI, Apr 01 '23 at 17:10
I don't agree with the statement **Replacing the copy-loop with `memcpy` is incorrect**. The standard requires that [the ranges do not overlap](https://eel.is/c++draft/char.traits.require#tab:char.traits.req), thus the replacing with `memcpy()` is always correct. — 273K, Apr 01 '23 at 20:19
@273K This statement is for the loop in OP's `func(char*,char*,int,int)` used for the benchmark. Your quote only describe a precondition for `std::char_traits<*>::copy(...)`, it can't be assumed by the compiler for an unrelated function. — Vincent Saulue-Laborde, Apr 01 '23 at 21:16

score 2 · Answer 2 · answered Apr 01 '23 at 20:24

why std::vector and std::string use "for" loop to copy or move elements?

They don't always. You are looking at the wrong specialization.

Wouldn't this be bad for performance

It would be, but fortunately the compilers use memmove or memcpy when appropriate

I want to know why not implement a specialized version vector or string

The compilers in fact implement specialized versions of algorithms like std::copy. There is no need to specialize std::string or std::vector.

Obviously the compiler doesn't optimize

Your function does not behave like memcpy or memmove when source and destination ranges overlap, so the compiler cannot possibly optimize it down to one of these functions. std::copy and std::char_traits::copy do behave like memcpy when the type of the elements is trivially-copied, so the compilers can and often do optimize them down.

Compilers cannot always optimize an explicit loop down to memcpy even if said loop is in fact equivalent to memcpy. Library writers help compilers by providing appropriate specializations.

--

Note, by the standard char_traits with cannot be instantiated with user-defined types, but some compilers do implement this possibility anyway.

why std::vector and std::string use "for" loop to copy or move elements?

2 Answers2

About the test code/benchmark:

Replacing the copy-loop with memcpy is incorrect

Bypassing the aliasing rule

About the std::copy implementation: