faster way than memcpy to copy 0-terminated string

Question

I have a question about duplicating a 0-terminated string:

const char * str = "Hello World !";
size_t getSize = strlen(str);
char * temp = new char[getSize + 1];

I know I can use this function:

memcpy(temp, str, getSize);

But I want to use my own copy function, which works like this:

int Count = 0;
while (str[Count] != '\0') {
    temp[Count] = str[Count];
    Count++;
}

Both ways work. Now I want to time each one over 10 million iterations. For memcpy I do this:

const char * str = "Hello World !";
size_t getSize = strlen(str);
for (size_t i = 0; i < 10000000; i++) {
    char * temp = new char[getSize + 1];
    memcpy(temp, str, getSize);
}

And this is my own way:

    const char * str = "Hello World !";
    size_t getSize = strlen(str);
    for (size_t i = 0; i < 10000000; i++) {
        char * temp = new char[getSize + 1];
        int Count = 0;
        while (str[Count] != '\0') {
            temp[Count] = str[Count];
            Count++;
        }
    }

The first loop finishes in 420 milliseconds and the second in 650 milliseconds. Why? Both approaches look the same to me! I want to use my own function, not memcpy. Is there any way to make my own loop as fast as memcpy, or maybe faster? How can I change my while loop so that it is at least as fast as memcpy?

full source

#include <chrono>
#include <cstring>
#include <iostream>
using namespace std;

int main() {

    const char * str = "Hello world !";
    size_t getSize = strlen(str);

    // Version 1: memcpy copies getSize bytes (the terminating '\0' is not copied)
    auto start_t = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; i++) {
        char * temp = new char[getSize + 1];   // allocated on every iteration, never freed
        memcpy(temp, str, getSize);
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";


    // Version 2: hand-written loop that stops at the terminating '\0'
    start_t = chrono::high_resolution_clock::now();
    for (size_t i = 0; i < 10000000; i++) {
        char * temp = new char[getSize + 1];   // allocated on every iteration, never freed
        int done = 0;
        while (str[done] != '\0') {
            temp[done] = str[done];
            done++;
        }
    }
    cout << chrono::duration_cast<chrono::milliseconds>(chrono::high_resolution_clock::now() - start_t).count() << " milliseconds\n";

    return 0;
}

results:

482 milliseconds
654 milliseconds

Relying on a `'\0'` character at the end of the array doesn't do the same as `memcpy()` does. If you want to handle this case only, you're probably better off with `strcpy()` than rolling your own function (there may be certain tricks used in the implementation that make it even faster than your implementation). – user0042, Jul 16 '17 at 10:43
Why do you think you can outsmart the creators of your compiler's standard library? – PaulMcKenzie, Jul 16 '17 at 10:43
Why do you want to use your own function if memcpy is faster? – melpomene, Jul 16 '17 at 10:44
I don't want to use memcpy or strcpy ... I want to write my own loop that runs (while) until it reaches the end ... – myOwnWays, Jul 16 '17 at 10:45
Micro-optimisation is a hell to nowhere. SIXSIGMA read up on that – Ed Heal, Jul 16 '17 at 10:45
@AlirezaSaeedipour *i want to make my own action* -- Then that is *your* homework you've made for yourself. As you can see, writing fast functions requires much more than knowing how to write a loop. – PaulMcKenzie, Jul 16 '17 at 10:47
It's for my other program's behavior ... and I also want to know why ... is memcpy magic? I'm sure memcpy works this way too (checks each char and copies them one by one) ... – myOwnWays, Jul 16 '17 at 10:48
I'm pretty sure that the cost of `new` dwarfs the cost of copying 14 bytes! – Richard Hodges, Jul 16 '17 at 10:48
@myOwnWays Many compiler implementations use assembly language to copy buffers. Again, why do you think you can outsmart some of the best programmers in the industry? – PaulMcKenzie, Jul 16 '17 at 10:49
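
To follow up on the comment about `new`: one way to check how much of the measured time is allocation rather than copying is to allocate the buffer once and time only the copies. The helper below is a rough sketch, not part of the original question; the name `time_copies_ms` and its parameters are made up for illustration, and an optimizing compiler may remove or merge repeated copies, so the numbers should be read with care.

```cpp
#include <chrono>
#include <cstddef>
#include <cstring>

// Hypothetical helper: runs copy(dst, src, len) `iterations` times on a
// buffer that the caller allocated once, and returns the elapsed milliseconds.
template <typename Copy>
long long time_copies_ms(Copy copy, char* dst, const char* src,
                         std::size_t len, std::size_t iterations) {
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < iterations; ++i) {
        copy(dst, src, len);  // no allocation inside the timed loop
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

It could be called with a lambda such as `[](char* d, const char* s, std::size_t n) { std::memcpy(d, s, n + 1); }` and again with the hand-written loop, so that both versions are timed without the cost of `new`.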

Sergey Kalinichenko · Answer 1 · 2017-07-16T10:51:01.610

4

Replacing library functions with your own often leads to inferior performance.

memcpy represents a very fundamental memory operation. Because of that, it is highly optimized by its authors. Unlike a "naïve" implementation, the library version moves more than a single byte at a time whenever possible, and uses hardware assistance on platforms where it is available.

Moreover, the compiler itself "knows" about the inner workings of memcpy and other library functions, and it can optimize the calls out completely in cases where the length is known at compile time.

Note: Your implementation has semantics of strcpy, not memcpy.
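
To make the difference in semantics concrete, here is a minimal sketch of the two styles side by side. The function names `duplicate_by_length` and `duplicate_by_sentinel` are invented for this illustration, and the first one assumes that `len` equals `strlen(src)`.

```cpp
#include <cstddef>
#include <cstring>

// memcpy style: the length is known up front, so no byte has to be tested
// for '\0' while copying. Assumes len == strlen(src); copying len + 1 bytes
// includes the terminator.
char* duplicate_by_length(const char* src, std::size_t len) {
    char* dst = new char[len + 1];
    std::memcpy(dst, src, len + 1);
    return dst;
}

// strcpy style: every byte is compared against '\0' before it is copied,
// and the terminator has to be written explicitly at the end.
char* duplicate_by_sentinel(const char* src) {
    char* dst = new char[std::strlen(src) + 1];
    std::size_t i = 0;
    while (src[i] != '\0') {
        dst[i] = src[i];
        ++i;
    }
    dst[i] = '\0';
    return dst;
}
```

Both functions return a buffer that the caller must release with `delete[]`. Note that the loop in the question stops before writing the terminator, and its `memcpy` call copies only `getSize` bytes, so neither version there produces a 0-terminated copy.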

edited Jul 16 '17 at 10:51

answered Jul 16 '17 at 10:48

Sergey Kalinichenko


OK, I want to know how! Surely memcpy checks each char too (to copy them one by one), so why would it be faster? – myOwnWays Jul 16 '17 at 10:49
1

@myOwnWays Read the library implementation(s) source code and/or check the generated assembler in release (optimised) build. – Richard Critten Jul 16 '17 at 10:51
3

@myOwnWays _"memcpy check each char too !"_ Huh? No, it doesn't. – user0042 Jul 16 '17 at 10:51
4

@myOwnWays No, it does not - `memcpy` goes by length, not by null terminator (it's `strcpy` that goes by null terminator). – Sergey Kalinichenko Jul 16 '17 at 10:51

user0042 · Answer 2 · 2017-07-16T10:53:28.520

1

... both of those ways are same !

No, they aren't:

memcpy() doesn't check whether each character is '\0'.
The implementers may have applied more optimizations than your naive approach has.

It's unlikely that your approach can be made faster than memcpy().

edited Jul 16 '17 at 10:53

answered Jul 16 '17 at 10:46

user0042


1

Please delete your answer, and post it as a comment instead. – Khaled.K Jul 16 '17 at 10:47
2

@Khaled.K Why so? My answer well explains the difference. – user0042 Jul 16 '17 at 10:48
So how does memcpy create a copy from str to temp? – myOwnWays Jul 16 '17 at 10:48
@myOwnWays It uses the size, instead of a sentinel. – user0042 Jul 16 '17 at 10:50
Does using the size change anything? I checked that too! Nothing changed in speed. – myOwnWays Jul 16 '17 at 10:51
2

This is not an answer, because what you answered was not the question. – Khaled.K Jul 16 '17 at 10:51
@Khaled.K I added the answer in a conclusive paragraph. – user0042 Jul 16 '17 at 10:54
1

@Khaled.K it does answer the "implementations are same, how can this happen?" question – Ap31 Jul 16 '17 at 11:01
This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post. - [From Review](/review/low-quality-posts/16726860) – rethab Jul 16 '17 at 22:06
@rethab Are you _robo reviewing_?? Of course that provides an answer to the question. Read before clicking buttons please! – user0042 Jul 16 '17 at 22:17

m0h4mm4d · Answer 3 · 2017-07-17T04:57:00.187

The fact that you didn't use pointers, and that you are comparing what you are doing (strcpy) with memcpy, clearly shows that you are a beginner. As everyone else has already said, it is difficult to outsmart experienced programmers like those who wrote your library.

But I'm gonna give you some hints to optimize your code. I took a quick look at Microsoft's C Standard Library implementation (dubbed the C Runtime Library), and they are doing it in assembly, which can be faster than doing it in C. So that is one point for speed.

On most 32-bit architectures with 32-bit buses, the CPU can fetch 32 bits of information from memory in one request (assuming the data is properly aligned). Even if you only need 16 bits or 8 bits, it still has to make that one request. So working with your machine's word size will probably give you some speedup.
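
A rough sketch of the word-at-a-time idea, assuming the length is already known (as it is with memcpy). The function name `copy_by_words` is made up here. The fixed-size `std::memcpy` calls are only the standard-conforming way to load and store a whole word without alignment or aliasing problems; compilers typically turn each of them into a single load or store instruction.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper: copies n bytes, eight at a time where possible.
void copy_by_words(char* dst, const char* src, std::size_t n) {
    std::size_t i = 0;
    // Main loop: move one 64-bit word per iteration.
    for (; i + sizeof(std::uint64_t) <= n; i += sizeof(std::uint64_t)) {
        std::uint64_t word;
        std::memcpy(&word, src + i, sizeof word);   // load 8 bytes
        std::memcpy(dst + i, &word, sizeof word);   // store 8 bytes
    }
    // Tail: copy the remaining 0 to 7 bytes one by one.
    for (; i < n; ++i) {
        dst[i] = src[i];
    }
}
```

For the 14 bytes in the question (13 characters plus the terminator) this does one word copy and six byte copies, so the gain on such a short string is small; the technique pays off on longer buffers.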

Lastly I want to direct your attention to SIMD. If your CPU provides it, you can use it and gain that extra speed. Again MSCRT has some SSE2 optimization options.

In the past, from time to time, I had to write code that outperformed my library's implementation, because I had a specific need or a specific type of data that I could optimize for. While doing so might have some educational value, unless it is specifically needed, your time is better spent on your actual code than on re-implementing your library functions.
