Why does this implementation of strcpy cost more time?

Question

I'm a newcomer to assembly language. I've written two strcpy implementations in MASM: one uses rsi and rdi, and the other does not. The second one takes less time. As I understand it, using rsi and rdi is the recommended way to copy data, and the second version has a larger loop body than the first. But when I measured the performance, the first version took more time. Why does the first version take more time, and what is the recommended way (recommended instructions or registers) to handle strings in x86-64 assembly?

strcpy using rsi and rdi:

custom_strcpy proc

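    ; Windows x64 convention: rcx = dst, rdx = src.
    ; Note: rsi and rdi are callee-saved there; this routine overwrites them without saving.
    mov     rsi,    rdx
    mov     rdi,    rcx
    mov     rax,    rdi             ; return value = dst

_loop:

    movsb                           ; copy one byte from [rsi] to [rdi], advance both
    mov     r8b,    byte ptr[rsi]   ; load only the next byte (a dword load would test 4 bytes at once)
    cmp     r8b,    0
    jne     _loop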

_end:

    mov     byte ptr[rdi],  0
    ret


custom_strcpy endp

strcpy not using rsi and rdi:

custom_strcpy proc

    mov     rax,    rcx

_loop:
    mov     r8b,    byte ptr[rdx]
    mov     byte ptr[rcx],  r8b
    inc     rcx
    inc     rdx
    cmp     r8b,        0
    jne     _loop

    ret

custom_strcpy endp

C++ code I used to measure the performance:

#include <iostream>
#include <chrono>
#include <cstring>

#define TIMES 100000000

using namespace std;
using namespace std::chrono;

extern "C" char * custom_strcpy(char * dst, const char * src);

extern "C" void foo()
{
    char src[] = "Hello, world!";
    char dst[sizeof(src)];

    auto start = high_resolution_clock::now();
    for (int i = 0; i < TIMES; i++)
    {
        strcpy(dst, src);
    }
    auto end = high_resolution_clock::now();
    cout << duration_cast<duration<double>>(end - start).count() << endl;

    start = high_resolution_clock::now();
    for (int i = 0; i < TIMES; i++)
    {
        custom_strcpy(dst, src);
    }
    end = high_resolution_clock::now();
    cout << duration_cast<duration<double>>(end - start).count() << endl;
}

Some instructions are slower than others. e.g. `movsb` is 5 uops, with throughput = one per 4 clock cycles, on Skylake. http://agner.org/optimize/ and other performance links in [the x86 tag wiki](https://stackoverflow.com/tags/x86/info). Your second loop is all simple instructions that are 1 uop. Also, your `movsb` version has to reload from memory to check for `0`, instead of just checking a value still in a register. — Peter Cordes, May 19 '18 at 04:59
And BTW, your 2nd loop still won't quite run at 1 byte per cycle on Intel CPUs, but with SSE2 or AVX2 `strcpy` can go at 16 or 32 bytes per cycle. (Using `pcmpeqb` / `pmovmskb` to check for `0` bytes in 16 bytes in parallel.) See glibc's actual SSE2 implementation (AT&T syntax, sorry, and very complicated, but the main loop is at `L(Unaligned64Loop_start):` and loads+checks a whole cache line (4 vectors) for zeros, while storing the cache line loaded the previous iteration.) So to answer your question, **SSE2 SIMD is the recommended way to handle strings in x86-64**. — Peter Cordes, May 19 '18 at 05:13
BTW, this would be a better question if you included the actual performance numbers, and what hardware you tested on. (But `movs` is slow on all modern CPUs, AMD and Intel. [`rep movs` is fast, almost as good as a SIMD loop](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy), but you have to know the length; it implements memcpy, not strcpy.) — Peter Cordes, May 19 '18 at 05:24
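Because a bulk copy needs the length up front, one way to build strcpy on top of it is two passes: find the terminator, then copy. The C++ sketch below shows that composition with the standard `strlen` and `memcpy`. The name `strcpy_via_memcpy` is hypothetical, and whether `memcpy` ends up using `rep movsb` internally depends on the library and the CPU, so treat that part as an assumption.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical helper: strcpy expressed as "find the length, then memcpy".
char* strcpy_via_memcpy(char* dst, const char* src) {
    const std::size_t len = std::strlen(src);  // first pass: locate the terminator
    std::memcpy(dst, src, len + 1);            // second pass: bulk copy, terminator included
    return dst;
}
```

The string is read twice, but both passes can run many bytes per cycle, which is why this can still beat a one-byte-at-a-time loop on longer inputs.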
*"Why the former one costs more time"* - the performance of code is the result of many factors. The number of instructions is only a minor factor; the type of instruction matters a bit more (for example, 20 `div r64` instructions will be slower than 2x-4x as many `mul r32` instructions), and in loops the dependencies between iterations and the layout of the data in memory are another major factor. If your string were in a multi-byte encoding where the terminating zero can also appear inside a character's encoding (not possible in UTF-8), you would need to parse the string properly character by character, which is much slower. — Ped7g, May 19 '18 at 07:11
The fastest (and most correct) code, as always, is the code that doesn't get executed at all, or doesn't even exist, so the recommended way to do anything in assembly is to not do it at all (unless it is absolutely necessary). For example, if you copy strings because you are passing them as value arguments into many subroutines, you can avoid that by keeping a single read-only copy of the string and passing a pointer to it instead (possibly with "copy on write" if some subroutine also needs to modify the string and pass the modified version further down), etc. — Ped7g, May 19 '18 at 07:19
@PeterCordes's answer is exactly what I wanted. I appreciate the reply and the information you provided. — paxbun, May 19 '18 at 12:48
Also, thanks to @Ped7g for taking the time to reply to my question. — paxbun, May 19 '18 at 12:49
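As a small illustration of passing a pointer instead of copying, the C++ sketch below has a read-only subroutine take `const char*`. The function `count_char` is a hypothetical consumer invented for this example; the point is only that the caller's single copy of the string is shared and no strcpy is needed to pass the argument.

```cpp
#include <cstddef>

// Hypothetical read-only consumer: it receives a pointer to the caller's
// string, so no bytes are copied to pass the argument.
std::size_t count_char(const char* s, char target) {
    std::size_t n = 0;
    for (; *s != '\0'; ++s) {
        if (*s == target) ++n;
    }
    return n;
}
```

Any number of such subroutines can read the same buffer; a copy is only needed once one of them has to modify the string.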
