-1

What is the most efficient way (in benchmark terms) for the CPU to copy a string?

I am new to C and I am currently copying a string like this:

    #include <stdio.h>

    int main(void)
    {
        char a[] = "copy me";
        char b[sizeof(a)];
        // sizeof(a) includes the terminating '\0', so the copy gets it too
        for (size_t i = 0; i < sizeof(a); i++) {
            b[i] = a[i];
        }
        printf("%s", b); // copy me
    }

Here is another alternative; a while loop is supposedly a little faster than a for loop (from what I have heard):

    #include <stdio.h>

    char a[] = "copy me";
    char b[sizeof(a)];

    void copyAString(char *dest, const char *src)
    {
        // copies every byte up to and including the terminating '\0'
        while ((*dest++ = *src++) != '\0')
            ;
    }

    int main(void)
    {
        copyAString(b, a);
        printf("%s", b); // copy me
    }
dn70a
  • 2
    for a compile-time-constant size, almost always `memcpy`. Compilers will inline it when called with a small fixed size. Of course, optimizing compilers will also recognize this copy loop and replace it with an actual call to memcpy or an inline expansion of it, regardless of how you do the array indexing. This example is too small and too simplistic to actually be usable as a benchmark, though. [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) – Peter Cordes Jan 20 '22 at 22:04
  • 4
    re: your edit: the 2nd way is `strcpy` for an implicit-length string. That's slower because it has to search for the terminating 0 byte, if it wasn't known at compile time after inlining and unrolling the loop. (If you're lucky it will optimize the loop to a call to `strcpy` in libc, which uses hand-written asm to do it efficiently as it goes, especially on ISAs like x86 where SIMD can help.) – Peter Cordes Jan 20 '22 at 22:08
  • @Peter Cordes alright so basically memcpy() is the way to go – dn70a Jan 20 '22 at 22:09
  • 2
    The efficiency of a while loop vs. a for loop is in the "sunk cost" category -- the savings won't vary with the string length. As Peter Cordes said, memcpy() is tough to improve on, but chances are good your compiler will use it where it can (even when you don't call it explicitly). If you do call memcpy() directly, though, make sure to include the null terminator. – mzimmers Jan 20 '22 at 22:09
  • 3
    @mzimmers: A string literal as an array initializer does include a terminating 0 byte. And `sizeof()` is the whole size of the array, including it. So the first example with `char a[]="copy me";` does copy the terminator, just like the strcpy version. – Peter Cordes Jan 20 '22 at 22:10
  • 1
    @PeterCordes: no argument. I wasn't referring to this example particularly; just something that one needs to remember when using memcpy on strings. – mzimmers Jan 20 '22 at 22:12
  • Closed this as a duplicate of several existing questions, because it's just re-asking the same things those are answering. Speed depends on how it compiles to asm, not what the source looks like. (Although it can matter a lot whether the logic is identical for two ways of writing things. e.g. some compilers fail to take advantage of `int` signed-overflow being UB to optimize array-index loops into asm using pointers. But modern compilers do: [Efficiency: arrays vs pointers](https://stackoverflow.com/q/2305770)). If you have a much more specific question, feel free to ask it. – Peter Cordes Jan 20 '22 at 22:16
  • @ Jerry Jeremiah - wow great link thank you Jerry – dn70a Jan 20 '22 at 22:42
  • 1
    @PeterCordes `memcpy` is useless if you want to copy a string as you do not know the string size. – 0___________ Jan 20 '22 at 22:53
  • @JerryJeremiah the while loop copies the null character, and that is essential for the string copy. If you do not copy it, the destination will not be null-terminated, and any use of it as a C string will invoke UB. In a string copy **always** at least one character (the terminator) has to be copied. – 0___________ Jan 20 '22 at 22:57
  • @JerryJeremiah the `for` loop function is invalid!!!! – 0___________ Jan 20 '22 at 22:58
  • @dn70a do not look at JerryJeremiah's `for`-loop function as it is invalid. It does not copy the string correctly because it does not null-terminate the destination string. – 0___________ Jan 20 '22 at 22:59
  • @0___________: Look at the actual code in the question. The first block *does* know the string size, and uses it as a loop bound. If this question is supposed to be about copying a compile-time-constant string somewhere, you definitely know the size. Otherwise, it depends on the use-case. In many you can arrange to know the string size instead of having to re-discover it as you copy, allowing an efficient `memcpy`. – Peter Cordes Jan 21 '22 at 00:22

3 Answers

3

Don't write your own copy loops when you can use a standard function like memcpy (when the length is known) or strcpy (when it isn't).

Modern compilers treat these as "builtin" functions, so for constant sizes can expand them to a few asm instructions instead of actually setting up a call to the library implementation, which would have to branch on the size and so on. So if you're avoiding memcpy because of the overhead of a library function call for a short copy, don't worry, there won't be one if the length is a compile-time constant.
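
For example, a minimal sketch (the exact expansion depends on the compiler, flags, and target):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char a[] = "copy me";
        char b[sizeof a];
        // The size is a compile-time constant (8 bytes, terminator included),
        // so an optimizing compiler can expand this to a single 8-byte
        // load/store pair instead of emitting a call into libc.
        memcpy(b, a, sizeof a);
        printf("%s\n", b);
    }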

But even in the unknown / runtime-variable-length cases, the library functions will usually be optimized versions hand-written in asm that are much faster (especially for medium to large strings) than anything you can write in pure C, particularly for strcpy, where a portable C version can't read past the end of a buffer without undefined behaviour.

Your first block of code has a compile-time-constant size (you were able to use sizeof instead of strlen). Your copy loop will actually get recognized by modern compilers as a fixed-size copy, and (if large) turned into an actual call to memcpy, otherwise usually optimized similarly.

It doesn't matter how you do the array indexing; optimizing compilers can see through size_t indices or pointers and make good asm for the target platform. See this and this Q&A for examples of how code actually compiles. Remember that CPUs run asm, not C directly.
This example is too small and too simplistic to actually be usable as a benchmark, though. See Idiomatic way of performance evaluation?
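
As a sketch of why the source form doesn't matter much, both of the following (hypothetical helper names) typically compile to the same asm at -O2, and gcc/clang will often turn either one into an actual memcpy call:

    #include <stddef.h>

    void copy_idx(char *dst, const char *src, size_t n)
    {
        for (size_t i = 0; i < n; i++)   // array-index form
            dst[i] = src[i];
    }

    void copy_ptr(char *dst, const char *src, size_t n)
    {
        const char *end = src + n;
        while (src != end)               // pointer-increment form
            *dst++ = *src++;
    }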


Your 2nd way is equivalent to strcpy for an implicit-length string. That's slower because it has to search for the terminating 0 byte, if it wasn't known at compile time after inlining and unrolling the loop.

Especially if you do it by hand like this for non-constant strings: modern gcc/clang are unable to auto-vectorize loops where the program can't calculate the trip-count ahead of the first iteration, i.e. they fail at search loops like strlen and strcpy.

If you actually just call strcpy(dst, src), the compiler will either expand it inline in some efficient way, or emit an actual call to the library function. The libc function uses hand-written asm to do it efficiently as it goes, especially on ISAs like x86 where SIMD can help. For example for x86-64, glibc's AVX2 version (https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcpy-avx2.S.html) should be able to copy 32 bytes per clock cycle for medium-sized copies with source and destination hot in cache, on mainstream CPUs like Zen2 and Skylake.

It seems modern GCC/clang do not recognize this pattern as strcpy the way they recognize memcpy-equivalent loops, so if you want efficient copying for unknown-size C strings, you need to use actual strcpy. (Or better, stpcpy to get a pointer to the end, so you know the string length afterwards, allowing you to use explicit-length stuff instead of the next function also having to scan the string for length.)
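
For instance, a sketch of that pattern (copy_and_measure is a hypothetical helper name; stpcpy is POSIX.1-2008 and also in glibc):

    #include <string.h>

    // Copy src into dst and return the length, so later code can use
    // explicit-length functions like memcpy instead of re-scanning.
    size_t copy_and_measure(char *dst, const char *src)
    {
        char *end = stpcpy(dst, src);  // points at dst's terminating '\0'
        return (size_t)(end - dst);
    }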

Writing it yourself with one char at a time will end up using byte load/store instructions, so can go at most 1 byte per clock cycle. (Or close to 2 on Ice Lake, probably bottlenecked on the 5-wide front-end for the load / macro-fused test/jz / store.) So it's a disaster for medium to large copies with runtime-variable source where the compiler can't remove the loop.

(https://agner.org/optimize/ for performance of x86 CPUs. Other architectures are broadly similar, except for how useful SIMD is for strcpy. ISAs without x86's efficient SIMD->integer ability to branch on SIMD compare results may need to use general-purpose integer bithacks like in Why does glibc's strlen need to be so complicated to run quickly? - but note that's glibc's portable C fallback, only used on a few platforms where nobody's written hand-tuned asm.)
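
The heart of those bithacks is a test for a zero byte anywhere in a word; a minimal sketch of the classic check (shown for 32-bit chunks):

    #include <stdint.h>

    // Nonzero iff some byte of v is 0x00. The subtraction sets a byte's
    // high bit when that byte borrows (i.e. was zero), and & ~v discards
    // bytes whose high bit was already set in v. Same trick as glibc's
    // portable strlen fallback.
    static int has_zero_byte(uint32_t v)
    {
        return ((v - 0x01010101u) & ~v & 0x80808080u) != 0;
    }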

@0___________ claims their unrolled char-at-a-time loop is faster than glibc strcpy for strings of 1024 chars, but that's implausible and probably the result of faulty benchmark methodology. (Like compiler optimization defeating the benchmark, or page fault overhead or lazy dynamic linking for libc strcpy.)



Peter Cordes
  • this is by far the best explanation i have seen so far – dn70a Jan 21 '22 at 21:29
  • 1
    @dn70a you should look up memmove https://stackoverflow.com/questions/28623895/why-is-memmove-faster-than-memcpy –  Jan 21 '22 at 21:32
  • @dinolin: You mean as an example of how cache effects for the source and destination are highly relevant to micro-benchmarking this? [Mats' answer](https://stackoverflow.com/questions/28623895/why-is-memmove-faster-than-memcpy/28624936#28624936) on that Q&A explains why that specific microbenchmark finds memmove faster, because it's overlapping src and dst, while the memcpy test wasn't. Or something like that; I didn't dig up the full details, but it looks a lot like a case of [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) - cache locality matters. – Peter Cordes Jan 21 '22 at 21:38
  • 2
    @PeterCordes I am not an expert in this topic, but what I know is that memcpy is not the fastest way, especially if you call it many times. `memmove()` is similar to `memcpy()` as it also copies data from a source to a destination, but `memcpy()` leads to problems when source and destination addresses overlap, as `memcpy()` simply copies data one by one from one location to another. –  Jan 21 '22 at 21:50
  • 1
    It's true that memcpy() is going to be faster than strcpy() when copying the same number of bytes. The only time strcpy() or any of its “safe” equivalents would outperform memcpy() would be when the maximum allowable size of a string is much greater than its actual size. –  Jan 21 '22 at 21:50
  • 1
    @dinolin: If `memcpy` isn't the fastest way to copy N bytes between non-overlapping buffers, your C implementation has a performance bug. (Or you've found a case where your libc `memcpy` was tuned to favour a case other than the one you're using. e.g. spending a lot of time branching to optimally handle very large and/or misaligned copies, but you're only using it for small aligned copies so something simpler is fine and does less work before actually copying). – Peter Cordes Jan 21 '22 at 22:09
  • @dinolin: But anyway, `memmove` shouldn't ever be faster than `memcpy`; it does the same thing but with extra checking. If it is, then the compiler should just use memmove instead of memcpy, since it works in a superset of cases where memcpy is safe. (For example, glibc used to do this: memcpy was a synonym for memmove. So it did the extra work to check for "correct" handling of overlap even if you called it as `memcpy`. When glibc changed this, some buggy code that depended on memcpy for overlapping copies broke: https://www.win.tue.nl/~aeb/linux/misc/gcc-semibug.html) – Peter Cordes Jan 21 '22 at 22:12
  • 1
    @dinolin: When you're talking about `strcpy` beating `memcpy` for large buffers, you're talking about a naive / brute-force use of `memcpy(dst, src, 1024)` that always copies all say 1024 bytes of a `char src[1024]` input buffer, instead of just up to the terminating zero? Yeah, of course if you're doing different amounts of copying. But that would be silly; you'd use memcpy if you used a function like `ssize_t size = read(fd, buf, bufsiz);` so you already had the actual size in a variable. (Although yes, for small sizes like 32 or 64 bytes, brute-force copy everything is actually good.) – Peter Cordes Jan 21 '22 at 22:16
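
To illustrate the overlap point from the comments above, a minimal sketch where memmove is required and memcpy would be undefined behaviour:

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char buf[] = "abcdef";
        // Delete the first character by shifting the rest left by one.
        // Source and destination overlap, so memmove is the correct call;
        // memcpy on overlapping buffers is undefined behaviour.
        memmove(buf, buf + 1, strlen(buf + 1) + 1);  // +1 copies the '\0'
        printf("%s\n", buf);  // bcdef
    }
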
1

This probably won't fit your use-case, but I found this code to be VASTLY faster than memcpy when copying an image array (and I'm talking more than 10-fold). There are probably a lot of people out there who will benefit from this, so I'm posting it here:

// AVX intrinsics: <immintrin.h> is the portable header (MSVC's <intrin.h>
// also works); the 256-bit NT load needs AVX2, so compile with e.g. -mavx2
#include <immintrin.h>
#include <cassert>
#include <cstdint>

void fastMemcpy(void* Dest, void* Source, unsigned int nBytes)
{
    // the copy is done in whole 32-byte vectors from 32-byte-aligned buffers
    assert(nBytes % 32 == 0);
    assert((intptr_t(Dest) & 31) == 0);
    assert((intptr_t(Source) & 31) == 0);
    const __m256i* pSrc = reinterpret_cast<const __m256i*>(Source);
    __m256i* pDest = reinterpret_cast<__m256i*>(Dest);
    int64_t nVects = nBytes / sizeof(*pSrc);
    for (; nVects > 0; nVects--, pSrc++, pDest++)
    {
        // NT load (only special on write-combining memory, e.g. video RAM)
        const __m256i loaded = _mm256_stream_load_si256(pSrc);
        // NT store: bypasses the cache on ordinary write-back memory
        _mm256_stream_si256(pDest, loaded);
    }
    _mm_sfence();  // make the NT stores globally visible before returning
}

This makes use of intrinsics, so include <intrin.h> (MSVC) or the portable <immintrin.h>. The stream instructions bypass the CPU's cache and seem to make a big difference in speed. For bigger arrays you can also use multiple threads, which improves performance further.
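
A hedged usage sketch of the function above: both buffers must be 32-byte aligned and the size a multiple of 32 for the asserts to hold (aligned_alloc is C11/C++17; the image size is borrowed from the comments below):

    #include <stdlib.h>

    int main(void)
    {
        // 1600 x 1000 x 4 floats = ~25.6 MB, a multiple of 32 bytes
        const unsigned nBytes = 1600u * 1000u * 4u * sizeof(float);
        void* src = aligned_alloc(32, nBytes);
        void* dst = aligned_alloc(32, nBytes);
        if (src && dst)
        {
            // ... fill src with image data ...
            fastMemcpy(dst, src, nBytes);
        }
        free(src);
        free(dst);
    }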

Paul Aner
  • This is *only* good for large amounts of data, at least several KiB, and only if you're not going to read it again before it would have been evicted from cache anyway. (So more usually used for data at least as big as L3 cache, although it could make sense for multiple smaller copies that you aren't going to re-read either. Avoiding evicting *other* stuff is also valuable if you aren't going to re-read any time soon; that's what the Non-Temporal NT hint means.) If you did this for a small copy you *are* going to re-read it, you'd force that reader to miss in cache. – Peter Cordes Jan 24 '22 at 20:47
  • What hardware did you test on? I'm surprised this is 10x faster than memcpy. 3x sounds plausible if you're lucky, but 10x sounds more like measurement / experimental error. See [Enhanced REP MOVSB for memcpy](https://stackoverflow.com/q/43343231) re: no-RFO cache-write protocol (like NT stores use as well), and how the difference between Intel server chips (higher latency interconnect) makes a difference here. – Peter Cordes Jan 24 '22 at 20:51
  • Multi-threading memcpy on a "client" chip (like Skylake desktop or laptop) makes nearly no diff; a single core can nearly saturate DRAM controllers. But that's *very* different on big Xeons, especially Skylake and later with their higher latency mesh interconnect and lower single-threaded max bandwidth, where aggregate B/W scales with # of cores (or to put another way, you need all the cores to max out DRAM). [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) – Peter Cordes Jan 24 '22 at 20:53
  • I have an Intel Core i5-8259U and I'm copying an image (for OpenGL) with size 1600x1000x4 floats. I don't have exact timings. I used this in a Mandelbrot calculation via Compute Shader in OpenGL. The time for the calculations (including the copy) went down from 80ms to 10ms just by using fastMemcpy. So the Compute Shader takes something <10ms. That means memcpy took something >70ms and fastMemcpy <10ms. I doubt that the Compute Shader is done in less than 3ms, but that would make 70ms/(10 - 3)ms, so a 10-fold increase. – Paul Aner Jan 24 '22 at 21:17
  • Are you copying to or from device (video) memory *directly* with this? How bad is the standard-library memcpy? I guess you're on Windows with MSVC because you used `intrin.h` instead of the portable Intel header name `immintrin.h`, but I'd be surprised if their standard memcpy was too terrible for normal mem-mem copies. If you microbenchmark *just* this or memcpy in a loop, after warm-up to factor out page faults, this is probably the same speed, or a bit faster (like 1.5x or 2x) if MSVC's `memcpy` doesn't use NT stores for large copies. – Peter Cordes Jan 24 '22 at 21:24
  • As for multi-threading, I didn't test that (the code was fast enough for me). But the guy who wrote this code claimed (I think) a 2.3x increase with multithreading. As for cache eviction - I guess you're right. But in my case, 1600x1000x4 floats make ~25.6 MB, so it wouldn't fit anyway. It might be worth a test with a smaller image to see if the non-streaming versions are faster here, though... – Paul Aner Jan 24 '22 at 21:25
  • In this case from device (integrated graphics). I don't have the link to the original article right now. I think the guy claimed something like 3-4x and then another 2.3x for multi-threading. And he said the stream versions of the intrinsics make this substantially faster (just the copying itself)... – Paul Aner Jan 24 '22 at 21:29
  • Copying to OpenGL may be different from normal copies even for small sizes; if the destination memory has WC (write-combining) memory-type instead of the normal WB (write-back) cacheability attribute, that makes alignment and 256-bit stores *much* more important. And if the data is going to be read by GPU or iGPU, not this or another CPU core, again that's different. (I don't know a lot about OpenGL performance details, and whether the driver is likely to map WC device memory directly into your process) – Peter Cordes Jan 24 '22 at 21:32
-1

Generally the most efficient way of copying a string is to manually unroll the loop to minimize the number of operations needed.

Example:

char *mystrcpy(char *restrict dest, const char * restrict src)
{
    char *saveddest = dest;

    // 16-way manually unrolled copy; exits after copying the terminating '\0'
    while(1)
    {
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
        if(!(*dest++ = *src++)) break;
    }
    return saveddest;
}

https://godbolt.org/z/q3vYeWzab

A very similar approach is used by the glibc implementation.
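
For completeness, a quick usage sketch of the function above:

    #include <stdio.h>

    int main(void)
    {
        char src[] = "copy me";
        char dst[sizeof src];  // 8 bytes, room for the terminator
        mystrcpy(dst, src);    // returns dest, like strcpy
        printf("%s\n", dst);   // copy me
    }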

0___________
  • This won't be faster than calling the actual `strcpy`! And no, the glibc pure-C fallback implementation doesn't do this, it uses something similar to [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895) on platforms where there isn't a hand-written asm implementation available. (e.g. on x86 that allows checking 16 bytes in parallel, once reaching an alignment boundary.) Oh, actually it seems the portable-fallback strcpy (https://code.woboq.org/userspace/glibc/string/strcpy.c.html) is just `memcpy(dst,src, strlen(src)+1);` – Peter Cordes Jan 21 '22 at 00:16
  • 1
    This answer hardly seemed worth reopening the question for and discarding the list of duplicates I had found (https://stackoverflow.com/posts/70793972/revisions); the question doesn't specify avoiding existing library functions (which get treated as builtins by compilers), so anything you can do by hand in pure C is going to be worse on most targets. At best equal, or in case of a missed optimization bug possibly a bit better on some slow platform that can't copy more than a byte at once. – Peter Cordes Jan 21 '22 at 00:20
  • 1
    Generally using obsolete compilers? Maybe. Generally on modern compilers? Not at all. A modern compiler's optimizer actually recognizes similar code as an unrolled copy, and replaces it by memcpy or strcpy as appropriate. So the whole unrolling thing is pointless. You're just making life hard for the optimizer. On modern architectures, if you translated the unrolled code literally to assembly, it'd be a pessimization: it'd perform much worse than the architecture can do it. – Kuba hasn't forgotten Monica Jan 21 '22 at 00:28
  • @Kubahasn'tforgottenMonica please show me an example where this code will be replaced with `memcpy`. I bet that you will not be able to find any. – 0___________ Jan 21 '22 at 00:47
  • 2
    @PeterCordes `memcpy(a,b,strlen(b))` definitely will not be efficient. 2. See the assembly implementations, then comment. 3. Most of the dupes were useless when the length of the string was known at compile time, etc. 4. `strlen` is a completely different story; we are discussing `strcpy`. 5. First prove and check your claims, then DV. – 0___________ Jan 21 '22 at 00:53
  • 1
    @0___________ If you're talking about the asm that glibc actually uses on mainstream platforms, yes, of course it's not as efficient. That's why it's only the portable fallback, used on only a few platforms (like maybe MIPS), not on x86 where it uses https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/strcpy-avx2.S.html if AVX2 is available. (dispatched by https://code.woboq.org/userspace/glibc/sysdeps/x86_64/multiarch/ifunc-strcpy.h.html). – Peter Cordes Jan 21 '22 at 01:23
  • 4. `strlen` is almost exactly the same problem as `strcpy`; the hard part is loading a chunk and checking it for containing any zero bytes. Storing it when you find it doesn't have any is a minor difference. You can't invent stores or store past the end of the destination buffer, so you do need to actually copy the final few bytes in the last chunk instead of just find the zero position, but that's minor. – Peter Cordes Jan 21 '22 at 01:27
  • 5. For tiny fixed strings (especially exactly 8 bytes), it's a tossup after inlining. Which GCC chooses to do if we make yours `static inline` (otherwise it actually calls it with runtime-variable args, which is clearly way worse). https://godbolt.org/z/Wcczf6nnb. On some targets, GCC copies from .rodata for one and uses immediates for the other, but vice versa on other targets. It certainly doesn't recognize it as an strcpy loop and replace it, though, so @Kubahasn'tforgottenMonica's comment is overly optimistic about compilers. – Peter Cordes Jan 21 '22 at 01:37
  • For runtime-variable strings that aren't tiny, glibc hand-written asm is clearly going to do better than the scalar loop asm gcc/clang create from your source. On Zen, and Intel before Ice Lake, store throughput is at best 1/clock for any width. So your way is at best copying 1 byte per cycle, or 4GB/s on a 4GHz CPU. That's super slow; strcpy is way faster than that, close to memcpy speed; for strings of a few hundred bytes or more it's probably copying 16 or 32 bytes per cycle. You might invent a use-case where this code wins, e.g. a 3-byte string or something where it can't / doesn't inline – Peter Cordes Jan 21 '22 at 01:42
  • @PeterCordes Just because someone uses a naive method does not make it a good method. You did not prove that my answer is not efficient; you only expressed your frustration because I reopened the question. – 0___________ Jan 21 '22 at 10:53
  • I've proved it can't run faster than 1 byte per cycle on mainstream CPUs, unless compilers auto-vectorize it. But gcc/clang won't because they can only auto-vectorize loops when the trip-count can be calculated before entering the loop. (i.e. not for data-dependent loop-exit conditions like strlen or strcpy.) After a little bit of startup overhead, glibc's AVX2 `strcpy` runs way faster than that, close to 32 bytes per cycle if data is hot in L1d cache. The only time this doesn't suck is when it optimizes away for constant strings, or for very short strings, like length 4 or less. – Peter Cordes Jan 21 '22 at 16:46
  • @PeterCordes I do not see the proof. OP was asking for a portable C solution; AVX is not. BTW, very often vector versions are slower than non-vector ones. – 0___________ Jan 21 '22 at 16:49
  • Then you forgot to look at the previous comment combined with my last one. Mainstream CPUs can do at most 2 stores per clock cycle (Intel Ice Lake), with most only doing 1/clock. Since the asm you get from compiling this stores each byte separately, then at best it can sustain 1 or maybe 2 stores per clock. It's well known that glibc strcpy, and any decent implementation in asm for modern x86, will go much faster than that, after a bit of startup overhead, using 32-byte copies until it gets to the chunk containing the 0 byte. See glibc's AVX2 version I linked earlier. – Peter Cordes Jan 21 '22 at 16:54
  • Thus glibc is faster for len > some threshold, maybe between 6 and 12. – Peter Cordes Jan 21 '22 at 16:55
  • @PeterCordes where is the proof? Only your words. – 0___________ Jan 21 '22 at 16:55
  • 1
    You're the one proposing this new implementation for strcpy. Benchmark it yourself and show if / when it's faster than writing `strcpy(dst, src)`. I know it's only going to be for very short and/or constant strings, and showed as much. If you aren't going to believe the facts I cite, then there's no point me saying anything more. – Peter Cordes Jan 21 '22 at 16:58
  • 2
    @PeterCordes You avoid the topic. I never said that it will be faster than `strcpy`. The question was how to make the most efficient **own** implementation of `strcpy` in the C language. Your DV is revenge for my reopening the question you closed. – 0___________ Jan 21 '22 at 17:04
  • 3
    That's not what the question says, and doesn't seem to be how it was intended. Notice [the OP's comment](https://stackoverflow.com/questions/70793972/c-what-is-the-most-efficient-way-to-copying-a-string#comment125153431_70793972) *alright so basically memcpy() is the way to go*. They seem perfectly fine with standard library functions, which can take advantage of HW features specific to that platform. – Peter Cordes Jan 21 '22 at 17:28
  • 1
    If for some reason you can't do that, and you want to limit yourself to only portable ISO C that's strict-aliasing safe, then you're screwed, but this is probably about the best you can do. It's ok for very short strings, but for longer strings it falls far short of what you can do even with a scalar bithack to check chunks for a `0` byte. – Peter Cordes Jan 21 '22 at 17:29
  • 1
    Actually, you can use `memcpy(my_ulong, src, sizeof(my_ulong))` as a strict-aliasing-safe load, assuming it inlines, and use a bithack like in [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/q/57650895) (but with a store for chunks with no terminator). But that would still involve potentially reading past the end of the source string. Only within an aligned 4 or 8-byte chunk if you do it right, so [it can't fault on normal CPUs](https://stackoverflow.com/q/37800739), but is UB and could fault with byte granularity memory protection. – Peter Cordes Jan 21 '22 at 17:36
  • @PeterCordes tested. My function beats strcpy :) – 0___________ Jan 21 '22 at 17:43
  • 2
    For what test case, on what system? *You* haven't proved anything either. If you just test a single call to `strcpy` with lazy dynamic linking, you're measuring that as part of the cost of `strcpy`. IDK what other benchmark methodology problems you might have made, or if you're just testing with really short strings. – Peter Cordes Jan 21 '22 at 17:46
  • @PeterCordes Try it yourself. BTW -O3 is slower than -Os in this case because it tries to vectorize :D. Ubuntu, GCC 11.x, Intel i9, 64GB RAM. Strings up to 1024 characters long (did not have enough patience to test longer). Smaller strings were up to 10 times faster than `strcpy`. – 0___________ Jan 21 '22 at 17:48
  • I'm a bit curious what code you used to test this. Unless something inlines and optimizes away, or you're measuring page faults or other startup overhead for `strcpy`, there's no way this would be faster than glibc strcpy for size 1024. Anything over size 64 is just not plausible at all for this to be faster, except as a result of a mistake in microbenchmarking method that doesn't reflect the cost for a real program doing many small calls over its run-time. – Peter Cordes Jan 21 '22 at 21:33