
I am using GCC 12.1.0. I tend to always use memcpy() to copy blocks of memory rather than a hand-written loop, since I have always read that memcpy() should provide the most efficient implementation, as a compiler-specific builtin function.

But to my great disappointment, even using -O3, the following test function:

#include <cstring>

void copy(char *from, char *to, int size)
{
  std::memcpy(to, from, size);
}

always results in a jmp to the library memcpy() function, and NOT in some builtin inlined assembly as one would expect. This is IMHO very inefficient, possibly (to my knowledge) much less efficient than simply using a loop like

while (size--) *(to++) = *(from++);

because a jump to a function is always expensive, in particular when it concerns a very simple function which the compiler could efficiently inline!

Before somebody points out that this topic is already covered in "When __builtin_memcpy is replaced with libc's memcpy", I must specify that my problem is different. There it was simply replied that the compiler may generate a jmp to memcpy() when the copy size is not known at compile time. I allow myself to insist that, even though the size is unknown, a jmp is still much less efficient than any inlined code, so I am very puzzled as to why GCC insists on using a jmp even with an optimization setting as high as -O3.

Is there any particular flag I must specify explicitly for GCC to use the builtin memcpy, or at least an inlined loop? Why isn't -O3 enough?
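For contrast, here is a sketch of a fixed-size variant (the 32-byte length and the name copy32 are illustrative, not from the original code), which GCC at -O3 is generally able to expand inline instead of emitting a jmp to the library routine:

```cpp
#include <cstring>

// Illustrative fixed-size variant: with the length known at compile
// time, GCC at -O3 typically expands this copy into a couple of
// vector moves instead of jumping to the library memcpy().
void copy32(const char *from, char *to)
{
    std::memcpy(to, from, 32);
}
```

With a runtime size, as in the original copy(), the compiler cannot pick one fixed expansion up front, which is presumably part of why it falls back to the library routine.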

Edit: I changed the float* to char* in the sample code because one could easily point out that it was an error; however, that is irrelevant to my question.

Edit: I can confirm by inspecting the resulting asm that a jmp to clib's memcpy() is generated. Having worked extensively with optimized math functions and benchmarks, I can easily confirm that a jmp to a clib function is always taxing. I have always been taught there is an overhead involved with function calls.

elena
  • Probably because the compiler does not know the size and alignment at compile time and therefore has to use the more general routine. Try directly calling `std::memcpy` where all arguments are known at compile time and are compile-time constants. – Richard Critten Dec 08 '22 at 12:58
  • You're making a number of statements of what is "very efficient" or "much less efficient". Do you have evidence through testing or profiling to back up those claims? Or are you simply stating something you believe without evidence? – Peter Dec 08 '22 at 12:59
  • Why do you think that calling a function is expensive? And why do you think that a direct C++ implementation will be faster than calling a (possibly optimized) library routine? Did you benchmark anything? – Serge Ballesta Dec 08 '22 at 12:59
  • [Pro Tip] Programming is very complex, and predicting what optimizes to what is hard for a person to do. Instead of doing that, write code that is easy to understand and easy to maintain. Once you have done that, check the performance, and if it is not where you'd like, profile the code to find the actual slow points. This saves you time and gives you easier code to maintain, and that's where most programming time is spent anyway. – NathanOliver Dec 08 '22 at 13:03
  • At least the comparison between `memcpy` and your handwritten loop, which you claim to be more efficient, should be rather straightforward to carry out. You should really do that; it might already clear up some of your doubts. – 463035818_is_not_an_ai Dec 08 '22 at 13:16
  • Why do you think an unconditional jump to a known location costs anything at all? – molbdnilo Dec 08 '22 at 13:20
  • Your linked post seems exactly relevant to me; the size is unknown in your function, so the compiler can't know whether calling `memcpy` is going to be more efficient than inlining and unrolling. – Alan Birtles Dec 08 '22 at 13:21
  • For example, your code does get fully inlined when the size is known: https://godbolt.org/z/aEPPKa19f – Alan Birtles Dec 08 '22 at 13:27
  • A function call has a cost. Calling a function is more expensive than not calling a function, but as soon as you compare calling a function with doing something else, in most cases the tiny overhead from calling the function can be neglected. Your whole question seems to hinge on the false premise that a function call would be super expensive and should be avoided at all costs (you even say that **any** inline code would be cheaper than a function call, which clearly cannot be true). – 463035818_is_not_an_ai Dec 08 '22 at 13:27
  • @AlanBirtles Your demo is incorrect since you pass the number of elements (32) instead of the number of bytes (128) to `memcpy`. Anyway, the corrected demo shows the application of inlining as well: https://godbolt.org/z/T9oh5v93W. – Daniel Langr Dec 08 '22 at 13:33
  • And FWIW, `-Os` results in `rep movsb` but no `jmp`. – 463035818_is_not_an_ai Dec 08 '22 at 13:36
  • @DanielLangr Ah, I'd assumed that a function that took explicitly typed arguments would take a number of elements rather than a size. – Alan Birtles Dec 08 '22 at 13:40
  • @AlanBirtles You pass the incorrect number to `memcpy`. `memcpy` is a function from the C library; therefore, it doesn't know anything about the types of pointer arguments (it passes them through void pointer parameters): https://en.cppreference.com/w/cpp/string/byte/memcpy. You can easily check that by inspecting your assembly, since there are only 2 loads/stores with 16-byte XMM registers. – Daniel Langr Dec 08 '22 at 13:42
  • @DanielLangr Yep, I know; I assumed the copy function would deal with that. – Alan Birtles Dec 09 '22 at 07:01
  • I disagree with the decision of marking my question as a duplicate, even though I specified clearly that I was aware of the other apparently relevant question, but that it did NOT cover my case. There is really nothing I can change in my question to highlight the differences; I really think I was clear enough. I deem it quite nonsensical that a compiler is not inlining a simple copy loop. Would anybody ever be happy if a compiler made a jump to a hypothetical addf() function every time it has to sum two floats? My experience with benchmarking has always told me that a jmp *is* taxing in time-critical code. – elena Dec 09 '22 at 13:32
  • I can confirm that in one of our applications, replacing a loop by memcpy made the debug version 10 times faster, the release version 10 times slower! – Wyrzutek Dec 09 '22 at 14:05
  • Here is a [quick benchmark](https://quick-bench.com/q/_aCofeCQ3j_0L4Y0rMdZwLM69MU) that shows a call to `memcpy` being 19 times faster than a handwritten loop. You are welcome to modify it as you see fit and post the results as evidence for your claim. – n. m. could be an AI Dec 09 '22 at 14:19
  • @Wyrzutek But did you also check in the resulting asm whether memcpy was really implemented as a jump and not as inlined assembly? – elena Dec 10 '22 at 14:26
  • @n.m. If you are sure about your statement, you would have been welcome to reply "GCC does not inline memcpy() because a call to the clib's memcpy is always faster". Labeling my question as a duplicate is not correct. However, I think my doubt is legitimate enough: what 'magic' can the clib's memcpy function actually do internally to be faster than any explicit inlined hand-written copy loop, despite the theoretical overhead involved with a jmp? This is very counter-intuitive to me. – elena Dec 10 '22 at 14:32
  • I don't have an answer to your question. It cannot be answered because it is based on wrong assumptions and as such makes no sense. You are welcome to fix that and ask a question based on facts. The facts so far are: in a specific benchmark, a call to memcpy is about 19 times faster than **a particular naïve** hand-written inline loop. No more, no less. There is no established fact so far that would involve a phrase like "any inline loop" or anything like that. You have a tool to establish more facts. – n. m. could be an AI Dec 10 '22 at 16:02
  • @elena I didn't remember correctly and wrote bs, sorry. Went back checking, it was memcmp and not memcpy (and it was 1.5 times slower, not 10). No idea about the assembly, I suppose memcmp processes larger chunks than a loop, while the loop for copy does not have the luxury of early return. I guess the result also depends if the arrays are on the stack or the heap. – Wyrzutek Dec 12 '22 at 08:06
  • _"I allow myself to insist that, despite the size is unknown, a jmp is still much less efficient than any inlined code"_. I disagree with that. In general, jump to a highly tuned and optimized machine code can be in the end faster than compiler-generated machine code. That is the reason why performance-critical library/program parts are commonly written in assembler (such as locks, BLAS routines, etc.). For instance, you can always compile BLAS from the Fortran source code. But you will then very likely end up with a slower program compared to using some highly-tuned implementation. – Daniel Langr Dec 12 '22 at 10:06

0 Answers