I am using GCC 12.1.0. I tend to always use memcpy() to copy blocks of memory rather than a hand-written loop, since everything I have read says that memcpy(), being a compiler-recognized builtin, should provide the most efficient implementation.
But to my great disappointment, even with -O3, the following test function:
```cpp
#include <cstring>

void copy(char *from, char *to, int size)
{
    memcpy(to, from, size);
}
```
always results in a jmp to the library memcpy() function, and NOT in some inlined code as one would expect. This is IMHO very inefficient, possibly (to my knowledge) even less efficient than simply using a loop like
```cpp
while (size--) *to++ = *from++;
```
because a jump to an out-of-line function is always expensive, particularly for such a simple operation that the compiler could efficiently inline!
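For comparison, here is a self-contained version of that loop (the function name and signature are my own; like memcpy(), it assumes the two ranges do not overlap):

```cpp
#include <cstddef>

// Forward, byte-by-byte copy. Assumes the source and destination
// ranges do not overlap, matching memcpy()'s contract.
void copy_loop(const char *from, char *to, std::size_t size)
{
    while (size--) *to++ = *from++;
}
```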
Before somebody points me to the same topic already covered in When __builtin_memcpy is replaced with libc's memcpy, I must specify that my problem is different. The answer there was simply that the compiler may generate a jmp to memcpy() when the copy size is not known at compile time. I allow myself to insist that, even though the size is unknown, a jmp is still much less efficient than inlined code, so I am very puzzled why GCC insists on using a jmp even at an optimization level as high as -O3.
Is there any particular flag I must specify for GCC to use the builtin memcpy, or at least an inlined loop? Why isn't -O3 enough?
Edit: I changed the float* to char* in the sample code because one could easily point out that it was an error; however, that is irrelevant to my question.
Edit: I can confirm, by inspecting the resulting assembly, that a jmp to libc's memcpy() is generated. Having worked extensively with optimized math functions and benchmarks, I can easily confirm that a jmp to a libc function is always taxing; I have always been taught that function calls carry overhead.