I took a look at the parts of the code behind memcpy and other functions (memset, memmove, ...) and it seems to be a lot, and a lot of assembly code.
Other stackoverflow questions on this topic mention that a reason for that may be because it contains different code for different CPU architectures.
I have personally written my own memcpy/memset functions with very few lines of C++ code and in 1 million iterations with time measured with chrono, I consistently get better times.
So the question is, why did the programmers not just write the code in C/C++ and let the compiler interpret and optimize it how it thinks is best? Why so much assembly code?