1

I took a look at the code behind memcpy and other functions (memset, memmove, ...) and it seems to be a lot of code, and a lot of it is assembly.

Other Stack Overflow questions on this topic mention that one reason may be that it contains different code for different CPU architectures.

I have personally written my own memcpy/memset functions in very few lines of C++, and over 1 million iterations, timed with chrono, I consistently get better times.

So the question is, why did the programmers not just write the code in C/C++ and let the compiler optimize it as it sees fit? Why so much assembly code?

Jonathan Leffler
Hjkl
    You are talking about the implementation. Please name the OS and the library with version. – Thomas Sablik Jun 15 '20 at 14:14
  • 1
    It depends on the platform. On some platform hand optimized assembly is better than compiler generated code for special functions such as `memcpy`, `memcmp`, `strcpy` and other similar functions. – Jabberwocky Jun 15 '20 at 14:14
  • 2
    Fundamentally, because modern processors have instructions that are not accurately reflected by the abstract machine in C++ (or C). – Jonathan Leffler Jun 15 '20 at 14:14
  • 2
    I find it hard to believe that a trivial implementation of memcpy is performing faster than a tailored, optimized, expert-tested implementation. – bolov Jun 15 '20 at 14:15
  • 3
    Do note getting micro benchmarks correct is very tricky. Often a compiler will see you are doing a memcpy and just optimize to that. – NathanOliver Jun 15 '20 at 14:15
  • I used VS 2019 on the latest update of Windows 10. The function is from string.h – Hjkl Jun 15 '20 at 14:20
  • The key is that standard library *implementations* do not need to be portable. So they can rely on tricks based on the specific architecture they target. Unless the compiler is programmed to optimize with those specific tricks, a good way to implement them is with assembly. – François Andrieux Jun 15 '20 at 14:21
  • 2
    Re “I consistently get better times”: Did you measure with aligned input and output buffers? Did you measure with an aligned input and unaligned output? Did you measure with an unaligned input and an unaligned output? Did you measure with unaligned input and output? Did you measure various combinations of offsets relative to alignment? Did you measure short copies? Did you measure long copies? Did you measure on machines with AVX-512? Did you measure on machines without AVX-512? With AVX-2? Other processor models? – Eric Postpischil Jun 15 '20 at 14:23
  • @NathanOliver you mean it will replace my code with just a call to memcpy? I did not think such a thing was possible, optimizing by changing a piece of code into an entirely different function. I have checked in IDA, and there was no such change. Also disabled all optimizations and the difference is still a few milliseconds in favor of plain C++ – Hjkl Jun 15 '20 at 14:24
  • @EricPostpischil No, bro – Hjkl Jun 15 '20 at 14:25
  • @Hjkl Optimization can do literally anything as long as the observed behavior (as defined by the language specifications) is unchanged. If the compiler sees what you do is equivalent to a `memcpy` it very well can change your code to use `memcpy`. See [The as-if rule](https://en.cppreference.com/w/cpp/language/as_if). – François Andrieux Jun 15 '20 at 14:25
  • To expand on @EricPostpischil's questions: Did you remember to enable compiler optimisations? Did you measure the variance of measurements? Is the measured difference significant in relation to the variance? Are you sure that you're not including anything extra in the measurement un-equally? – eerorika Jun 15 '20 at 14:28
  • @Hjkl C++ has the as-if rule. Basically, as long as the optimization does the same observable effect, the compiler is allowed to do what it wants. Also note that a benchmark with optimizations turned off is generally meaningless. You really should only compare optimized code because of the as-if rule. – NathanOliver Jun 15 '20 at 14:28
  • 1
    gcc and clang turning a loop into a call to memcpy: https://godbolt.org/z/2SoKCr – bolov Jun 15 '20 at 14:30
  • Well, I suppose if a compiler were to do that to my piece of code, it would not be quicker, since I'd be comparing two matching calls. It might also just be an isolated result on my specific machine, with its current load and OS version, and may perform much worse on any other computer for all I know – Hjkl Jun 15 '20 at 14:33
  • I would expect that `memcpy` and the other functions are written in assembly to take advantage of specialized processor instructions, especially the block read and write of memory. – Thomas Matthews Jun 15 '20 at 17:00
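The godbolt link in bolov's comment demonstrates the substitution with a loop along these lines (a representative sketch; the exact code behind the link is not reproduced here):

```cpp
#include <cstddef>

// With -O2, gcc and clang recognize this pattern and emit a single call to
// memcpy: the as-if rule lets them substitute any code with the same
// observable behavior.
void copy_loop(char* dst, const char* src, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = src[i];
}
```

This is exactly why benchmarking a hand-rolled copy against `std::memcpy` can silently end up comparing `memcpy` with itself.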

5 Answers

2

It's technically impossible to write memcpy in standard C++ and C as you have to rely on undefined constructs. The same is true for other standard library functions; memset and malloc are two other examples.

But that's not the only reason: a C and C++ standard library implementation is, these days, so closely coupled with a particular compiler that the library writers can take all sorts of liberties that you, as a consumer, cannot. isupper, toupper, &c. stand out as good examples where a particular character encoding can be assumed.

Another good reason is that expertly handcrafted assembly can be difficult to beat for performance.
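For concreteness, the kind of "obvious" implementation under discussion copies through unsigned char, the type the standard lets you use to inspect object representations. This is an illustrative sketch, not any library's actual code, and whether even this is fully well-defined in standard C++ is what the question linked in the comments debates:

```cpp
#include <cstddef>

// Hypothetical byte-wise copy through unsigned char. Even this relies on
// pointer arithmetic over an object's representation, which is the part
// whose standard-C++ status is contested.
void* byte_memcpy(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < n; ++i)
        d[i] = s[i];
    return dst;
}
```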

Bathsheba
  • 1
    How does implementing `memcpy` using `unsigned char` rely on undefined constructs? – Eric Postpischil Jun 15 '20 at 14:33
  • @EricPostpischil: See https://stackoverflow.com/questions/62329008/is-it-ub-to-access-a-member-by-casting-an-object-pointer-to-char-then-doing. (For other readers, the rules differ for C.) – Bathsheba Jun 15 '20 at 14:40
2

The idea that "it's pointless to rewrite it in assembly" is a myth. A more accurate way to put it is that few programmers have the skill required to beat the compiler. But they do exist, especially among those who develop compilers.

klutt
  • I never said it's pointless, I agree some could optimize their code better than a compiler does, it's just not what my results pointed to – Hjkl Jun 15 '20 at 14:21
  • @Hjkl Yes, but many people think that it's pointless. It's a common myth. – klutt Jun 15 '20 at 14:22
  • @Hjkl It is overwhelmingly more likely that there is a problem with the benchmark you used than the hand-optimized `memcpy` implementation that comes with your standard library was slower than your home brew. Please share how you came to these results. – François Andrieux Jun 15 '20 at 14:22
  • Definitely not pointless, it just takes a lot of expertise. And folks with that kind of expertise are the ones making the compiler's optimizer do amazing things. Beating the compiler at optimization is quite a challenge, but not impossible. In my project, developing highly optimized routines takes months, compared to writing the same routine in straightforward C++, which takes hours. Is it worth the cost? Yes, sometimes. – Eljay Jun 15 '20 at 15:07
1
  1. The compiler usually generates some unnecessary code (compared to hand-written assembly) even at full optimization. This wastes memory, which is especially bad on embedded systems, and can reduce performance.

  2. Are you sure your custom code is complete and flawless? I don't think so; when you write assembly you have full control over everything, but when you compile code there is a possibility that the compiler generates something you don't want (and it's your fault, not the compiler's).

  3. It's almost impossible for a compiler to generate code that is as complete as hand-written assembly and smaller at the same time.

  4. As mentioned in some comments, it also depends on platform.

AmirSina Mashayekh
1

memcpy and memset, as well as other functions, are written in assembly to take advantage of processor-specific instructions.

For example, the ARM processor has a load-multiple instruction (LDM) that can load several registers from successive memory locations with one instruction, and a matching store-multiple instruction (STM) that stores several registers to successive locations. Intel x86 has block read and write instructions such as REP MOVS.

Assembly language also allows copying four 8-bit bytes at a time using a single 32-bit register.

Some processors allow for conditional execution of instructions, which helps when unrolling loops.

I've written optimized memcpy and memset functions for various processors. I've also spent a lot of time arguing (discussing) the "best" C and C++ implementations. It's a little difficult, in C or C++, to get the compiler to use the processor instructions you want it to.
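The four-bytes-at-a-time idea can be sketched in C++. Here `std::memcpy` is used for the word-sized loads and stores so the sketch stays well-defined for unaligned pointers; a hand-written assembly version would simply issue 32-bit load/store (or LDM/STM) instructions directly. This is an illustrative sketch, not the code of any real library:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Copies four bytes at a time through a 32-bit register, then the
// remaining 0-3 bytes one at a time.
void* word_memcpy(void* dst, const void* src, std::size_t n) {
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    while (n >= 4) {
        std::uint32_t w;
        std::memcpy(&w, s, 4);   // one 32-bit load
        std::memcpy(d, &w, 4);   // one 32-bit store
        s += 4; d += 4; n -= 4;
    }
    while (n--) *d++ = *s++;     // tail bytes
    return dst;
}
```

Real library versions go much further: aligning the destination first, using the widest registers available (SIMD), and special-casing short copies.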

Thomas Matthews
0

> Why did the programmers not just write the code in C/C++?

We aren't mind readers. We don't even know what they wrote. If you need an authoritative answer, then you should ask the programmers that wrote the code.

But we can hypothesise that they wrote what they did because it was fast and did the right thing.

eerorika