Why are complicated memcpy/memset superior?

Question

When debugging, I frequently stepped into the handwritten assembly implementation of memcpy and memset. These are usually implemented using streaming instructions if available, loop unrolled, alignment optimized, etc... I also recently encountered this 'bug' due to memcpy optimization in glibc.

The question is: why can't the hardware manufacturers (Intel, AMD) optimize the specific case of

rep stos

and

rep movs

to be recognized as such, and do the fastest fill and copy as possible on their own architecture?

Cop out answer: Because they just don't, and as a result, nobody writes code like that, and therefore there's no reason for them to do so... (cycle continues) — Billy ONeal, Jan 13 '12 at 23:50
@BillyONeal: I don't think so. For every new feature they add, there's still no code written that uses it. — Yakov Galka, Jan 13 '12 at 23:57
Yes, but when a new feature is added, it gets added to show better performance in one area or another. Optimizing this doesn't make sense for CPU vendors because compilers won't emit code like it. — Billy ONeal, Jan 14 '12 at 00:06
@BillyONeal: in fact they do. The intrinsic version of memcpy on MSVC expands to exactly this code. — Yakov Galka, Jan 14 '12 at 00:09
CPU manufacturers started making a move towards a more RISC architecture. For that reason, software moves with it. I can't tell you why they made the move in that direction. — Mike Kwan, Jan 14 '12 at 00:09
@ybungalobill: MSVC has different intrinsic versions depending on the usage (e.g. is the size a constant multiple of 16 bytes) and the compile options. — Guy Sirton, Jan 14 '12 at 00:20
It doesn't matter much these days, the memory bus bandwidth is the real throttle. — Hans Passant, Jan 14 '12 at 00:23
@ybungalobill: The question title doesn't exactly match what you're asking in the body... Are you asking why are the complicated implementations superior or why can't chip designers make the string instructions work as fast the complicated implementations? Thesre are related but are slightly different questions. — Guy Sirton, Jan 14 '12 at 00:30
@HansPassant: I think the streaming loads/stores and prefetch instructions are still useful as they allow better control over the cache. Because of the memory bandwidth being a bottleneck you want to have control over what is in the cache to avoid needing to go out again to memory... So you may want to copy without polluting the cache. — Guy Sirton, Jan 14 '12 at 00:33
@GuySirton: the problem here is that when the CPU encounters a `rep stos` it *exactly knows what memory will be affected and what the side effects will be*. It can do a 100% correct prediction. SIMD is more general, and thus such prediction can't be done. For example when the `memcpy` enters the internal loop, the CPU can't know for sure that it will need the block of exactly N bytes starting at P, accessed sequentially once and only once, and nothing more. — Yakov Galka, Jan 14 '12 at 00:41
Intel Ivybridge and later *do* have optimized `rep stos` and `rep movs`. Intel calls this "Fast String" support, or ERMSB (Enhanced Rep Movs/StosB). See Intel's optimization manual (linked from http://stackoverflow.com/tags/x86/info) for more info. This is still a good question, of why previous CPUs have microcoded implementations of `rep movs` take weren't as fast as an SSE loop doing 16B loads/stores. — Peter Cordes, Nov 04 '15 at 22:18

Dietrich Epp · Answer 1 · 2021-09-23T08:44:57.733

Cost.

(Note that memcpy is coming to ARM, see below.)

The cost of optimizing memcpy in your C library is fairly minimal, maybe a few weeks of developer time here and there. You'll have to make a new version every several years or so when processor features change enough to warrant a rewrite. For example, GNU's glibc and Apple's libSystem both have a memcpy which is specifically optimized for SSE3.

The cost of optimizing in hardware is much higher. Not only is it more expensive in terms of developer costs (designing a CPU is vastly more difficult than writing user-space assembly code), but it would increase the transistor count of the processor. That could have a number of negative effects:

Increased power consumption
Increased unit cost
Increased latency for certain CPU subsystems
Lower maximum clock speed

In theory, it could have an overall negative impact on both performance and unit cost.

Maxim: Don't do it in hardware if the software solution is good enough.

But, memcpy is coming to ARM. As processors get larger and larger, the incremental cost of adding more instructions gets lower and lower, relative to the existing cost of a core. From Arm A-Profile Architecture Developments 2021:

To address these concerns the 2021 extensions introduce new instructions specifically targeting the memcpy() and memset() family of functions.

The key concerns mentioned are that the complicated software implementations of memcpy, while fast, may need to be rewritten for different microarchitectures in order to get better performance. They also need to account for alignment and size in different ways. Having a fast hardware implementation means that memcpy can be inlined and get good performance across different microarchitectures.

Note: The bug you've cited is not really a bug in glibc w.r.t. the C specification. It's more complicated. Basically, the glibc folks say that memcpy behaves exactly as advertised in the standard, and some other folks are complaining that memcpy should be aliased to memmove.

Time for a story: It reminds me of a complaint that a Mac game developer had when he ran his game on a 603 processor instead of a 601 (this is from the 1990s). The 601 had hardware support for unaligned loads and stores with minimal performance penalty. The 603 simply generated an exception; by offloading to the kernel I imagine the load/store unit could be made much simpler, possibly making the processor faster and cheaper in the process. The Mac OS nanokernel handled the exception by performing the required load/store operation and returning control to the process.

But this developer had a custom blitting routine to write pixels to the screen which did unaligned loads and stores. Game performance was fine on the 601 but abominable on the 603. Most other developers didn't notice if they used Apple's blitting function, since Apple could just reimplement it for newer processors.

The moral of the story is that better performance comes both from software and hardware improvements.

In general, the trend seems to be in the opposite direction from the kind of hardware optimizations mentioned. While in x86 it's easy to write memcpy in assembly, some newer architectures offload even more work to software. Of particular note are the VLIW architectures: Intel IA64 (Itanium), the TI TMS320C64x DSPs, and the Transmeta Efficeon are examples. With VLIW, assembly programming gets much more complicated: you have to explicitly select which execution units get which commands and which commands can be done at the same time, something that a modern x86 will do for you (unless it's an Atom). So writing memcpy suddenly gets much, much harder.

These architectural tricks allow you to cut a huge chunk of hardware out of your microprocessors while retaining the performance benefits of a superscalar design. Imagine having a chip with a footprint closer to an Atom but performance closer to a Xeon. I suspect the difficulty of programming these devices are is the major factor impeding wider adoption.

Good answer. Or how I'd summarize it: "Don't do things in hardware, if you can do them just as efficiently in software". About the glibc thingy: Torvalds is clearly the practical one here - users don't care why something is not working. And I somehow doubt that `memmove` has a noticeable performance hit when compared to `memcpy` these days.. will have to test that. — Voo, Jan 14 '12 at 00:41
@ybungalobill: I was simply adding context, since even though you might know the story of glibc `memcpy`, visitors to the site might not. — Dietrich Epp, Jan 14 '12 at 00:45
@Voo: Note that Adobe fixed the flash plugin and very few people seem to mind these days that `memcpy` still isn't `memmove`. I'm not trying to take sides. — Dietrich Epp, Jan 14 '12 at 00:48
@DietrichEpp: you may want to rewrite the answer. On the one hand you say that ARM adds an optimized `memcpy`, on the other that 'the trend is in the opposite direction'. Itanium is long dead, and @PhiS answer tells that Intel indeed implemented this optimization a long time ago. To me it seems that everybody accepted that CPU support for fast `memcpy/memset` is a desired feature, and that `memcpy` implementations were lagging behind the hardware due to the added complexity of choosing the right path based on the runtime architecture. — Yakov Galka, Sep 30 '21 at 14:01
@YakovGalka: I've already incorporated that into the answer, and explained the reasoning. I generally do rewrites for clarity once an answer has 100 upvotes, as a rule of thumb--if it's reaching that many people, then rewriting it is worth the effort. This answer isn't reaching so many people--only 31 total votes (up and down) over 8 years. — Dietrich Epp, Sep 30 '21 at 17:35

score 21 · Accepted Answer · answered Feb 07 '12 at 13:47

21

One thing I'd like to add to the other answers is that rep movs is not actually slow on all modern processors. For instance,

Usually, the REP MOVS instruction has a large overhead for choosing and setting up the right method. Therefore, it is not optimal for small blocks of data. For large blocks of data, it may be quite efficient when certain conditions for alignment etc. are met. These conditions depend on the specific CPU (see page 143). On Intel Nehalem and Sandy Bridge processors, this is the fastest method for moving large blocks of data, even if the data are unaligned.

[Highlighting is mine.] Reference: Agner Fog, Optimizing subroutines in assembly language An optimization guide for x86 platforms. ,p. 156 (and see also section 16.10, p. 143) [version of 2011-06-08].

answered Feb 07 '12 at 13:47

PhiS

4,540
25
35

10

REP MOVS uses a cache protocol feature that is not available to regular code. Basically like SSE streaming stores, but in a manner that is compatible with normal memory ordering rules, etc. // The "large overhead for choosing and setting up the right method" is mainly due to the lack of microcode branch prediction. I have long wished that I had implemented REP MOVS using a hardware state machine rather than microcode, which could have completely eliminated the overhead. – Krazy Glew Sep 04 '13 at 17:11
By the way, I have long said that one of the things that hardware can do better/faster than software is complex multiway branches. – Krazy Glew Oct 08 '13 at 19:43
@KrazyGlew: I think you're talking about the IvB and later Fast String ops feature, aka ERMSB (Enhanced Rep Movs/StosB). Even then, there's a feature bit you can disable in case you need to run binaries that depend on the ordering of writes within a single `rep movs`, instead of the now-recommended method of doing a separate store to a synchronization flag after a memset/memcpy. – Peter Cordes Nov 04 '15 at 22:09
6

@PeterCordes: intel x86 have had "fast strings" since the Pentium Pro (P6) in 1996, which I supervised. The P6 fast strings took REP MOVSB and larger, and implemented them with 64 bit microcode loads and stores and a no-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in iVB. – Krazy Glew Nov 10 '15 at 14:47
Btw: I supervised fast strings on P6, certainly not the entire P6. – Krazy Glew Nov 10 '15 at 14:56
4

Rewriting a comment that had an auto-correct-typo: @PeterCordes: the big weakness of doing fast strings in microcode was (a) microcode branch mispredictions, and (b) the microcode fell out of tune with every generation, getting slower and slower until somebody got around to fixing it. Just like a library men copy falls out of tune. I suppose that it is possible that one of the missed opportunities was to use 128-bit loads and stores when they became available, and so on. – Krazy Glew Nov 10 '15 at 20:14
@KrazyGlew: tyvm for that insight into what happens under the hood. I hadn't realized pre-IvB string ops avoided RFO. Also interesting to hear that microcoded instructions don't always get tweaked for each CPU generation. I'm not a hardware designer myself, but I casually follow it the way some people follow sports teams. I've seen your name come up on realworldtech.com. (reposting to re-order comments into correct order after your repost. I did eventually deciphered that autocorrect, but it was confusing at first :P) – Peter Cordes Nov 10 '15 at 20:17
3

In retrospect, I should have written a self-tuning infrastructure, to get reasonably good microcode on every generation. But that would not have helped use new, wider, loads and stores, when they became available. // The Linux kernel seems to have such an autotuning infrastructure, that is run on boot. // Overall, however, I advocate hardware state machines that can smoothly transition between modes, without incurring branch mispredictions. // It is debatable whether good microcode branch prediction would obviate this. – Krazy Glew Nov 10 '15 at 20:19
1

@KrazyGlew: It seems strange to me that "rep movs" wouldn't be a constant focus of hardware optimization. Given the amount of other logic on a modern x86 CPU, the amount required to ensure that "rep movs" was never far from being optimal would seem pretty small. If user code wanting a fast memcpy has to lead off with logic to select the optimal approach, it will be difficult for hardware to completely optimize away such tests. If user code can simply use `rep movs_` and let hardware worry about the optimal approach, hardware won't have to optimize away such tests. – supercat Sep 06 '17 at 20:12
@supercat: it is possible that Intel's latest x86 CPUs have made the investment so that REP MOVSx is optimal in all circumstances. But I doubt it. – Krazy Glew Sep 09 '17 at 12:51
@supercat: I think HW state machines are required, possibly at several levels. Ideally, there would be a special bus transaction, "move from A to B", so that data only crosses any particular data bus once - possibly with lookaside caching. I used to have an agenda for block memory operations on a web page - P6 fast strings was just the first few steps. – Krazy Glew Sep 09 '17 at 13:01
1

@KrazyGlew: Even if HW doesn't use the absolutely-optimal sequence of bus operations, I would think it should be possible to keep it *close enough* to the optimal instruction sequence that hardware would support that it would be reasonable for programmers to use "rep movs" across the board. – supercat Sep 10 '17 at 13:45
@supercat: I agree. Otherwise I would not have done "fast strings" in the first place. But it was not maintained for a while. That may have changed - one can only hope, – Krazy Glew Sep 10 '17 at 20:59
@KrazyGlew: I wish you could write all that as an answer from somebody who actually *worked* on it. Other answers here are just speculations. – Yakov Galka Oct 15 '19 at 23:45

Guy Sirton · Answer 3 · 2012-01-14T01:04:34.477

5

General Purpose vs. Specialized

One factor is that those instructions (rep prefix/string instructions) are general purpose, so they'll handle any alignment, any number of bytes or words and they'll have certain behavior relative to the cache and or state of registers etc. i.e. well defined side effects that can't be changed.

The specialized memory copy may only work for certain alignments, sizes, and may have different behavior vs. the cache.

The hand written assembly (either in the libary or one developers may implement themselves) may outpeform the string instruction implementation for the special cases where it is used. Compilers will often have several memcpy implementations for special cases and then the developer may have a "very special" case where they roll their own.

It doesn't make sense to do this specialization at the hardware level. Too much complexity (= cost).

The law of diminishing returns

Another way to think about it is that when new features are introduced, e.g. SSE, the designers make architectural changes to support these features, e.g. a wider or higher bandwidth memory interface, changes to the pipeline, new execution units, etc. The designer is unlikely at this point to go back to the "legacy" portion of the design to try and bring it up to speed to the latest features. That would kind of be counter-productive. If you follow this philosophy you may ask why do we need SIMD in the first place, can't the designer just make the narrow instructions work as fast as SIMD for those cases where someone uses SIMD? The answer is usually that it's not worth it because it is easier to throw in a new execution unit or instructions.

edited Jan 14 '12 at 01:04

answered Jan 14 '12 at 00:16

Guy Sirton

8,331
2
26
36

I have the following problems with this answer: "those instructions are general purpose" so is the `memcpy` in the library. It can work with any alignment, etc... "certain behavior relative to the cache and or state of registers" the semantics of "rep stos" are much simpler than the SIMD based `memcpy`. You know the side effects even before you executed the command, its just `esi += ecx, edi += ecx, ecx = 0`. Nothing else is changed, as opposed to the SIMD version where you actually utilize almost everything on the CPU. – Yakov Galka Jan 14 '12 at 00:29
"The specialized memory copy..." I don't talk about the specialized. "The hand written assembly may outpeform the library implementation ..." what? The library implementation *is* the handwritten. – Yakov Galka Jan 14 '12 at 00:30
@ybungalobill: Some people write their own mem copy functions and then there are specialized library implementations. It wasn't clear to me which ones you were talking about. I'll clarify my answer. – Guy Sirton Jan 14 '12 at 00:36
@ybungalobill: my point is that the side effects and legacy requirements may prevent a high speed implementation. esi += ecx and all the logic around it may make the copy slower vs. instructions that don't require this. SIMD units are actually more of a standalone mini-processor, they have their own registers etc. so they don't interace as closely with the state of the rest of the processor. (processor dependant, YMMV, but SSE and NEON are pretty much like that). – Guy Sirton Jan 14 '12 at 00:46
@GuySirton: I like your point about "(not) going back to the legacy portion of the design". It is very accurate. // It is overall true that a well tuned REP MOVS can beat a well tuned memcopy or memmove - especially if the decision overhead can be reduced. But if only because STA_WC is a feature not available to normal programmers. But... somebody at Intel would have to budget for the effort to tune REP MOVS on every processor generation. Whereas, if you don't retune REP MOVS, then the user has to retune - but later. // i.e. separate memcopy is JIT - don't retune until problem found. – Krazy Glew Sep 04 '13 at 17:16
3

Tune memcopy today, and your tuned version can be deployed to every existing PC in the market. Tune REP MOVS, and it can only be deployed to future CPUs. (Could use microcode patch, but that is mainly used for security holes. Plus, ucode patch has overhead.) – Krazy Glew Sep 04 '13 at 17:19
1

If hardware/firmware tunes REP MOVS, then every app on the new processor that uses REP MOVS will benefit. Whereas any software that has tuned memcopy by itself will not benefit from the tuned memcopy. If such software used "SIMD" loads and stores, ideally 512-bit full cache line memops possibly with vector masks, then it may benefit from future hardware optimization to those. – Krazy Glew Sep 09 '17 at 14:16
If you have a machine with binary translation/optimization, the optimizer may be able to recognize old versions if memcopy and translate them into new versions. – Krazy Glew Sep 09 '17 at 14:18

Bo Persson · Answer 4 · 2012-01-14T22:47:29.357

Once upon a time rep movsb was the optimal solution.

The original IBM PC had an 8088 processor with an 8-bit data bus and no caches. Then the fastest program was generally the one with the fewest number of instruction bytes. Having special instructions helped.

Nowadays, the fastest program is the one that can use as many CPU features as possible in parallel. Strange as it might seem at first, having code with many simple instructions can actually run faster than a single do-it-all instruction.

Intel and AMD keep the old instructions around mainly for backward compatibility.

score 1 · Answer 5 · answered Jan 14 '12 at 01:44

In embedded systems, it's common to have specialized hardware that does memcpy/memset. It's not normally done as a special CPU instruction, rather it's a DMA peripheral that sits on the memory bus. You write a couple of registers to tell it the addresses, and HW does the rest. It doesn't really warrant a special CPU instruction since it's really just a memory interface issue that doesn't really need to involve the CPU.

score 1 · Answer 6 · answered Jan 14 '12 at 03:56

If it aint broke dont fix it. It aint broke.

A primary problem is unaligned accesses. They go from bad to really bad depending on what architecture you are running on. A lot of it has to do with the programmers, some with the compilers.

The cheapest way to fix memcpy is to not use it, keep your data aligned on nice boundaries and use or make an alternate to memcpy that only supports nice aligned, block copies. Even better would be to have a compiler switch to sacrifice program space and ram for the sake of speed. folks or languages that use a lot of structures such that the compiler internally generates calls to memcpy or whatever that language equivalent is would have their structures grow such that there is a pad between or padding inside. A 59 byte structure may become 64 bytes instead. malloc or an alternative that only gives pointers to an address aligned as specified. etc etc.

It is considerably easier to just do all of this yourself. An aligned malloc, structures that are multiples of the alignement size. Your own memcpy that is aligned, etc. with it being that easy why would the hardware folks mess up their designs and compilers and users? there is no business case for it.

Another reason is that caches have changed the picture. your dram is only accessible in a fixed size, 32 bits 64 bits, something like that, any direct accesses smaller than that are a huge performance hit. Put the cache in front of that the performance hit goes way down, any read-modify-write happens in the cache with the modify allowing for mulitple modifies for a single read and write of dram. You still want to reduce the number of memory cycles to the cache, yes, and you can still see the performance gain by smoothing that out with the gear shift thing (8 bit first gear, 16 bit second gear, 32 bit third gear, 64 bit cruising speed, 32 bit shift down, 16 bit shift down, 8 bit shift down)

I cant speak for intel but do know that folks like ARM have done what you are asking a

ldmia r0!,{r2,r3,r4,r5}

for example is still four 32 bit transfers if the core uses a 32 bit interface. but for 64 bit interfaces if aligned on a 64 bit boundry it becomes a 64 bit transfer with a length of two, one set of negotiations between the parties and two 64 bit words move. If not aligned on a 64 bit boundary then it becomes three transfers a single 32 bit, a single 64 bit then a single 32 bit. You have to be careful, if these are hardware registers that may not work depending on the design of the register logic, if it only supports single 32 bit transfers you cant use that instruction against that address space. No clue why you would try something like that anyway.

The last comment is...it hurts when I do this...well dont do that. Dont single step into memory copies. the corollary to that is there is no way anyone would modify the design of the hardware to make single stepping a memory copy easier on the user, that use case is so small it doesnt exist. Take all the computers using that processor running at full speed day and night, measured against all the computers being single stepped through mem copies and other performance optimized code. It is like comparing a grain of sand to the width of the earth. If you are single stepping, you are still going to have to single step through whatever the new solution is if there were one. to avoid huge interrupt latencies the hand tuned memcpy will still start with an if-then-else (if too small of a copy just go into a small set of unrolled code or a byte copy loop) then go into a series of block copies at some optimal speed without horrible latency size. You will still have to single step through that.

to do single stepping debugging you have to compile screwed up, slow, code anyway, the easiest way to solve a single step through memcpy problem, is to have the compiler and linker when told to build for debug, build for and link against a non-optimized memcpy or an alternate non-optimized library in general. gnu/gcc and llvm are open source, you can make them do whatever you want.

Why are complicated memcpy/memset superior?

6 Answers6

Linked

Related