
I am interested in knowing the best approach for bulk memory copies on the x86 architecture. I realize this depends on machine-specific characteristics; the main target is typical desktop machines made in the last 4-5 years.

I know that in the old days REP MOVSD was nominally the fastest approach because it moves 4 bytes at a time, but I have read that nowadays MOVSB is just as fast and is simpler to write, so you may as well do a byte move and forget about the complexities of a 4-byte move.

A related question is whether the MOVSx instructions are worth using at all. If the CPU can run so much faster than the memory bus, then maybe a CISC string move is pointless and you may as well use plain MOV instructions. That would be most attractive because then I could use the same algorithms on other processor architectures like ARM. This raises the analogous question of whether ARM's specialized instructions for bulk memory moves (which are totally different from Intel's) are worth it or not.
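To make the "plain MOV" alternative concrete, here is a minimal, portable C sketch (not from the original post) of what a plain-MOV copy loop amounts to: the compiler lowers each register-sized assignment to ordinary load/store MOVs on x86 (or LDR/STR on ARM), with no string instructions involved. The function name `copy_words` is illustrative.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative plain-MOV copy loop: register-sized chunks, then a
 * byte tail. Each iteration compiles to ordinary load/store moves. */
static void copy_words(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    /* Copy one register-sized word per iteration. memcpy of a
     * fixed, small size is compiled to a single load/store pair
     * and sidesteps alignment and strict-aliasing problems. */
    while (n >= sizeof(uintptr_t)) {
        uintptr_t w;
        memcpy(&w, s, sizeof w);
        memcpy(d, &w, sizeof w);
        s += sizeof w; d += sizeof w; n -= sizeof w;
    }
    /* Remaining bytes, one at a time. */
    while (n--) *d++ = *s++;
}
```

Whether this beats the string instructions is exactly the question; the loop is merely the baseline both sides are comparing against.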


Note: I have read section 3.7.6 of the Intel Optimization Reference Manual, so I am familiar with the basics. I am hoping someone can relate practical experience in this area beyond what the manual covers.

Tyler Durden
  • Isn't this quite simple to test? – MatBailie Dec 09 '12 at 20:45
  • Certainly, you should measure in practice any tip you might get here. One option you should consider is looking at what SSE instructions can offer. Since they can use registers wider than 64 bits, they might be faster than scalar instructions, but that's just a guess. – Michał Kosmulski Dec 09 '12 at 20:49
  • @Dems No. There are a lot of different ways to move blocks, with different results on different kinds of systems. I need somebody with a lot of assembly experience over the last few years to weigh in with their findings. In particular, I don't even know all the different ways you can arrange block moves; that alone could be the subject of an article. That is why I need expert advice. – Tyler Durden Dec 09 '12 at 20:49
  • This is something that a library writer worries about, particularly the guy that wrote memcpy(). The more modern the CPU, the less it matters, because it is all throttled by memory-bus bandwidth anyway. Avoid reinventing that wheel and making it square; use the library. – Hans Passant Dec 09 '12 at 21:01
  • I am not writing in C. I am writing ASM routines. Also, just for the record, memcpy is written in C and is a plain unsigned byte copy. If optimized code is inlined, then it is a function of the compiler, not memcpy. – Tyler Durden Dec 09 '12 at 21:09
  • Just because you write in ASM doesn't mean you can't call the C library memcpy. I have seen no modern C library implement memcpy as a plain unsigned byte copy, but I have seen many examples of wheel-reinvention resulting in worse performance than the system libraries would provide. – unixsmurf Dec 09 '12 at 21:29
  • @unixsmurf As I said above, optimized memory moves are hard-coded by high-end compilers. Run-of-the-mill memcpy, especially as it exists on Unix, is poorly optimized and far less efficient than even naive ASM. This Intel article makes clear some of the issues involved: http://software.intel.com/en-us/articles/memcpy-performance – Tyler Durden Dec 09 '12 at 21:37
  • And I would then refer to http://sourceware.org/git/?p=glibc.git;a=blob;f=ports/sysdeps/arm/memcpy.S;h=08f7f76ecf192c5e394ec57d9ee0db68aa57cbc0;hb=HEAD, both as a reference of optimized implementations for ARM and as a means of refuting the article. – unixsmurf Dec 09 '12 at 21:48
  • `REP MOVSB` is going to be the fastest way on modern x86, thanks to modern CPUs' ability to combine the byte writes into wider, SIMD-sized operations. – Griwes Dec 09 '12 at 23:51
  • An interesting experiment was done about a year ago on the Linux kernel mailing list, see https://lkml.org/lkml/2011/8/12/267 - the results on x86 are: for small buffers of generic alignment (and these, from tracing, are >>90% of all mem-mem transfers), use `rep movs`; for big chunks (here: video frames), other techniques significantly outperform it. So if you have huge blocks to copy, other techniques make sense. On ARM, a blocked copy loop with prefetching, see http://code.metager.de/source/xref/linux/stable/arch/arm/lib/copy_page.S , outperforms a 32-bit-at-a-time tight loop by two orders of magnitude. – FrankH. Dec 10 '12 at 16:02
  • I would also mention that while glibc's memcpy might not be optimal (though I doubt it's worse than a naive assembly implementation), it's not like there aren't any more optimized libraries out there. – Grizzly Dec 11 '12 at 22:30
  • No memcpy library copies byte-by-byte now. Even the simplest version of `memcpy` copies at least a word at a time: http://www.opensource.apple.com/source/xnu/xnu-2050.18.24/libsyscall/wrappers/memcpy.c Some libraries, like Apple's, provide much better memcpy performance than "high-end compilers" do; that's why it's often recommended to disable gcc's builtin memcpy: http://stackoverflow.com/questions/1209529/optimized-memcpy https://sourceware.org/ml/libc-help/2008-08/msg00007.html – phuclv Feb 09 '15 at 08:08

1 Answer


Modern Intel and AMD processors optimise REP MOVSB so that, when conditions allow, it copies entire cache lines at a time, making it one of the best (perhaps not the absolute fastest, but pretty close) methods of copying bulk data.
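As a concrete sketch (not part of the original answer), this is how `rep movsb` is typically issued from C with GCC/Clang inline assembly. The x86 guard and the `memcpy` fallback are assumptions added so the example stays portable; on non-x86 targets the function simply defers to the library.

```c
#include <stddef.h>
#include <string.h>

/* Sketch: bulk copy via REP MOVSB on x86, library memcpy elsewhere.
 * RDI = destination, RSI = source, RCX = byte count; the SysV ABI
 * guarantees the direction flag is clear, so the copy runs forward. */
static void copy_rep_movsb(void *dst, const void *src, size_t n)
{
#if defined(__x86_64__) || defined(__i386__)
    __asm__ volatile ("rep movsb"
                      : "+D"(dst), "+S"(src), "+c"(n)
                      :
                      : "memory");
#else
    memcpy(dst, src, n);   /* portable fallback */
#endif
}
```

On CPUs with the ERMSB feature (Ivy Bridge and later), the microcode behind this one instruction handles alignment and moves cache-line-sized chunks internally, which is what the answer is referring to.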

As for ARM, it depends on the architecture version, but in general an unrolled load/store loop is the most efficient approach.
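The unrolled-loop idea can be sketched in portable C (an illustration added here, not the answerer's code): several register-sized moves per iteration give the compiler adjacent loads and stores that it can fuse into LDM/STM on AArch32 or LDP/STP pairs on AArch64.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative unrolled copy: four 64-bit moves per iteration.
 * The grouped loads then grouped stores are the pattern ARM
 * compilers turn into multi-register load/store instructions. */
static void copy_unrolled(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;

    while (n >= 4 * sizeof(uint64_t)) {
        uint64_t w[4];
        memcpy(w, s, sizeof w);   /* four loads  */
        memcpy(d, w, sizeof w);   /* four stores */
        s += sizeof w; d += sizeof w; n -= sizeof w;
    }
    while (n--) *d++ = *s++;      /* byte tail */
}
```

Hand-written ARM versions (such as the glibc memcpy linked in the comments) add explicit prefetch on top of this pattern.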

Mutabah
  • On ARM, just unrolling the loop is neither necessary nor possible - because you can already load/store all registers in one instruction, that's the beauty of `ldm` / `stm` (load/store _multiple_). But you can, in addition to that, prefetch. Try the glibc ARM memcpy code as referenced above by @unixsmurf with/without the `PLD()` bits - the effect is very significant. – FrankH. Dec 10 '12 at 16:13
  • @FrankH. In ARM64 the ability to load/store all registers at once has been removed. You can only load a pair of registers now, but then it's no different from copying using NEON. – phuclv Feb 09 '15 at 06:38