One popular answer on StackOverflow that does use x86-64 assembly and SSE can be found here: Very fast memcpy for image processing?. If you use this code, you'll need to make sure your buffers are 16-byte (128-bit) aligned. A basic explanation for that code (with a rough intrinsics sketch after the list) is that:
- Non-temporal stores are used, so unnecessary cache fills can be bypassed and writes to main memory can be combined.
- Reads and writes are interleaved only in very large chunks (many reads, then many writes). Performing many reads back-to-back typically performs better than an alternating read-write-read-write pattern.
- Much larger (128-bit SSE) registers are used.
- Prefetch instructions are included as hints to the CPU's pipeline.
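Putting those points together, here is a minimal sketch of such a copy loop using SSE2 intrinsics rather than hand-written assembly. This is not the StackOverflow answer's exact code; the 128-byte chunk size, the prefetch distance, and the name copy_nt are illustrative choices, and it assumes both buffers are 16-byte aligned and the length is a multiple of the chunk size:

```c
/* Simplified sketch of a non-temporal SSE copy loop (illustrative, not the
 * answer's exact assembly). Assumes: dst and src are 16-byte aligned and
 * bytes is a multiple of BLOCK. */
#include <emmintrin.h>   /* SSE2: _mm_load_si128, _mm_stream_si128, _mm_sfence */
#include <stddef.h>

#define BLOCK 128        /* copy in 128-byte chunks: 8 x 16-byte registers */

static void copy_nt(void *dst, const void *src, size_t bytes)
{
    const __m128i *s = (const __m128i *)src;
    __m128i       *d = (__m128i *)dst;

    for (size_t i = 0; i < bytes / BLOCK; ++i) {
        /* Hint the prefetcher at data several chunks ahead (distance is illustrative). */
        _mm_prefetch((const char *)(s + 32), _MM_HINT_NTA);

        /* Batch the reads... */
        __m128i r0 = _mm_load_si128(s + 0);
        __m128i r1 = _mm_load_si128(s + 1);
        __m128i r2 = _mm_load_si128(s + 2);
        __m128i r3 = _mm_load_si128(s + 3);
        __m128i r4 = _mm_load_si128(s + 4);
        __m128i r5 = _mm_load_si128(s + 5);
        __m128i r6 = _mm_load_si128(s + 6);
        __m128i r7 = _mm_load_si128(s + 7);

        /* ...then batch the writes, as non-temporal (streaming) stores that
         * bypass the cache and get write-combined. */
        _mm_stream_si128(d + 0, r0);
        _mm_stream_si128(d + 1, r1);
        _mm_stream_si128(d + 2, r2);
        _mm_stream_si128(d + 3, r3);
        _mm_stream_si128(d + 4, r4);
        _mm_stream_si128(d + 5, r5);
        _mm_stream_si128(d + 6, r6);
        _mm_stream_si128(d + 7, r7);

        s += 8;
        d += 8;
    }

    /* Make the streaming stores globally visible before returning. */
    _mm_sfence();
}
```

To get the required alignment you can allocate with something like _mm_malloc(size, 16) / _mm_free (or posix_memalign / _aligned_malloc), and fall back to plain memcpy for any unaligned head or tail.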
I found this document - Optimizing CPU to Memory Accesses on the SGI Visual Workstations 320 and 540 - which seems to be the inspiration for the above code, albeit targeting older processor generations; it does, however, contain a significant amount of discussion on how the technique works.
For instance, consider this discussion on write-combining / non-temporal stores:
The Pentium II and III CPU caches operate on 32-byte cache-line sized blocks. When data is written to or read from (cached) memory, entire cache lines are read or written. While this generally enhances CPU-memory performance, under some conditions it can lead to unnecessary data fetches. In particular, consider a case where the CPU will do an 8-byte MMX register store: movq. Since this is only one quarter of a cache line, it will be treated as a read-modify-write operation from the cache's perspective; the target cache line will be fetched into cache, then the 8-byte write will occur. In the case of a memory copy, this fetched data is unnecessary; subsequent stores will overwrite the remainder of the cache line.

The read-modify-write behavior can be avoided by having the CPU gather all writes to a cache line then doing a single write to memory. Coalescing individual writes into a single cache-line write is referred to as write combining. Write combining takes place when the memory being written to is explicitly marked as write combining (as opposed to cached or uncached), or when the MMX non-temporal store instruction is used. Memory is generally marked write combining only when it is used in frame buffers; memory allocated by VirtualAlloc is either uncached or cached (but not write combining). The MMX movntps and movntq non-temporal store instructions instruct the CPU to write the data directly to memory, bypassing the L1 and L2 caches. As a side effect, it also enables write combining if the target memory is cached.
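To make that distinction concrete, here is a minimal sketch (my own illustration, not code from the SGI document) contrasting a plain 8-byte store with a non-temporal one. It uses SSE2's movnti via _mm_stream_si64 in place of the MMX movntq the document mentions, so it requires a 64-bit target:

```c
#include <emmintrin.h>  /* SSE2: _mm_stream_si64 (movnti), _mm_sfence */

void store_regular(long long *dst, long long value)
{
    /* Plain 8-byte store: the cache line containing dst is fetched into the
     * cache first, then modified -- the read-modify-write the quote describes. */
    *dst = value;
}

void store_nontemporal(long long *dst, long long value)
{
    /* Non-temporal store: written out through a write-combining buffer,
     * bypassing L1/L2, without fetching the destination cache line. */
    _mm_stream_si64(dst, value);
    _mm_sfence();  /* order the streaming store before subsequent accesses */
}
```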
If you'd prefer to stick with memcpy, consider investigating the source code of the memcpy implementation you're using. Some memcpy implementations look for native-word-aligned buffers and improve performance by copying a full register's worth at a time; others will automatically copy as much as possible using native-word-aligned accesses and then mop up the remainder byte by byte. Making sure your buffers are 8-byte aligned will facilitate these mechanisms.
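As a rough illustration of that word-at-a-time-plus-mop-up pattern (a sketch only; a real memcpy also handles misaligned heads and is usually replaced by compiler built-ins), assuming both pointers are already word-aligned:

```c
/* Illustration of the "copy a native word at a time, then mop up the tail"
 * pattern; assumes dst and src are word-aligned and non-overlapping. */
#include <stddef.h>
#include <stdint.h>

static void copy_wordwise(void *dst, const void *src, size_t bytes)
{
    size_t words = bytes / sizeof(uintptr_t);
    size_t tail  = bytes % sizeof(uintptr_t);

    uintptr_t       *dw = (uintptr_t *)dst;
    const uintptr_t *sw = (const uintptr_t *)src;

    /* Bulk copy one native word (full register) at a time. */
    for (size_t i = 0; i < words; ++i)
        dw[i] = sw[i];

    /* Mop up the remaining bytes. */
    unsigned char       *db = (unsigned char *)(dw + words);
    const unsigned char *sb = (const unsigned char *)(sw + words);
    for (size_t i = 0; i < tail; ++i)
        db[i] = sb[i];
}
```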
Some memcpy implementations contain a ton of up-front conditionals to make them efficient for small buffers (< 512 bytes); you may want to consider copying the implementation and stripping out those branches, since you're presumably not working with small buffers.