Fastest way to memset on modern amd64 CPUs

Question

I'd like to fill an array of 4096 bytes (aligned to the 4096-byte boundary) with zeros in amd64 assembly. I'm looking for both portable and single-CPU-type-only solutions.

I know that rep stosq would do the trick, but is there anything faster? MMX? SSE? How much faster is it? How many bytes can be written to memory in a single instruction (without rep)? We can assume that the memory cache is empty. I don't need a fully working function implementation, I just need the basic idea with its crucial assembly instruction.

I've just seen the movdqa instruction which can write 16 bytes at a time. Is it twice as fast as 2 mov instructions of 8 bytes each?

If you know it's out of cache, it might be worth trying the streaming stores. — Mysticial, Mar 12 '14 at 20:41
@Mysticial: Can a streaming store write more than 8 bytes at a time? — pts, Mar 12 '14 at 20:44
The streaming stores only exist as SIMD instructions. SSE2 has the 16-byte streaming store and AVX has the 32-byte streaming store. — Mysticial, Mar 12 '14 at 20:48
Actually. What are you going to do with the zero memory? Do you want to zero *and* pull into cache? Or do you want to "zero and forget". If it's the latter, use streaming stores. It doesn't matter how wide it is since you'll be memory bound anyway. — Mysticial, Mar 12 '14 at 20:49
See http://stackoverflow.com/questions/3654905/faster-way-to-zero-memory-than-with-memset — amdn, Mar 12 '14 at 20:53
See also http://stackoverflow.com/questions/2688466/why-mallocmemset-is-slower-than-calloc/ — amdn, Mar 12 '14 at 21:06
@Mysticial: But is being memory-bound 16-bytes-per-write or 32-bytes-per-write or more? What is the largest number of bytes that I can write at a time, and with which instruction? — pts, Mar 12 '14 at 21:17
@pts Much less than that. If you're memory bound, you're limited by the bandwidth of your memory. A typical desktop today would only get about 5 bytes/cycle. So it doesn't matter how wide the instruction is, your memory will be holding you back. The streaming part will usually help because it eliminates some of the back-and-forth. (Such as not needing to read the cacheline if you're going to overwrite all of it.) — Mysticial, Mar 12 '14 at 21:22
I'm not sure if a cache line requires a read / modify / write sequence in order to do writes, since the lower bits of a virtual address can be used to select individual bytes of a cache line. — rcgldr, Mar 13 '14 at 01:32
@KerrekSB while your advice is good, it's also be helpful to say _how to do this_. The trick is to use `buf = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON /* | MAP_LOCKED */, -1, 0);` (if you're running privileged, the `MAP_LOCKED` will guarantee the page being present already right after the `mmap`). — FrankH., Mar 13 '14 at 11:42

score 2 · Accepted Answer · edited May 23 '17 at 11:52

The answer to your question can be found by looking at the source code in the file memset64.asm in Agner Fog's asmlib.

His code has a version for AVX and SSE. From what I can tell the code does _mm256_store_ps (vmovaps) for some size of the array less than MemsetCacheLimit. For larger array sizes he does non-temporal stores with _mm256_stream_ps (vmovntps). There are several other factors which can affect the results. See the code. You could probably get the same performance for most cases with C/C++ using intrinsic functions.

Note that the both the built-in memset function in GCC as well as the version in glibc last I checked are not optimized (which is one reason memset is in the asmlib).

Fastest way to memset on modern amd64 CPUs

1 Answers1