17

How can the instruction rep stosb execute faster than this code?

    Clear: mov byte [edi],AL       ; Write the value in AL to memory
           inc edi                 ; Bump EDI to next byte in the buffer
           dec ecx                 ; Decrement ECX by one position
           jnz Clear               ; And loop again until ECX is 0

Is that guaranteed to be true on all modern CPUs? Should I always prefer to use rep stosb instead of writing the loop manually?

Peter Cordes
  • 7
    What kind of answer do you expect? `rep stosb` happens to be an optimized instruction for this purpose. – Jester Nov 02 '15 at 15:23
  • 1
    Hello Jester thank you very much for the prompt reply. Okay I'll put it this way.. there is an adder in CPU for adding. Likewise for the instruction "rep stosb" is there a separate circuit in the CPU? – Promod Sampath Elvitigala Nov 02 '15 at 16:49

2 Answers

41

In modern CPUs, rep stosb's and rep movsb's microcoded implementation actually uses stores that are wider than 1B, so it can go much faster than one byte per clock.

(Note this only applies to stos and movs, not repe cmpsb or repne scasb. They're still slow, unfortunately, like at best 2 cycles per byte compared on Skylake, which is pathetic vs. AVX2 vpcmpeqb for implementing memcmp or memchr. See https://agner.org/optimize/ for instruction tables, and other perf links in the x86 tag wiki.

See Why is this code 6.5x slower with optimizations enabled? for an example of gcc unwisely inlining repnz scasb or a less-bad scalar bithack for a strlen that happens to get large, and a simple SIMD alternative.)
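
As a rough software analogy for the wider-than-one-byte stores mentioned above, here is a minimal C sketch. It is not how the microcode works internally; `broadcast_byte` and `fill_wide` are hypothetical names used only for illustration. The idea is to replicate the fill byte across a 64-bit word and store 8 bytes at a time, with a byte loop for the tail.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: replicate one byte into all 8 bytes of a 64-bit word. */
static uint64_t broadcast_byte(unsigned char value)
{
    return (uint64_t)value * UINT64_C(0x0101010101010101);
}

/* Fill n bytes at dst: 8-byte stores for the bulk, byte stores for the tail. */
static void fill_wide(void *dst, unsigned char value, size_t n)
{
    assert(dst != NULL || n == 0);
    unsigned char *p = dst;
    const uint64_t pattern = broadcast_byte(value);

    while (n >= sizeof pattern) {
        /* memcpy avoids alignment and aliasing problems; compilers
           typically lower it to a single 8-byte store. */
        memcpy(p, &pattern, sizeof pattern);
        p += sizeof pattern;
        n -= sizeof pattern;
    }
    while (n > 0) {          /* remaining 0..7 bytes */
        *p++ = value;
        n--;
    }
}
```

Each iteration of the main loop writes 8 bytes instead of 1, which is the same reason a wide-store implementation of rep stosb is not limited to one byte per clock.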


rep stos/movs has significant startup overhead, but ramps up well for large memset/memcpy. (See Intel's and AMD's optimization manuals for discussion of when to use rep stos vs. a vectorized loop for small buffers.) Without the ERMSB feature, though, rep stosb is tuned for medium to small memsets and it's optimal to use rep stosd or rep stosq (if you aren't going to use a SIMD loop).

When single-stepping with a debugger, rep stos only does one iteration (one decrement of ecx/rcx), so the microcode implementation never gets going. Don't let this fool you into thinking that's all it can do.

See What setup does REP do? for some details of how Intel P6/SnB-family microarchitectures implement rep movs.

See Enhanced REP MOVSB for memcpy for memory-bandwidth considerations with rep movsb vs. an SSE or AVX loop, on Intel CPUs with the ERMSB feature. (Note especially that many-core Xeon CPUs can't saturate DRAM bandwidth with only a single thread, because of limits on how many cache misses are in flight at once, and also RFO vs. non-RFO store protocols.)


A modern Intel CPU should run the asm loop in the question at one iteration per clock, but an AMD Bulldozer-family core probably can't even manage one store per clock. (Bottleneck on the two integer execution ports handling the inc/dec/branch instructions. If the loop condition were a cmp/jcc on edi, an AMD core could macro-fuse the compare-and-branch.)


One major feature of so-called Fast String operations (rep movs and rep stos on Intel P6 and SnB-family CPUs) is that they avoid the read-for-ownership cache coherency traffic when storing to not-previously-cached memory. So it's like using NT stores to write whole cache lines, but still strongly ordered. (The ERMSB feature does use weakly-ordered stores.)

IDK how good AMD's implementation is.


(And a correction: I previously said that Intel SnB can only handle a taken-branch throughput of one per 2 clocks, but in fact it can run tiny loops at one iteration per one clock.)

See the optimization resources (esp. Agner Fog's guides) linked from the tag wiki.


Intel IvyBridge and later also have ERMSB, which lets rep stos[b/w/d/q] and rep movs[b/w/d/q] use weakly-ordered stores (like movnt), allowing the stores to commit to cache out-of-order. This is an advantage if not all of the destination is already hot in L1 cache. I believe, from the wording of the docs, that there's an implicit memory barrier at the end of a fast string op, so any reordering is only visible between stores made by the string op, not between it and other stores. i.e. you still don't need sfence after rep movs.

So for large aligned buffers on Intel IvB and later, a rep stos implementation of memset can beat any other implementation. One that uses movnt stores (which don't leave the data in cache) should also be close to saturating main memory write bandwidth, but may in practice not quite keep up. See comments for discussion of this, but I wasn't able to find any numbers.

For small buffers, different approaches have very different amounts of overhead. Microbenchmarks can make SSE/AVX copy-loops look better than they are, because doing a copy with the same size and alignment every time avoids branch mispredicts in the startup/cleanup code. IIRC, it's recommended to use a vectorized loop for copies under 128B on Intel CPUs (not rep movs). The threshold may be higher than that, depending on the CPU and the surrounding code.
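
To make the size-based choice concrete, here is a minimal C sketch of a dispatcher. Everything in it is an assumption for illustration: the names (`fill_strategy`, `fill_tuning`, `choose_fill_strategy`) are hypothetical, and the thresholds are tuning inputs you would have to measure per CPU, not values from any manual. The 128-byte figure above is one plausible starting point for the lower threshold.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical strategy labels for a memset/memcpy dispatcher. */
enum fill_strategy { FILL_VECTOR_LOOP, FILL_REP_STOS, FILL_NT_STORES };

/* Tuning knobs; the right values depend on the CPU and surrounding code. */
struct fill_tuning {
    size_t rep_min_bytes; /* below this, rep stos/movs startup overhead dominates */
    size_t nt_min_bytes;  /* at or above this, consider cache-bypassing stores */
};

static enum fill_strategy choose_fill_strategy(size_t n, const struct fill_tuning *t)
{
    assert(t != NULL && t->rep_min_bytes <= t->nt_min_bytes);
    if (n < t->rep_min_bytes)
        return FILL_VECTOR_LOOP;
    if (n < t->nt_min_bytes)
        return FILL_REP_STOS;
    return FILL_NT_STORES;
}
```

On a CPU where rep stos keeps up with NT stores even for huge buffers, you could set `nt_min_bytes` to a very large value so that tier is never selected.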

Intel's optimization manual also has some discussion of overhead for different memcpy implementations, and that rep movsb has a larger penalty for misalignment than movdqu.


See the code for an optimized memset/memcpy implementation for more info on what is done in practice. (e.g. Agner Fog's library).

Peter Cordes
  • 1
    Great answer. I actually did a comparison of the different approaches some time ago, see here: http://stackoverflow.com/questions/27940150/how-to-populate-a-64-bit-register-with-duplicate-byte-values/27944531#27944531 Unfortunately the full code has been removed by pastebin. – zx485 Nov 03 '15 at 10:00
  • Why is SNB limited to one byte per two clocks? More importantly, I am not sure your statement about `rep stos` and `movnt` on IVB and later is correct. I would argue it's the other way around. See my answer to [whats-missing-sub-optimal-in-this-memcpy-implementation](http://stackoverflow.com/a/26256216/2542702) and also read the comments to my answer. – Z boson Nov 03 '15 at 12:56
  • @Zboson: my mistake, for some reason I thought I remembered finding that SnB couldn't do one-iteration-per-clock loops. Maybe my test was suffering from some weird alignment thing? Anyway, I just re-tested, and it can run a tiny loop at one iteration per clock after all. – Peter Cordes Nov 04 '15 at 00:13
  • 1
    crap, I'm going to have to go and fix all my recent answers where I included that error about SnB loop throughput. – Peter Cordes Nov 04 '15 at 00:26
  • @PeterCordes, I was more concerned about your statement "on Intel IvB and later, a rep stos implementation of memset can beat any implementation except one that uses movnt stores". I think it is incorrect. On IVB and later Intel has what it calls "Enhanced stosb". This should beat even `movnt`. See the Intel optimization manual section 3.7.6. See the link in my comment above for more details. – Z boson Nov 04 '15 at 08:26
  • @Zboson: I meant asymptotically, when main memory write bandwidth is the bottleneck, I'm pretty sure rep stos and movnt should tie with each other. I haven't tested this, and don't remember any detailed benchmarks I saw, so correct me if I'm wrong. rep stos is a much better choice for smaller buffer sizes because it leaves the data in cache. On CPUs without the "Fast String" feature enabled, movnt memset should win for large buffers. – Peter Cordes Nov 04 '15 at 17:15
  • 2
    From the link I pointed you to I wrote "So you mean I can do better than movntdqa for my 1 GB case?" and then Stephen Canon wrote "Yes, rep movsb is significantly faster than movntdqa when streaming to memory on Ivybridge and Haswell (but be aware that pre-Ivybridge it is slow!)" – Z boson Nov 04 '15 at 19:37
  • So I think since IVB rep stos should be faster than movnt even for large sizes. But I have not tested this in practice. I only know what Stephen Canon wrote. I agree that pre IVB that movnt is better. – Z boson Nov 04 '15 at 19:38
  • @Zboson: I was doing some reading. I found some stuff in Intel's own optimization guide, but mostly talking about sizes up to 4k. I assumed without checking that movntdqa could saturate main memory write bandwidth. If Stephen Canon says fast-string ops are faster, then I suppose my guess was wrong. However, note that I'm talking about memset (`stosb`), not memcpy (`movsb`) mixed read and write, with potential false dependencies etc. etc. Intel says `rep movsb` apparently slows down more than SSE with misaligned addresses (talking about movdqu I guess, since movnt is aligned-only.) – Peter Cordes Nov 04 '15 at 21:19
  • Intel says "enhanced stosb" as well so I think it applies to memset as well as memcpy. But since I have never tested it I can only guess based on Stephen Canon's statement. I'm not sure why alignment matters. If you're setting 1 GB then it's insignificant to add a little code until it's 16 byte aligned. – Z boson Nov 04 '15 at 21:23
  • @Zboson: yes, it applies to `stosb`. My point was that a `movnt` memset loop would have an easier time saturating memory bandwidth than a memcpy loop, because there are no false-dependency hazards. No mixing of read and write at all, in fact. Updated my answer. I'm not sure if alignment for `rep stos/movs` matters much for large buffers, or if it just incurs high startup overhead (which matters for small buffers) that's worse than SSE using unaligned ops. – Peter Cordes Nov 04 '15 at 21:35
  • Note: "fast strings" and the ability to work on cache lines has existed since the 1990s (in Pentium II if not before), and isn't limited to modern CPUs. Over time Intel didn't keep optimising it for each specific CPU so it fell behind. ERMSB is Intel saying that they finally got around to optimising "fast strings" in modern CPUs. – Brendan May 08 '17 at 01:51
  • 1
    @Brendan: Hmm, you're right about terminology. I shouldn't be using "fast strings" as a synonym for ERMSB in this answer. And yes, "Fast Strings" (wide stores for the microcode implementation of `rep stos` and `rep movs` (but not the compare ops)) dates back to PPro (the first P6 core, ancestor of Pentium II). Andy Glew (Intel's lead architect for fast strings) [corrected me on this only a week after I posted this answer](http://stackoverflow.com/questions/8858778/why-are-complicated-memcpy-memset-superior#comment55038727_9177369), while discussing some details of rep movs :P – Peter Cordes May 08 '17 at 04:34
8

If your CPU has the CPUID ERMSB bit set, then the rep movsb and rep stosb instructions are executed differently than on older processors.

See Intel Optimization Reference Manual, section 3.7.6 Enhanced REP MOVSB and REP STOSB operation (ERMSB).

Both the manual and my tests show that the benefits of rep stosb compared to generic 32-bit register moves on a 32-bit CPU of the Skylake microarchitecture appear only on large memory blocks, larger than 128 bytes. On smaller blocks, like 5 bytes, the code that you have shown (mov byte [edi],al; inc edi; dec ecx; jnz Clear) would be much faster, since the startup costs of rep stosb are very high, about 35 cycles. However, this speed difference has diminished on the Ice Lake microarchitecture, launched in September 2019, which introduced the Fast Short REP MOV (FSRM) feature. This feature can be tested by a CPUID bit. It was intended to make strings of 128 bytes and shorter quick, but, in fact, strings shorter than 64 bytes are still slower with rep movsb than with, for example, a simple 64-bit register copy. Besides that, FSRM is only implemented in 64-bit mode, not in 32-bit mode. At least on my i7-1065G7 CPU, rep movsb is only quick for small strings in 64-bit mode; in 32-bit mode, strings have to be at least 4KB for rep movsb to start outperforming other methods.
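
Here is a minimal C sketch of checking those CPUID bits. It assumes GCC or Clang on x86, where `<cpuid.h>` provides `__get_cpuid_count`; the struct and function names are hypothetical. As far as I know, ERMSB is reported in CPUID leaf 7, subleaf 0, EBX bit 9, and FSRM in the same leaf, EDX bit 4; please verify against the current Intel SDM before relying on it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
#include <cpuid.h>          /* GCC/Clang helper header (assumed available) */
#define HAVE_CPUID_H 1
#endif

/* Hypothetical result type for this sketch. */
struct rep_features {
    bool erms;  /* Enhanced REP MOVSB/STOSB */
    bool fsrm;  /* Fast Short REP MOV */
};

/* Decode EBX and EDX as returned by CPUID leaf 7, subleaf 0. */
static struct rep_features decode_leaf7(uint32_t ebx, uint32_t edx)
{
    struct rep_features f;
    f.erms = ((ebx >> 9) & 1u) != 0;
    f.fsrm = ((edx >> 4) & 1u) != 0;
    return f;
}

/* Query the running CPU; reports both features as absent on non-x86 builds
   or when leaf 7 is not supported. */
static struct rep_features query_rep_features(void)
{
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
#ifdef HAVE_CPUID_H
    if (!__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        ebx = 0;
        edx = 0;
    }
#endif
    (void)eax;
    (void)ecx;
    return decode_leaf7(ebx, edx);
}
```

Keeping the bit decoding separate from the query makes the logic easy to check without depending on the machine it runs on.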

To get the benefits of rep stosb on the processors with CPUID ERMSB bit, the following conditions should be met:

  • the destination buffer has to be aligned to a 16-byte boundary;
  • if the length is a multiple of 64, it can produce even higher performance;
  • the direction flag (DF) should be clear, i.e. "forward" (the cld instruction clears it).
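
For reference, a minimal sketch of issuing rep stosb from C with GCC/Clang extended inline asm. The function name `fill_rep_stosb` is hypothetical. No cld is issued here because the common x86 calling conventions require DF to be clear at function boundaries; that is an assumption about the surrounding code. On other compilers or architectures the sketch falls back to a plain byte loop.

```c
#include <assert.h>
#include <stddef.h>

/* Fill n bytes at dst with value using rep stosb where available. */
static void fill_rep_stosb(void *dst, unsigned char value, size_t n)
{
    assert(dst != NULL || n == 0);
#if (defined(__x86_64__) || defined(__i386__)) && defined(__GNUC__)
    /* rep stosb: destination in (R|E)DI, count in (R|E)CX, value in AL.
       Both pointer and count are modified, hence the "+" constraints. */
    __asm__ volatile("rep stosb"
                     : "+D"(dst), "+c"(n)
                     : "a"(value)
                     : "memory");
#else
    unsigned char *p = dst;
    while (n--)
        *p++ = value;
#endif
}
```

Alignment of `dst` is left to the caller, so the alignment conditions listed above still apply if you want the best throughput.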

According to the Intel Optimization Manual, ERMSB begins to outperform memory stores via regular registers on Skylake when the length of the memory block is at least 128 bytes. As I wrote, ERMSB has a high internal startup cost, about 35 cycles. ERMSB begins to clearly outperform other methods, including AVX copy and fill, when the length is more than 2048 bytes. However, this mainly applies to the Skylake microarchitecture and may not be the case for other CPU microarchitectures.

On some processors, but not on others, when the destination buffer is 16-byte aligned, REP STOSB using ERMSB can perform better than SIMD approaches, i.e., when using MMX or SSE registers. When the destination buffer is misaligned, memset() performance using ERMSB can degrade about 20% relative to the aligned case, for processors based on Intel microarchitecture code name Ivy Bridge. In contrast, a SIMD implementation of memset() will experience more negligible degradation when the destination is misaligned, according to Intel's optimization manual.
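
One common way to sidestep the misalignment penalty is to fill a short unaligned head separately and hand only the aligned middle to the bulk method. Here is a minimal C sketch of that arithmetic; `head_bytes`, `fill_split` and `split_for_alignment` are hypothetical names, and the alignment must be a power of two.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Bytes from addr up to the next multiple of align (0 if already aligned). */
static size_t head_bytes(uintptr_t addr, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0); /* power of two */
    return (size_t)((0 - addr) & (uintptr_t)(align - 1));
}

/* How an n-byte fill starting at addr splits into an unaligned head,
   an aligned body (a multiple of align) and a short tail. */
struct fill_split {
    size_t head;
    size_t body;
    size_t tail;
};

static struct fill_split split_for_alignment(uintptr_t addr, size_t n, size_t align)
{
    struct fill_split s;
    s.head = head_bytes(addr, align);
    if (s.head > n)
        s.head = n;                       /* buffer ends before the boundary */
    s.tail = (n - s.head) & (align - 1);
    s.body = n - s.head - s.tail;
    return s;
}
```

For example, a 100-byte fill starting at address 0x1001 with 16-byte alignment splits into a 15-byte head, an 80-byte aligned body and a 5-byte tail.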

Benchmarks

I've done some benchmarks. The code was filling the same fixed-size buffer many times, so the buffer stayed in cache (L1, L2, L3), depending on the size of the buffer. The number of iterations was chosen so that the total execution time would be about two seconds.

Skylake

On an Intel Core i5 6600 processor, released in September 2015 and based on the Skylake-S quad-core microarchitecture (3.30 GHz base frequency, 3.90 GHz Max Turbo frequency) with 4 x 32K L1 cache, 4 x 256K L2 cache and 6MB L3 cache, I could obtain ~100 GB/sec on REP STOSB with 32K blocks.

The memset() implementation that uses REP STOSB:

  • 1297920000 blocks of 16 bytes: 13.6022 secs 1455.9909 Megabytes/sec
  • 648960000 blocks of 32 bytes: 6.7840 secs 2919.3058 Megabytes/sec
  • 1622400000 blocks of 64 bytes: 16.9762 secs 5833.0883 Megabytes/sec
  • 817587402 blocks of 127 bytes: 8.5698 secs 11554.8914 Megabytes/sec
  • 811200000 blocks of 128 bytes: 8.5197 secs 11622.9306 Megabytes/sec
  • 804911628 blocks of 129 bytes: 9.1513 secs 10820.6427 Megabytes/sec
  • 407190588 blocks of 255 bytes: 5.4656 secs 18117.7029 Megabytes/sec
  • 405600000 blocks of 256 bytes: 5.0314 secs 19681.1544 Megabytes/sec
  • 202800000 blocks of 512 bytes: 2.7403 secs 36135.8273 Megabytes/sec
  • 101400000 blocks of 1024 bytes: 1.6704 secs 59279.5229 Megabytes/sec
  • 3168750 blocks of 32768 bytes: 0.9525 secs 103957.8488 Megabytes/sec (!), i.e., 103 GB/s
  • 2028000 blocks of 51200 bytes: 1.5321 secs 64633.5697 Megabytes/sec
  • 413878 blocks of 250880 bytes: 1.7737 secs 55828.1341 Megabytes/sec
  • 19805 blocks of 5242880 bytes: 2.6009 secs 38073.0694 Megabytes/sec
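
For anyone who wants to re-derive the throughput column, here is a minimal C sketch of the arithmetic. The figures in the list appear consistent with 1 Megabyte = 2^20 bytes; that unit is inferred from the numbers, not stated anywhere. The function names are hypothetical, and the clock frequency passed to `bytes_per_cycle` is an assumption, since the actual turbo frequency during the run is not known.

```c
#include <assert.h>
#include <stdint.h>

/* Throughput in MiB/s (2^20 bytes) from block count, block size and elapsed time. */
static double mib_per_second(uint64_t blocks, uint64_t block_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)blocks * (double)block_bytes / (1024.0 * 1024.0) / seconds;
}

/* Convert MiB/s to bytes per core clock cycle at an assumed clock frequency. */
static double bytes_per_cycle(double mib_per_s, double clock_hz)
{
    assert(clock_hz > 0.0);
    return mib_per_s * 1024.0 * 1024.0 / clock_hz;
}
```

For the first row, `mib_per_second(1297920000, 16, 13.6022)` gives roughly 1456 MiB/s, matching the list. Feeding the 32768-byte result into `bytes_per_cycle` with an assumed 3.3 GHz clock gives about 33 bytes per cycle.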

The memset() implementation that uses MOVDQA [RCX],XMM0:

  • 1297920000 blocks of 16 bytes: 3.5795 secs 5532.7798 Megabytes/sec
  • 648960000 blocks of 32 bytes: 5.5538 secs 3565.9727 Megabytes/sec
  • 1622400000 blocks of 64 bytes: 15.7489 secs 6287.6436 Megabytes/sec
  • 817587402 blocks of 127 bytes: 9.6637 secs 10246.9173 Megabytes/sec
  • 811200000 blocks of 128 bytes: 9.6236 secs 10289.6215 Megabytes/sec
  • 804911628 blocks of 129 bytes: 9.4852 secs 10439.7473 Megabytes/sec
  • 407190588 blocks of 255 bytes: 6.6156 secs 14968.1754 Megabytes/sec
  • 405600000 blocks of 256 bytes: 6.6437 secs 14904.9230 Megabytes/sec
  • 202800000 blocks of 512 bytes: 5.0695 secs 19533.2299 Megabytes/sec
  • 101400000 blocks of 1024 bytes: 4.3506 secs 22761.0460 Megabytes/sec
  • 3168750 blocks of 32768 bytes: 3.7269 secs 26569.8145 Megabytes/sec (!) i.e., 26 GB/s
  • 2028000 blocks of 51200 bytes: 4.0538 secs 24427.4096 Megabytes/sec
  • 413878 blocks of 250880 bytes: 3.9936 secs 24795.5548 Megabytes/sec
  • 19805 blocks of 5242880 bytes: 4.5892 secs 21577.7860 Megabytes/sec

Please note that the drawback of using the XMM0 register is that it is 128 bits (16 bytes), while I could have used the YMM0 register of 256 bits (32 bytes). Anyway, stosb uses the non-RFO protocol. Intel x86 CPUs have had "fast strings" since the Pentium Pro (P6) in 1996. The P6 fast strings took REP MOVSB and larger, and implemented them with 64-bit microcode loads and stores and a non-RFO cache protocol. They did not violate memory ordering, unlike ERMSB in Ivy Bridge. See https://stackoverflow.com/a/33905887/6910868 for more details and the source.

Anyway, even if you compare just the two methods that I have provided, and even though the second method is far from ideal, as you can see, on 64-byte blocks rep stosb is slower, but starting from 128-byte blocks rep stosb begins to outperform other methods, and the difference is very significant starting from 512-byte blocks and longer, provided that you are clearing the same memory block again and again within the cache.

Therefore, for REP STOSB, the maximum speed was 103957 (one hundred three thousand nine hundred fifty-seven) Megabytes per second, while with MOVDQA [RCX],XMM0 it was just 26569 (twenty-six thousand five hundred sixty-nine) Megabytes per second.

As you can see, the highest performance was on 32K blocks, which equals the 32K L1 cache size of the CPU on which I ran the benchmarks.

Ice Lake

REP STOSB vs AVX-512 store

I have also done tests on an Intel i7 1065G7 CPU, released in August 2019 (Ice Lake/Sunny Cove microarchitecture), Base frequency: 1.3 GHz, Max Turbo frequency 3.90 GHz. It supports AVX512F instruction set. It has 4 x 32K L1 instruction cache and 4 x 48K data cache, 4x512K L2 cache and 8 MB L3 cache.

Destination alignment

On 32K blocks zeroized by rep stosb, performance started at 175231 MB/s for a destination misaligned by 1 byte (e.g. $7FF4FDCFFFFF), quickly rose to 219464 MB/s for a destination aligned by 64 bytes (e.g. $7FF4FDCFFFC0), and then gradually rose to 222424 MB/s for destinations aligned by 256 bytes (e.g. $7FF4FDCFFF00). After that, the speed did not rise noticeably: even with the destination aligned by 32KB (e.g. $7FF4FDD00000), it was still 224850 MB/s.

There was no difference in speed between rep stosb and rep stosq.

On buffers aligned by 32K, the speed of AVX-512 store was exactly the same as for rep stosb, for loops starting from 2 stores in a loop (227777 MB/sec) and didn't grow for loops unrolled for 4 and even 16 stores. However, for a loop of just 1 store the speed was a little bit lower - 203145 MB/sec.

However, if the destination buffer was misaligned by just 1 byte, the speed of AVX512 store dropped dramatically, i.e. more than 2 times, to 93811 MB/sec, in contrast to rep stosb on similar buffers, which gave 175231 MB/sec.

Buffer Size

  • For 1K (1024 bytes) blocks, AVX-512 (205039 MB/s) was 3 times faster than rep stosb (71817 MB/s).
  • And for 512-byte blocks, AVX-512 performance was about the same as for larger block sizes (194181 MB/s), while rep stosb dropped to 38682 MB/s. At this block size, the difference was 5 times in favor of AVX-512.
  • For 2K (2048) blocks, AVX-512 had 210696 MB/s, while for rep stosb it was 123207 MB/s, almost twice as slow. Again, there was no difference between rep stosb and rep stosq.
  • For 4K (4096) blocks, AVX-512 had 225179 MB/s, while rep stosb: 180384 MB/s, almost catching up.
  • For 8K (8192) blocks, AVX-512 had 222259 MB/s, while rep stosb: 194358 MB/s, close!
  • For 32K (32768) blocks, AVX-512 had 228432 MB/s, rep stosb: 220515 MB/s - now at last! We are approaching the L1 data cache size of my CPU - 48 KB! This is 220 Gigabytes per second!
  • For 64K (65536) blocks, AVX-512 had 61405 MB/s, rep stosb: 70395 MB/s!
  • Such a huge drop when we ran out of the L1 data cache! And it was evident that, from this point, rep stosb begins to outperform AVX-512 stores.
  • Now let's check the L2 cache size. For 512K blocks, AVX-512 made 62907 MB/s and rep stosb made 70653 MB/s. That's where rep stosb begins to outperform AVX-512. The difference is not yet significant, but the bigger the buffer, the bigger the difference.
  • Now let's take a huge buffer of 1GB (1073741824). With AVX-512, the speed was 14319 MB/s, while with rep stosb it was 27412 MB/s, i.e. about twice as fast as AVX-512!

I've also tried to use non-temporal instructions for filling the 32K buffers (vmovntdq [rcx], zmm31), but the performance was about 4 times slower than with just vmovdqa64 [rcx], zmm31. How can I benefit from vmovntdq when filling memory buffers? Does the buffer need to be of some specific size for vmovntdq to have an advantage?

Also, if the destination buffers are aligned by at least 64 bytes, there is no performance difference between vmovdqa64 and vmovdqu64. Therefore, I do have a question: is the instruction vmovdqa64 only needed for debugging and safety when we have vmovdqu64?

Figure 1: Speed of iterative store to the same buffer, MB/s

block     AVX   stosb
-----   -----  ------
 0.5K  194181   38682
   1K  205039   71817
   2K  210696  123207
   4K  225179  180384
   8K  222259  194358 
  32K  228432  220515 
  64K   61405   70395 
 512K   62907   70653 
   1G   14319   27412

Summary of performance when repeatedly clearing the same memory block within the cache

rep stosb on Ice Lake CPUs begins to outperform AVX-512 stores only when repeatedly clearing the same memory buffer larger than the L1 data cache size, i.e. 48K on the Intel i7 1065G7 CPU. On small memory buffers, AVX-512 stores are much faster: for 1KB, 3 times faster; for 512 bytes, 5 times faster.

However, the AVX-512 stores are susceptible to misaligned buffers, while rep stosb is not as sensitive to misalignment.

Therefore, I have figured out that rep stosb begins to outperform AVX-512 stores only on buffers that exceed the L1 data cache size, or 48KB in the case of the Intel i7 1065G7 CPU. This conclusion is valid at least on Ice Lake CPUs. An earlier Intel recommendation that string copy begins to outperform AVX copy starting from 2KB buffers should also be re-tested for newer microarchitectures.

Clearing different memory buffers, each only once

My previous benchmarks were filling the same buffer many times in a row. A better benchmark might be to allocate many different buffers and only fill each buffer once, to not interfere with the cache.

In this scenario, there is not much difference at all between rep stosb and AVX-512 stores. The only thing that matters is that the total data size does not come close to the physical memory limit, under Windows 10 64-bit. In the following benchmarks, the total data size was below 8 GB with total physical RAM of 16 GB. When I allocated about 12 GB, performance dropped about 20 times, regardless of the method. Windows began to discard purged memory pages, and probably did some other stuff when the memory was about to be full. The L3 cache size of 8MB on the i7 1065G7 CPU did not seem to matter to the benchmarks at all. All that matters is that you do not come close to the physical memory limit, and how such situations are handled depends on your operating system. As I said, under Windows 10, if I took just half of the physical memory, it was OK, but if I took 3/4 of the available memory, my benchmark slowed down 20 times. I didn't even try to take more than 3/4. As I said, the total memory size is 16 GB. The amount available, according to the task manager, was 12 GB.

Here is the benchmark of the speed of filling various blocks of memory totalling 8 GB with zeros (in MB/sec) on the i7 1065G7 CPU with 16 GB total memory, single-threaded. By "AVX" I mean "AVX-512" normal stores, and by "stosb" I mean "rep stosb".

Figure 2: Speed of store to the multiple buffers, once each, MB/s

block    AVX  stosb
-----   ----   ----
 0.5K   3641   2759
   1K   4709   3963
   2K  12133  13163
   4K   8239  10295
   8K   3534   4675
  16K   3396   3242
  32K   3738   3581
  64K   2953   3006
 128K   3150   2857
 256K   3773   3914
 512K   3204   3680
1024K   3897   4593
2048K   4379   3234
4096K   3568   4970
8192K   4477   5339

Conclusion on clearing memory that is not in the cache

If your memory is not in the cache, then the performance of AVX-512 stores and rep stosb is about the same when you need to fill memory with zeros. It is the cache that matters, not the choice between these two methods.

The use of non-temporal store to clear the memory which is not in the cache

I was zeroizing 6-10 GB of memory split into a sequence of buffers aligned by 64 bytes. No buffer was zeroized twice. Smaller buffers had some overhead, and I had only 16 GB of physical memory, so I zeroized less memory in total with smaller buffers. I ran tests with buffer sizes starting from 256 bytes and up to 8 GB per buffer. I used 3 different methods:

  1. Normal AVX-512 store by vmovdqa64 [rcx+imm], zmm31 (a loop of 4 stores and then compare the counter);
  2. Non-temporal AVX-512 store by vmovntdq [rcx+imm], zmm31 (same loop of 4 stores);
  3. rep stosb.
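
To show the shape of a non-temporal store loop, here is a minimal C sketch. Note the swap: it uses the SSE2 intrinsic `_mm_stream_si128` (movntdq with 16-byte XMM stores) rather than the AVX-512 vmovntdq on ZMM registers used in my tests, so that it builds without AVX-512 support. The function name `zero_nontemporal` is hypothetical, and the sketch assumes a 16-byte-aligned destination and a length that is a multiple of 16.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_stream_si128, _mm_setzero_si128, _mm_sfence */
#endif

/* Zero n bytes at dst with cache-bypassing stores where SSE2 is available. */
static void zero_nontemporal(void *dst, size_t n)
{
    assert(((uintptr_t)dst % 16) == 0 && (n % 16) == 0);
#if defined(__SSE2__)
    __m128i *p = (__m128i *)dst;
    const __m128i zero = _mm_setzero_si128();
    for (size_t i = 0; i < n / 16; i++)
        _mm_stream_si128(p + i, zero);   /* non-temporal 16-byte store */
    _mm_sfence();  /* NT stores are weakly ordered; order them before later stores */
#else
    memset(dst, 0, n);                   /* portable fallback, not non-temporal */
#endif
}
```

Stores like this skip the cache, which is why they only pay off when the data will not be read again soon.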

For small buffers, the normal AVX-512 store was the winner. Then, starting from 4KB, the non-temporal store took the lead, while rep stosb still lagged behind.

Then, from 256KB, rep stosb outperformed the normal AVX-512 store, but not the non-temporal store, and from then on the situation didn't change. The winner was the non-temporal AVX-512 store, then came rep stosb, and then the normal AVX-512 store.

Figure 3. Speed of store to the multiple buffers, once each, MB/s by three different methods: normal AVX-512 store, nontemporal AVX-512 store and rep stosb.

Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 2.90s, 2.30 GB/s by normal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by nontemporal AVX-512 store
Zeroized 6.67 GB: 27962026 blocks of 256 bytes for 3.05s, 2.18 GB/s by rep stosb

Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.06s, 2.62 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.02s, 2.65 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 16777216 blocks of 512 bytes for 3.66s, 2.18 GB/s by rep stosb

Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.10s, 2.87 GB/s by normal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 3.37s, 2.64 GB/s by nontemporal AVX-512 store
Zeroized 8.89 GB: 9320675 blocks of 1 KB for 4.85s, 1.83 GB/s by rep stosb

Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.45s, 2.73 GB/s by normal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 3.79s, 2.48 GB/s by nontemporal AVX-512 store
Zeroized 9.41 GB: 4934475 blocks of 2 KB for 4.83s, 1.95 GB/s by rep stosb

Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by normal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 3.46s, 2.81 GB/s by nontemporal AVX-512 store
Zeroized 9.70 GB: 2542002 blocks of 4 KB for 4.40s, 2.20 GB/s by rep stosb

Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.24s, 3.04 GB/s by normal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 2.65s, 3.71 GB/s by nontemporal AVX-512 store
Zeroized 9.85 GB: 1290555 blocks of 8 KB for 3.35s, 2.94 GB/s by rep stosb

Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.37s, 2.94 GB/s by normal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 2.73s, 3.63 GB/s by nontemporal AVX-512 store
Zeroized 9.92 GB: 650279 blocks of 16 KB for 3.53s, 2.81 GB/s by rep stosb

Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.19s, 3.12 GB/s by normal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 2.64s, 3.77 GB/s by nontemporal AVX-512 store
Zeroized 9.96 GB: 326404 blocks of 32 KB for 3.44s, 2.90 GB/s by rep stosb

Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.08s, 3.24 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 2.58s, 3.86 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 163520 blocks of 64 KB for 3.29s, 3.03 GB/s by rep stosb

Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.22s, 3.10 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 2.49s, 4.01 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 81840 blocks of 128 KB for 3.26s, 3.07 GB/s by rep stosb

Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.52s, 3.97 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 1.98s, 5.06 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 40940 blocks of 256 KB for 2.43s, 4.11 GB/s by rep stosb

Zeroized 10.00 GB: 20475 blocks of 512 KB for 2.15s, 4.65 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.70s, 5.87 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 20475 blocks of 512 KB for 1.81s, 5.53 GB/s by rep stosb

Zeroized 10.00 GB: 10238 blocks of 1 MB for 2.18s, 4.59 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.50s, 6.68 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 10238 blocks of 1 MB for 1.63s, 6.13 GB/s by rep stosb

Zeroized 10.00 GB: 5119 blocks of 2 MB for 2.02s, 4.96 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.59s, 6.30 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 5119 blocks of 2 MB for 1.54s, 6.50 GB/s by rep stosb

Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.90s, 5.26 GB/s by normal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.37s, 7.29 GB/s by nontemporal AVX-512 store
Zeroized 10.00 GB: 2559 blocks of 4 MB for 1.47s, 6.81 GB/s by rep stosb

Zeroized 9.99 GB: 1279 blocks of 8 MB for 2.04s, 4.90 GB/s by normal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.51s, 6.63 GB/s by nontemporal AVX-512 store
Zeroized 9.99 GB: 1279 blocks of 8 MB for 1.56s, 6.41 GB/s by rep stosb

Zeroized 9.98 GB: 639 blocks of 16 MB for 1.93s, 5.18 GB/s by normal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.37s, 7.30 GB/s by nontemporal AVX-512 store
Zeroized 9.98 GB: 639 blocks of 16 MB for 1.45s, 6.89 GB/s by rep stosb

Zeroized 9.97 GB: 319 blocks of 32 MB for 1.95s, 5.11 GB/s by normal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.41s, 7.06 GB/s by nontemporal AVX-512 store
Zeroized 9.97 GB: 319 blocks of 32 MB for 1.42s, 7.02 GB/s by rep stosb

Zeroized 9.94 GB: 159 blocks of 64 MB for 1.85s, 5.38 GB/s by normal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.33s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 9.94 GB: 159 blocks of 64 MB for 1.40s, 7.09 GB/s by rep stosb

Zeroized 9.88 GB: 79 blocks of 128 MB for 1.99s, 4.96 GB/s by normal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.42s, 6.97 GB/s by nontemporal AVX-512 store
Zeroized 9.88 GB: 79 blocks of 128 MB for 1.55s, 6.37 GB/s by rep stosb

Zeroized 9.75 GB: 39 blocks of 256 MB for 1.83s, 5.32 GB/s by normal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.32s, 7.38 GB/s by nontemporal AVX-512 store
Zeroized 9.75 GB: 39 blocks of 256 MB for 1.64s, 5.93 GB/s by rep stosb

Zeroized 9.50 GB: 19 blocks of 512 MB for 1.89s, 5.02 GB/s by normal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.31s, 7.27 GB/s by nontemporal AVX-512 store
Zeroized 9.50 GB: 19 blocks of 512 MB for 1.42s, 6.71 GB/s by rep stosb

Zeroized 9.00 GB: 9 blocks of 1 GB for 1.76s, 5.13 GB/s by normal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.26s, 7.12 GB/s by nontemporal AVX-512 store
Zeroized 9.00 GB: 9 blocks of 1 GB for 1.29s, 7.00 GB/s by rep stosb

Zeroized 8.00 GB: 4 blocks of 2 GB for 1.48s, 5.42 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.07s, 7.49 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 4 blocks of 2 GB for 1.15s, 6.94 GB/s by rep stosb

Zeroized 8.00 GB: 2 blocks of 4 GB for 1.48s, 5.40 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.08s, 7.40 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 2 blocks of 4 GB for 1.14s, 7.00 GB/s by rep stosb

Zeroized 8.00 GB: 1 blocks of 8 GB for 1.50s, 5.35 GB/s by normal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.07s, 7.47 GB/s by nontemporal AVX-512 store
Zeroized 8.00 GB: 1 blocks of 8 GB for 1.21s, 6.63 GB/s by rep stosb

Avoiding AVX-SSE transition penalties

For all the AVX-512 code, I used the ZMM31 register: SSE registers are numbered 0 to 15, so the AVX-512 registers 16 to 31 have no SSE counterparts and thus do not incur the transition penalty.

Maxim Masiutin
  • 2
    *than other methods* - well, compared to that *one* other method you tested, which apparently only stored 16 bytes every other clock cycle. (~104GB/s is I'm assuming 32B/c on a ~3.3GHz CPU.) A manual memset loop should achieve one store per cycle with hits in L1d cache, so your test loop is making SSE look bad. And if you'd used AVX, you should be able to match memset for small to medium blocks. (Cache hits, where we don't want to use movnt or whatever no-RFO protocol rep stos might use.) – Peter Cordes May 16 '21 at 01:36
  • @PeterCordes - I have added a section about AVX-512 stores, please review. – Maxim Masiutin May 16 '21 at 05:55
  • *How can I take benefits of vmovntdq when filling memory buffers? Should there be some specific size of the buffer in order vmovntdq to take an advantage?* - yes, real-world memcpy / memset implementations like glibc's have a tuning variable which they compare against the size to decide whether to use NT stores for huge copies. – Peter Cordes May 16 '21 at 13:44
  • @PeterCordes I don't know in which particular setting I should use vmovntdq, but in all of my tests with a buffer of up to 32K filled multiple times it was about 3 times slower. Didn't test other setups though. – Maxim Masiutin May 16 '21 at 17:08
  • Yeah, of course it's slower for sizes that fit in L1d cache! It forces the stores to go all the way to DRAM. The tuning threshold for using NT stores would normally be something like the size of L3 cache, I think. – Peter Cordes May 16 '21 at 17:21
  • @PeterCordes Thank you! We discussed AVX-512 about three years ago, when the only implementation was Knights Landing, which I did not have. Then, in June 2017, the first mainstream CPU, Core i9-7900X, became available, and I managed to get one in July 2017 to try AVX-512 instructions. Now even notebooks with the i7 1065G7 have AVX-512, so things have changed since then, and the optimization strategies have changed too. So I'm trying to keep up with these changes. – Maxim Masiutin May 16 '21 at 17:29
  • Note that Ice Lake (i7 1065G7) has the "fast short-rep" feature, which apparently only speeds up `rep movsb` for 1..128 bytes, unfortunately not stosb. https://www.phoronix.com/scan.php?page=news_item&px=Intel-5.6-FSRM-Memmove. Other than that, SKX is a "server" chip with a mesh interconnect (and lower single-core memory bandwidth), while ICL is a "client" chip. Still, I don't think NT stores were ever good for small buffers, especially not if something's going to read the buffer soon. (Avoiding NT stores can let that hit in cache if the buffer is small enough). – Peter Cordes May 16 '21 at 17:34
  • @PeterCordes -- Thank you! I will do more benchmarks. I will check the fast short rep as well. It seems that the fast short rep is a nice alternative to AVX moves, especially since the code size of rep movsb is microscopic compared to the AVX moves, and, on average, it should provide good performance! – Maxim Masiutin May 16 '21 at 20:10
  • Note that using 512-bit registers at all imposes a max-turbo penalty, so it's not something you want to do just for memset / memcpy in a program that otherwise doesn't do much / anything with 512-bit vectors. – Peter Cordes May 17 '21 at 03:45
  • Your last set of benchmarks showing speeds around 3GB/s for buffers that fit in L1d cache is *way* slower than one would expect. IDK if you were getting page faults during those tests (like unmapping and re-mapping them?) but your early benchmarks showed 60x better performance (e.g. you mention 219464MB/s = 219GB/s aligned rep stosb on the same Ice Lake CPU). So clearly you're doing something wrong if that performance doesn't show up for any size. – Peter Cordes May 17 '21 at 03:51
  • Also, that last section mentions "temporal" stores. That's a weird way to describe normal stores; you know "*non*-temporal" means "won't be re-read any time soon", right? Temporal in general means related to time. Anyway, I'd suggest you avoid using the term "temporal" anywhere, and use "normal" vs. "non-temporal", since I'm not 100% sure whether you meant "normal" or if "temporal" was a typo for "non-temporal" – Peter Cordes May 17 '21 at 03:54
  • 1
    @PeterCordes those earlier cases with 60x better performance filled the same buffer over and over again, while the later cases used each buffer just once. That explains the difference in performance. – Maxim Masiutin May 17 '21 at 15:30
  • @PeterCordes - Thank you! I have fixed the "temporal" typo – Maxim Masiutin May 17 '21 at 18:16
  • Is there any difference between `stosb` and, say, `stosd`? – Dan M. Sep 02 '21 at 10:07
  • 1
    @DanM. I've done tests on Skylake and later microarchitectures, and `stosb` and `stosd` perform the same. There were some cases where `stosd` (`stosq`) was faster than `stosb`, but the difference was negligible; I even thought it was a minor random difference. I could not prove that `stosq` is always faster. But anyway, `stosq` was never slower than `stosb`. – Maxim Masiutin Sep 02 '21 at 12:59