71

I have observed on a system that std::fill on a large std::vector<int> was significantly and consistently slower when setting a constant value 0 compared to a constant value 1 or a dynamic value:

5.8 GiB/s vs 7.5 GiB/s

However, the results are different for smaller data sizes, where fill(0) is faster:

performance for single thread at different data sizes

With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

performance for various thread counts at large data size

This raises the secondary question of why the peak bandwidth of fill(1) is so much lower.

The test system for this was a dual-socket Intel Xeon E5-2680 v3 set at 2.5 GHz (via /sys/cpufreq) with 8x16 GiB DDR4-2133. I tested with GCC 6.1.0 (-O3) and the Intel compiler 17.0.1 (-fast); both produce identical results. GOMP_CPU_AFFINITY=0,12,1,13,2,14,3,15,4,16,5,17,6,18,7,19,8,20,9,21,10,22,11,23 was set. STREAM add with 24 threads gets 85 GiB/s on the system.

I was able to reproduce this effect on a different Haswell dual-socket server system, but not on any other architecture. For example, on Sandy Bridge-EP, memory performance is identical, while in cache fill(0) is much faster.

Here is the code to reproduce:

#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <omp.h>
#include <vector>

using value = int;
using vector = std::vector<value>;

constexpr size_t write_size = 8ll * 1024 * 1024 * 1024;
constexpr size_t max_data_size = 4ll * 1024 * 1024 * 1024;

void __attribute__((noinline)) fill0(vector& v) {
    std::fill(v.begin(), v.end(), 0);
}

void __attribute__((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}

void bench(size_t data_size, int nthreads) {
#pragma omp parallel num_threads(nthreads)
    {
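        // Each thread fills its own private slice of the data; summed over all
        // threads and repeats, each timed phase writes write_size bytes in total.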
        vector v(data_size / (sizeof(value) * nthreads));
        auto repeat = write_size / data_size;
#pragma omp barrier
        auto t0 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill0(v);
#pragma omp barrier
        auto t1 = omp_get_wtime();
        for (auto r = 0; r < repeat; r++)
            fill1(v);
#pragma omp barrier
        auto t2 = omp_get_wtime();
#pragma omp master
        std::cout << data_size << ", " << nthreads << ", " << write_size / (t1 - t0) << ", "
                  << write_size / (t2 - t1) << "\n";
    }
}

int main(int argc, const char* argv[]) {
    std::cout << "size,nthreads,fill0,fill1\n";
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, 1);
    }
    for (size_t bytes = 1024; bytes <= max_data_size; bytes *= 2) {
        bench(bytes, omp_get_max_threads());
    }
    for (int nthreads = 1; nthreads <= omp_get_max_threads(); nthreads++) {
        bench(max_data_size, nthreads);
    }
}

The presented results were compiled with g++ fillbench.cpp -O3 -o fillbench_gcc -fopenmp.

Peter Cordes
Zulan
  • What is the `data size` when you are comparing the number of threads? – Gavin Portwood Mar 03 '17 at 22:26
  • 1
    @GavinPortwood 4 GiB, so in memory, not cache. – Zulan Mar 03 '17 at 23:41
  • Then there must be something wrong with the second plot, the weak-scaling. I can't imagine it would take more than two or so threads to saturate memory bandwidth for a loop with minimal intermediate operations. In fact, you haven't identified the threads count where the bandwidth saturates even at 24 threads. Can you show that it does level out at some finite thread count? – Gavin Portwood Mar 04 '17 at 18:19
  • @GavinPortwood On this system it is in accordance with other benchmark numbers that the bandwidth is saturated at ~7 of 12 core for one socket. See for example [the stream numbers](https://www.pugetsystems.com/labs/hpc/Memory-Performance-for-Intel-Xeon-Haswell-EP-DDR4-596/), where there is a factor of ~5 between single core and all cores. What I cannot easily explain is the behavior of the second socket (13-24 threads). I would have expected a similar slope and saturation as for the first socket (1-12 threads). I assume this has something to do with asymmetric thread distribution. – Zulan Mar 05 '17 at 09:52
  • @GavinPortwood I reran the experiments with different affinity settings (spreading across the two sockets) and updated the picture. You can see the saturation better. But the main pattern remains: `fill(1)` has a higher slope but a much lower maximum bandwidth than `fill(0)`. – Zulan Mar 05 '17 at 10:32
  • 2
    I suspect the anomalous scaling in your original experiment (on the second socket) is related to non-homogenous memory allocation and the resulting QPI communication. That can be verified with Intel's "uncore" PMUs (i think) – Gavin Portwood Mar 05 '17 at 22:49
  • I am slowly starting to look into your question http://stackoverflow.com/q/43343231/2542702 – Z boson Apr 11 '17 at 11:32
  • 1
    FWIW - you found the code difference in your answer and I think Peter Cordes has the answer below: that `rep stosb` is using a non-RFO protocol which halves the number of transactions needed to do a fill. The rest of the behavior mostly falls out of that. There is one other disadvantage the `fill(1)` code has: it can't use 256-bit AVX stores because you aren't specifying `-march=haswell` or whatever, so it has to fall back to 128-bit code. `fill(0)` which calls `memset` get the advantage of `libc` dispatching that calls the AVX version on your platform. – BeeOnRope Jul 10 '17 at 19:47
  • You could try with the `-march` argument at compile to do somewhat more of an apples-to-apples comparison: this will mostly help for small buffers that fit in some level of the cache, but not the larger copies. – BeeOnRope Jul 10 '17 at 19:47
  • @BeeOnRope `-march=native` gives a `vmovdq` loop, which only seems to increase L1 performance, though not quite to the level of `rep stos`. – Zulan Jul 10 '17 at 22:07
  • Right - but was it using `ymm` or `xmm` regs? That's the key difference (256-bit vs 128-bit). I guess your results make sense - I think the L2 has a bandwidth of 32 bytes/cycle, which would seem to need 32 byte stores (at the max of 1 per cycle) to saturate it, but without NT stores the bandwidth is split in half between the actual stores and the RFO requests, so 16 bytes of reads is "enough" to saturate even L2 (same logic applies for L3, more or less). L1, on the hand, can sustain 32 bytes of writes per cycle, so 256-bit is a win there. – BeeOnRope Jul 10 '17 at 23:30
  • That was `ymm`, I added the results to my answer, also including intrinsic non-temporal. – Zulan Jul 11 '17 at 14:57

2 Answers

43

From your question + the compiler-generated asm from your answer:

  • fill(0) is an ERMSB rep stosb which will use 256b stores in an optimized microcoded loop. (Works best if the buffer is aligned, probably to at least 32B or maybe 64B).
  • fill(1) is a simple 128-bit movaps vector store loop. Only one store can execute per core clock cycle regardless of width, up to 256b AVX. So 128b stores can only fill half of Haswell's L1D cache write bandwidth. This is why fill(0) is about 2x as fast for buffers up to ~32kiB. Compile with -march=haswell or -march=native to fix that.

    Haswell can just barely keep up with the loop overhead, but it can still run 1 store per clock even though it's not unrolled at all. But with 4 fused-domain uops per clock, that's a lot of filler taking up space in the out-of-order window. Some unrolling would maybe let TLB misses start resolving farther ahead of where stores are happening, since there is more throughput for store-address uops than for store-data. Unrolling might help make up the rest of the difference between ERMSB and this vector loop for buffers that fit in L1D. (A comment on the question says that -march=native only helped fill(1) for L1.)

Note that rep stosd (which could be used to implement fill(1) for int elements) will probably perform the same as rep stosb on Haswell. Although the official documentation only guarantees that ERMSB gives fast rep stosb (not rep stosd), actual CPUs that support ERMSB use similarly efficient microcode for rep stosd. There is some doubt about IvyBridge, where maybe only the b variant is fast. See @BeeOnRope's excellent ERMSB answer for updates on this.

gcc has some x86 tuning options for string ops (like -mstringop-strategy=alg and -mmemset-strategy=strategy), but IDK if any of them will get it to actually emit rep stosd for fill(1). Probably not, since I assume the code starts out as a loop rather than a memset.
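If you want to test that idea by hand (a sketch of mine, not from the original answer; osgx suggests the same experiment in a comment below), fill(1) can be written directly with rep stosd using GNU inline asm and compared against the compiler-generated loops:

#include <cstddef>

// Hand-rolled `rep stosd` fill for int (x86-64, GCC/Clang inline asm).
// Sketch for experimentation only; assumes DF=0, as the ABI guarantees.
void fill1_rep_stosd(int* p, std::size_t n) {
    asm volatile("rep stosl"          // AT&T mnemonic for rep stosd
                 : "+D"(p), "+c"(n)   // rdi = destination, rcx = element count
                 : "a"(1)             // eax = the dword value to store
                 : "memory");
}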


With more than one thread, at 4 GiB data size, fill(1) shows a higher slope, but reaches a much lower peak than fill(0) (51 GiB/s vs 90 GiB/s):

A normal movaps store to a cold cache line triggers a Read For Ownership (RFO). A lot of real DRAM bandwidth is spent on reading cache lines from memory when movaps writes the first 16 bytes. ERMSB uses a no-RFO protocol for its stores, so the memory controllers are only writing. (Except for miscellaneous reads, like page tables if any page walks miss even in L3 cache, and maybe some load misses in interrupt handlers or whatever.) With the numbers above, the ~51 GiB/s written by fill(1) implies roughly twice that in total DRAM traffic, which is in the same ballpark as the ~90 GiB/s of pure writes that fill(0) reaches.

@BeeOnRope explains in comments that the difference between regular RFO stores and the RFO-avoiding protocol used by ERMSB has downsides for some ranges of buffer sizes on server CPUs where there's high latency in the uncore/L3 cache. See also the linked ERMSB answer for more about RFO vs non-RFO, and the high latency of the uncore (L3/memory) in many-core Intel CPUs being a problem for single-core bandwidth.


movntps (_mm_stream_ps()) stores are weakly-ordered, so they can bypass the cache and go straight to memory a whole cache-line at a time without ever reading the cache line into L1D. movntps avoids RFOs, like rep stos does. (rep stos stores can reorder with each other, but not outside the boundaries of the instruction.)

Your movntps results in your updated answer are surprising.
For a single thread with large buffers, your results are movnt >> regular RFO > ERMSB. So that's really weird that the two non-RFO methods are on opposite sides of the plain old stores, and that ERMSB is so far from optimal. I don't currently have an explanation for that. (edits welcome with an explanation + good evidence).

As we expected, movnt allows multiple threads to achieve high aggregate store bandwidth, like ERMSB. movnt always goes straight into line-fill buffers and then memory, so it is much slower for buffer sizes that fit in cache. One 128b vector per clock is enough to easily saturate a single core's no-RFO bandwidth to DRAM. Probably vmovntps ymm (256b) is only a measurable advantage over vmovntps xmm (128b) when storing the results of a CPU-bound AVX 256b-vectorized computation (i.e. only when it saves the trouble of unpacking to 128b).

movnti bandwidth is low because storing in 4B chunks bottlenecks on 1 store uop per clock adding data to the line fill buffers, not on sending those full buffers to DRAM (until you have enough threads to saturate memory bandwidth). At the 2.5 GHz used here, one 4 B store per clock caps that at about 10 GB/s per core.


@osgx posted some interesting links in the comments; see also other stuff in the x86 tag wiki.

Peter Cordes
  • Although `rep movsd` isn't officially covered by the `ermsb` feature, all recent Intel CPUs (and apparently Ryzen) seem to implement it using the same protocol and it ends up generally having indistinguishable performance. Still, there is little reason to use it, since `rep movsb` pretty much offers a superset of the functionality and who knows how Intel and AMD will optimize them in the future, but in the meantime at least existing code that has `rep movsd` effectively gets the benefit of `ermsb`. – BeeOnRope Jul 10 '17 at 19:29
  • 3
    The behavior described above of `rep movsb` versus an explicit loop of `movaps` on a single core across various buffer sizes is pretty consistent with what we have seen before on server cores. As you point out, the competition is between a non-RFO protocol and the RFO protocol. The former uses less bandwidth between all cache levels, but especially on server chips has a long latency handoff all the way to memory. Since a single core is generally concurrency limited, the latency matters, and the non-RFO protocol wins, which is what you see in the region beyond the 30 MB L3. – BeeOnRope Jul 10 '17 at 19:37
  • 3
    ... in the middle of the graph that fits in L3, however, the long server uncore to memory handoff apparently doesn't come into play, so the read reduction offered by non-RFO wins (but actually it's interesting to compare this to NT stores: would they show the same behavior, or is `rep stosb` able to stop the write at L3 rather than go all the way to memory)? FWIW, the situation for `rep stosb` for `fill` is relatively better, empirically, than for `rep movsb` for `memcpy`. Possibly because the former has a 2:1 advantage in traffic versus 3:2 for the latter. – BeeOnRope Jul 10 '17 at 19:42
  • Some links to measurement on the topic in [this answer](https://stackoverflow.com/a/43574756/149138) under the "Latency Bound Platforms" heading. It is talking about `movsb` not `stosb`, but the same general pattern applies. – BeeOnRope Jul 10 '17 at 21:24
  • This answer is most excellent, and @BeeOnRope finally clarifies the anomaly for me. I saw your excellent answer before, but now I feel I understood it :). – Zulan Jul 10 '17 at 22:09
  • 1
    I tried `movntps` and if I'm using it correctly, it shows the memory-bandwidth among all data sizes - so it doesn't benefit from caches at all. But for a single thread, that is twice the memory bandwidth of `movaps`, and for 24 threads it's slightly higher than `rep stosb`. – Zulan Jul 10 '17 at 22:45
  • @Zulan - ok that is a very interesting result for `movntps`. It makes sense: `movntps` is saying "force this write all the way to memory" which means you will generally get the same behavior even for smaller sizes. `rep movsb` on the other hand is going to be size-aware, so will only switch into non-RFO protocol at some threshold. A real world implementation of `memset` or `fill` would also likely switch over to NT only after some threshold (often "50% of the L3 cache size" or something like that). – BeeOnRope Jul 10 '17 at 23:05
  • @BeeOnRope: Can `rep stos` avoid RFO without force-evicting lines from cache, or bypassing the cache? Those are two separate things, so couldn't there be a non-RFO protocol that leaves data in cache? – Peter Cordes Jul 11 '17 at 00:27
  • @Zulan: just to confirm, you used 128b SSE2 or AVX `_mm_stream_ps` (`[v]movntps [mem], xmm`), not AVX 256b `_mm256_stream_ps` (`vmovntps [mem], ymm`), right? – Peter Cordes Jul 11 '17 at 04:20
  • @PeterCordes actually, there's no performance difference between 128/256. Please see the update to my answer for detailed results. – Zulan Jul 11 '17 at 14:59
  • @Zulan: That makes sense, since MOVNT is always going straight into line-fill buffers and then memory. And 128b vectors are enough to saturate that easily. I guess the only time `vmovntps ymm` would be an advantage is when storing the results of a 256b-vectorized computation that was CPU bound (but would be memory-bound if you didn't use NT stores). Unpacking to 128b stores would take extra shuffles, so obviously you just want to use 256b NT stores if your data is already in 256b vectors. – Peter Cordes Jul 11 '17 at 15:03
  • @PeterCordes did you ever get an answer to your question to @BeeOnRope? "Can rep stos avoid RFO without force-evicting lines from cache, or bypassing the cache? Those are two separate things, so couldn't there be a non-RFO protocol that leaves data in cache?" Also do you know if ```rep stosb``` is ever implemented with non-temporal stores? – Noah Jan 11 '21 at 16:39
  • @BeeOnRope (or anyone else since BeeOnRope seems inactive) why does ```rep movsb``` have a longer latency handoff on server chips? [The post you linked](https://stackoverflow.com/questions/43343231/enhanced-rep-movsb-for-memcpy/43574756#43574756) seems to indicate that the increase/decrease in latency is due to handoff time from LFB to memory (dma device?) which is dependent on whether the LFB cache line is in LLC or L2. Since ```rep movsb``` prefetches (and that post indicates it prefetches better than a ```movaps``` loop) wouldn't the handoff latency be low or equal to ```movaps``` loop? – Noah Jan 11 '21 at 17:15
  • @Noah: `rep movsb` can do no-RFO stores (since P6), like movntps except it doesn't force eviction from cache. Since IvB (ERMSB), they're even weakly-ordered. It's still different so we can't say for sure, but NT stores also have a similar slower handoff probably for some similar internal reason. (Which I don't particularly understand.) – Peter Cordes Jan 11 '21 at 17:48
  • @PeterCordes what do you mean "doesn't force eviction from cache"? Do you mean it a memcpy with ```rep movsb``` will leave cache in state it was before the memcpy? That it doesn't invalidate lines on other cores? or that it will go through cache on the core it is doing the memcpy on (i.e only invalidating other lines rather than loading them fully with an RFO)? or something different all together? It seems it prefetches (so goes through cache in some way unlike ```NT``` store) but I am having trouble making a mental model for what it does internally. – Noah Jan 11 '21 at 17:54
  • @PeterCordes I am considering making a post asking about the internals of ```rep movsb``` to try and understand all the information in this post, @BeeOnRope's comments, and @BeeOnRope's other post that they linked. Or am I missing something obvious and just misreading? – Noah Jan 11 '21 at 17:56
  • @Noah: No, of course it doesn't break coherence; it invalidates instead of RFOing before doing a full-line store that makes it pointless to have read the old contents of the line, saving bandwidth. An NT store guarantees that the line won't be present in any caches, like `movaps` + `clflush` (but diff perf). After `rep movsb`, (part of) the destination *can* still be hot in this core's caches when you're done, unlike a `movnt` store loop. That's part of what @ Bee's ERMSB answer explains, isn't it? – Peter Cordes Jan 11 '21 at 18:03
  • @PeterCordes I see. Will it actively load the destinations into cache or does it just leave what was there intact? I.e memcpy on hot src/dst with ```rep movsb``` src and dst will both stay in cores caches. will memcpy on cold src/dst with ```rep movsb``` not load either or them or will it load both (pretty sure the latter, just want to verify as ```rep movsb``` seems different from both ```movaps``` and ```movnt```). As a side note does ```vmodqa zmm``` also bypass the RFO and just invalidate or is ```rep movsb``` special? – Noah Jan 11 '21 at 18:54
  • 1
    @Noah: it should be obvious that after any store, the cache line will definitely not *still* be hot in some other core's private cache. There's no shared bus for a core to broadcast the new data on (instead it's directory-based coherence with L3 tags or similar structure as the directory). The storing core needs exclusive ownership before updating its own L1d, by invalidating other copies, and has to wait for an acknowledgement of the invalidation. It has to maintain coherence if 2 cores try to `rep movsb` to the same destination at once. – Peter Cordes Jan 12 '21 at 02:46
  • 1
    @Noah: Re: full-line ZMM stores avoiding an RFO: good question, I don't know but it's 100% possible. Internally it could work exactly like a full-line store from rep stos / rep movs. It's something I've wondered, but I forget if I ever found an answer, or what it was for different microarchitectures. (It's an optimization that can of course be added to a later design if SKX or KNL didn't have it.) There might be some reason it's only worth it for a long stream of stores, like somehow taking longer to do something, maybe delaying later stores and stalling the store buffer. – Peter Cordes Jan 12 '21 at 02:50
  • 1
    @Noah: Forgot to mention: `rep movs` / `rep stos` might even be adaptive in strategy with some large-size cutoff, like maybe using actual NT stores that bypass cache for very large stores. Or doing something simpler for small copies that only touch a couple lines. But more microcode branching could increase startup overhead so they wouldn't do that without good reason, but it's possible and something to keep in mind if trying to figure out what they do with experiments with perf counters and small to medium copies. – Peter Cordes Jan 12 '21 at 02:53
  • @PeterCordes re: "it should be obvious that after any store, the cache line will definitely not still be hot in some other core's private cache" I meant in the core doing the memcpy. But your next point that its unknown if the microcode will branch for ```NT``` stores indicates that the effect on the cache of the core doing the memcpy is unknown/dependent on rcx. Ill post a question about the ```zmm``` case (or any case where the store buffer could know a full cache line is being overwritten). – Noah Jan 12 '21 at 04:24
  • @Noah: Oh, I think I misread your first comment. AFAIK, `rep movsb` doesn't use cache-bypassing loads (because that would not be coherent, and there'd be nowhere to prefetch into; the OoO exec window isn't big enough to hide the full load latency from DRAM). So the microcoded loads are basically just normal loads like `vmovups`. Possibly with something like `prefetchnta` for some prefetch distance to reduce pollution from loads, but I wouldn't bet on it. So after rep movsb, you can expect the end of the source data to be hot in cache, too. – Peter Cordes Jan 12 '21 at 04:53
  • Microcode can only use uops that the back-end supports, and those uops have to go through the pipeline normally. There isn't (unfortunately) a dedicated memcpy state machine (like a page walker) that `rep movsb` could offload to (a decision that [Andy Glew regretted after the fact](https://stackoverflow.com/a/33905887/224132)). I've been hoping to hear details of Ice Lake's "fast short rep movs" support, whether it's just better microcode or dedicated hardware state-machine that can access cache in parallel with the normal load/store units. – Peter Cordes Jan 12 '21 at 04:53
  • @PeterCordes hmm I am totally unable to get less RFO requests using ```rep movsb```. With ```vmovntdq``` I see fewer than with ```vmovdqa``` but I reliably see more with ```rep movsb``` than either of the other two. I don't think the issue is ICL as the change [appears to only be for short copies](https://www.phoronix.com/scan.php?page=news_item&px=Intel-5.6-FSRM-Memmove). Made a [godbolt link with my benchmark](https://godbolt.org/z/r741qd). Any idea what I'm messing up? – Noah Feb 19 '21 at 08:06
  • @Noah: IDK, seems strange. Ask a new question. – Peter Cordes Feb 19 '21 at 08:27
  • 1
    @PeterCordes [question](https://stackoverflow.com/questions/66274948/why-am-i-seeing-more-rfo-read-for-ownership-requests-using-rep-movsb-than-with) if you have any ideas :P . Also in doing these tests I didn't see any reduction in RFO requests when using ```zmm``` registers (i.e temporal store with ```zmm``` get same number of RFO as temporal stores with ```ymm``` / ```xmm``` and likewise for non-temporal stores) – Noah Feb 19 '21 at 09:32
29

I'll share my preliminary findings, in the hope of encouraging more detailed answers. I just felt this would be too much as part of the question itself.

The compiler optimizes fill(0) to an internal memset. It cannot do the same for fill(1), since memset only works on bytes.
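To make that concrete (my illustration, not part of the original post): an int holding 0 is four 0x00 bytes, so the loop can legally be lowered to memset, while an int holding 1 has the byte pattern 01 00 00 00, which no repeated single byte can produce:

#include <cstring>
#include <vector>

// fill(v.begin(), v.end(), 0) can be lowered to memset: every result byte is 0x00.
void fill0_via_memset(std::vector<int>& v) {
    std::memset(v.data(), 0, v.size() * sizeof(int));
}

// fill(v.begin(), v.end(), 1) cannot: each element is the bytes 01 00 00 00,
// so the compiler has to emit int-wide (or vectorized) stores instead.
void fill1_no_memset(std::vector<int>& v) {
    for (int& x : v) x = 1;
}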

Specifically, both glibc's __memset_avx2 and __intel_avx_rep_memset are implemented with a single hot instruction:

rep    stos %al,%es:(%rdi)

Whereas the manual loop compiles down to an actual 128-bit instruction:

add    $0x1,%rax
add    $0x10,%rdx
movaps %xmm0,-0x10(%rdx)
cmp    %rax,%r8
ja     400f41

Interestingly, while there is a template/header optimization to implement std::fill via memset for byte types, in this case it is a compiler optimization that transforms the actual loop. Strangely, for a std::vector<char>, gcc also begins to optimize fill(1). The Intel compiler does not, despite the memset template specialization.
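For comparison, here is a minimal char case (my example, relying on the libstdc++ byte-type specialization mentioned above), where std::fill can go through memset for any fill value:

#include <algorithm>
#include <vector>

// For single-byte element types, libstdc++ dispatches std::fill to memset,
// so both calls below can end up in the same memset / rep stosb path
// regardless of whether the value is 0 or 1.
void fill_char0(std::vector<char>& v) { std::fill(v.begin(), v.end(), '\0'); }
void fill_char1(std::vector<char>& v) { std::fill(v.begin(), v.end(), '\1'); }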

Since this happens only when the code is actually working in memory rather than in cache, it appears that the Haswell-EP architecture fails to efficiently consolidate the single-byte writes.

I would appreciate any further insight into the issue and the related micro-architecture details. In particular it is unclear to me why this behaves so differently for four or more threads and why memset is so much faster in cache.

Update:

Here are the results in comparison with:

  • fill(1) using -march=native (avx2 vmovdq %ymm0) - it works better in L1, but is similar to the movaps %xmm0 version for the other memory levels.
  • Variants with 32-, 128- and 256-bit non-temporal stores. They consistently achieve the same performance regardless of the data size. All outperform the other variants in memory, especially for small numbers of threads. 128-bit and 256-bit perform virtually identically; for low numbers of threads, 32-bit performs significantly worse.

For <= 6 threads, vmovnt has a 2x advantage over rep stos when operating in memory.

Single threaded bandwidth:

single threaded performance by data size

Aggregate bandwidth in memory:

memory performance by thread count

Here is the code used for the additional tests with their respective hot-loops:

void __attribute__ ((noinline)) fill1(vector& v) {
    std::fill(v.begin(), v.end(), 1);
}
┌─→add    $0x1,%rax
│  vmovdq %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rdi,%rax
└──jb     e0


void __attribute__ ((noinline)) fill1_nt_si32(vector& v) {
    for (auto& elem : v) {
       _mm_stream_si32(&elem, 1);
    }
}
┌─→movnti %ecx,(%rax)
│  add    $0x4,%rax
│  cmp    %rdx,%rax
└──jne    18


void __attribute__ ((noinline)) fill1_nt_si128(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m128i buf = _mm_set1_epi32(1);
    size_t i;
    int* data;
    int* end4 = &v[v.size() - (v.size() % 4)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end4; data += 4) {
        _mm_stream_si128((__m128i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %xmm0,(%rdx)
│  add    $0x10,%rdx
│  cmp    %rcx,%rdx
└──jb     40


void __attribute__ ((noinline)) fill1_nt_si256(vector& v) {
    assert((long)v.data() % 32 == 0); // alignment
    const __m256i buf = _mm256_set1_epi32(1);
    size_t i;
    int* data;
    int* end8 = &v[v.size() - (v.size() % 8)];
    int* end = &v[v.size()];
    for (data = v.data(); data < end8; data += 8) {
        _mm256_stream_si256((__m256i*)data, buf);
    }
    for (; data < end; data++) {
        *data = 1;
    }
}
┌─→vmovnt %ymm0,(%rdx)
│  add    $0x20,%rdx
│  cmp    %rcx,%rdx
└──jb     40

Note: I had to do manual pointer calculation in order to get the loops so compact. Otherwise it would do vector indexing within the loop, probably due to the intrinsic confusing the optimizer.

Zulan
  • 3
    `rep stos` **is microcoded** in most CPUs (find "REP STOS" and its "Fused µOps column" in http://www.agner.org/optimize/instruction_tables.pdf tables of Haswell around page 189). Also check CPUID EAX=7, EBX, bit 9 "erms Enhanced REP MOVSB/STOSB" (`grep erms /proc/cpuinfo`) which is flag of additionally optimized microcode for `rep stos` since Nehalem: http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-optimization-manual.pdf "2.5.6 REP String Enhancement" & 3.7.6 ERMSB. You should compare PMU counters to get some information about implementation. – osgx Mar 12 '17 at 11:46
  • 3
    Also, check http://stackoverflow.com/a/26256216 for different optimized memcpy/set (and limits of CPU) and try to ask specific questions on https://software.intel.com/en-us/forums to get some attention from https://software.intel.com/en-us/user/545611. The actual microcode of Haswell may have some problems in NUMA case with coherency protocol, when some of the memory is allocated in memory of different numa node (socket) or memory just can be allocated on other node, so multi-socket coherency protocol is active when cachelines are allocated. Also check errata of Haswell about its microcode. – osgx Mar 12 '17 at 11:55
  • Sometimes there are authors of `rep s*` microcode in intel forums: https://software.intel.com/en-us/forums/intel-visual-fortran-compiler-for-windows/topic/275765 "Seth Abraham (Intel) Fri, 08/04/2006": "*It is still possible to write code that is faster still, but the performance gap is not as large, and it is a little harder than it used to be to beat REP MOVSD/STOSD... You can still beat REP MOVSD/STOSD with such code*". It can be interesting to rewrite your `fill(1)` case by hand with `rep stosd` and compare speed with rep mov. Also: where does your vector allocates its memory, using mmap? – osgx Mar 12 '17 at 12:00
  • Smaller sizes of `vector v` may be allocated in stack (up to and including 131072 bytes) which is wrong for NUMA; and bigger vectors are probably allocated by mmap which is only correct way for NUMA. When the memory page is accessed in first time for writing, it will be allocated on some NUMA node or another. Always write to memory from the same NUMA node where you will work with it. For stack your memory placement may be from previous iteration of bench, which is incorrect for other size of bench. Same can be true for some sizes of malloc, when glibc does not return memory back to OS. – osgx Mar 12 '17 at 12:42
  • @osgx thank you for the excellent input. Yes, the CPU supports *erms*. I'm quite sure that `std::vector` allocates through `malloc` / `mmap`. Since the vector is only declared thread private, and the thread pinned, it will be both allocated and first-touched by the NUMA node that eventually uses it. I would strongly hope that the (thread) stack is also allocated on the NUMA node that runs the thread. – Zulan Mar 21 '17 at 12:18
  • 1
    Welcome to the NUMA world. vector is allocated with `malloc`, used correctly with first touch placing, but its deallocation with `free` will just mark memory as unused, *without returning memory back to OS* - there will be no next touch for next iteration (some outdated info on malloc in http://stackoverflow.com/questions/2215259/ and some in http://stackoverflow.com/a/42281428 "Since 2007 (glibc 2.9 and newer)"). With glibc **do call `malloc_trim()`** between `bench` and the freed memory will be marked as free to OS and retouched for NUMA. Stack is allocated by main thread... – osgx Mar 21 '17 at 13:26
  • @osgx Adding `malloc_trim()` after each `bench` did not result in any significant changes of the performance. I don't see any effects in my results that indicate trouble with NUMA. Even if it was the case, then `fill(0)` and `fill(1)` would be affected the same way! Consider the single socket results in [this chart](https://i.stack.imgur.com/kF8WE.png) (up to 12 threads). – Zulan Mar 21 '17 at 20:07
  • Microcoded `fill(0)` still slower than manual loop of `fill(1)`. There is still NUMA cache coherency (even when NUMA memory placement is correct) which can make microcoded variant slower, can you rerun code not on NUMA machine (when second socket is disabled / or not present)? – osgx Mar 21 '17 at 20:11
  • Would `numactl --membind=0 --cpunodebind=0` suffice? Can't really disable a socket on these systems. – Zulan Mar 21 '17 at 20:21
  • 1
    Zulan, no, software will not disable cache coherency between sockets (second socket should not be booted/QPI disabled). Your E5-2680 v3 is 12 core haswell in MCC (Medium Core Count) die (http://www.anandtech.com/show/8679/intel-haswellep-xeon-12-core-review-e5-2650l-v3-and-e5-2690-v3) and there is cache snooping messages on access: http://frankdenneman.nl/2016/07/11/numa-deep-dive-part-3-cache-coherency/. They are sent both in the ring of local socket and over QPI to next socket. Some versions of Xeons may use "directory" to limit snooping message storms in memory-bound tasks like this one. – osgx Mar 21 '17 at 20:30
  • 1
    You can also check Intel MLC - https://software.intel.com/en-us/articles/intelr-memory-latency-checker for measuring maximal bandwidth of the tested systems as `mlc --bandwidth_matrix` and `mlc --peak_bandwidth`. Also - paper about your Haswell and its cache coherency https://tu-dresden.de/zih/forschung/ressourcen/dateien/abgeschlossene-projekte/benchit/2015_ICPP_authors_version.pdf?lang=en – osgx Mar 21 '17 at 20:41