Check out Edit3 below.
I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That being said, AFAIK I only see a reduction in RFO requests with `rep movsb` as compared to a Temporal Store `memcpy` because of better prefetching on loads and no prefetching on stores, NOT due to RFO requests being optimized out for full-cache-line stores. This kind of makes sense, as we don't see RFO requests optimized out for `vmovdqa` with a `zmm` register, which we would expect if that were really the case for full-cache-line stores. That being said, the lack of prefetching on stores and the lack of non-temporal writes makes it hard to see how `rep movsb` achieves reasonable performance.
Edit: It is possible that the RFO requests from `rep movsb` differ from those for `vmovdqa` in that for `rep movsb` it might not request data, just take the line in Exclusive state. This could also be the case for stores with a `zmm` register. I don't see any perf metrics to test this, however. Does anyone know of any?
Questions
- Why am I not seeing a reduction in RFO requests when I use `rep movsb` for `memcpy` as compared to a `memcpy` implemented with `vmovdqa`?
- Why am I seeing more RFO requests when I use `rep movsb` for `memcpy` as compared to a `memcpy` implemented with `vmovdqa`?
These are two separate questions because I believe I should be seeing a reduction in RFO requests with `rep movsb`, but if that is not the case, should I be seeing an increase as well?
Background
CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I was trying to test the number of RFO requests when using different methods of `memcpy`, including:
- Temporal Stores -> `vmovdqa`
- Non-Temporal Stores -> `vmovntdq`
- Enhanced REP MOVSB -> `rep movsb`
And I have been unable to see a reduction in RFO requests using `rep movsb`. In fact, I have been seeing more RFO requests with `rep movsb` than with Temporal Stores. This is counter-intuitive given that the consensus understanding seems to be that for Ivy Bridge and newer, `rep movsb` is able to avoid RFO requests and in turn save memory bandwidth:
> When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
>
> - Avoiding the RFO request when it knows the entire cache line will be overwritten.

> Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not
I wrote a simple test program to verify this but was unable to do so.
Test Program
#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif
#define load(x) _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);  // mmap returns MAP_FAILED, not NULL, on error
    return p;
}

void BENCH_ATTR
bench() {
    uint64_t len       = 64UL * (1UL << 22);
    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);
    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }
    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;

    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovntdq or vmovdqa
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif
    asm volatile("lfence\n\tmfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}
- Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
- Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Test Data
Note: data with less noise can be found in Edit2.
- TODO = TEMPORAL
583,912,867 cpu-cycles
9,352,817 l2_rqsts.all_rfo
188,343,479 offcore_requests_outstanding.cycles_with_demand_rfo
11,560,370 offcore_requests.demand_rfo
0.166557783 seconds time elapsed
0.044670000 seconds user
0.121828000 seconds sys
- TODO = NON_TEMPORAL
560,933,296 cpu-cycles
7,428,210 l2_rqsts.all_rfo
123,174,665 offcore_requests_outstanding.cycles_with_demand_rfo
8,402,627 offcore_requests.demand_rfo
0.156790873 seconds time elapsed
0.032157000 seconds user
0.124608000 seconds sys
- TODO = REP_MOVSB
566,898,220 cpu-cycles
11,626,162 l2_rqsts.all_rfo
178,043,659 offcore_requests_outstanding.cycles_with_demand_rfo
12,611,324 offcore_requests.demand_rfo
0.163038739 seconds time elapsed
0.040749000 seconds user
0.122248000 seconds sys
- TODO = NONE_OF_THE_ABOVE
521,061,304 cpu-cycles
7,527,122 l2_rqsts.all_rfo
123,132,321 offcore_requests_outstanding.cycles_with_demand_rfo
8,426,613 offcore_requests.demand_rfo
0.139873929 seconds time elapsed
0.007991000 seconds user
0.131854000 seconds sys
Test Results
The baseline for RFO requests, with just the setup but without the `memcpy`, is `TODO = NONE_OF_THE_ABOVE` with 7,527,122 RFO requests.
With `TODO = TEMPORAL` (using `vmovdqa`) we see 9,352,817 RFO requests. This is lower than `TODO = REP_MOVSB` (using `rep movsb`), which has 11,626,162 RFO requests: roughly 2 million more RFO requests with `rep movsb` than with Temporal Stores. The only case where I was able to see RFO requests avoided was `TODO = NON_TEMPORAL` (using `vmovntdq`), which has 7,428,210 RFO requests, about the same as the baseline, indicating none from the `memcpy` itself.
I played around with different sizes for the `memcpy`, thinking I might need to decrease or increase the size for `rep movsb` to trigger that optimization, but I have been seeing the same general results. For all sizes I tested, the number of RFO requests comes in the following order: `NON_TEMPORAL` < `TEMPORAL` < `REP_MOVSB`.
Theories
- [Unlikely] Something new on Icelake?
Edit: @PeterCordes was able to reproduce the results on Skylake.
I don't think this is an Icelake-specific thing, as the only change I could find in the Intel Manual on `rep movsb` for Icelake is:
> Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
This should not be playing a factor in my test program given that `len` is well above 128.
- [Likelier] My test program is broken
I don't see any issues, but this is a very surprising result. At the very least I verified that the compiler is not optimizing out the tests here.
Edit: Fixed build instructions to use `g++` instead of `gcc` and the file postfix from `.c` to `.cc`.
Edit2:
Back to C and `gcc`.
- Better perf recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Numbers with the better perf recipe (same trend but less noise):
- TODO = TEMPORAL
161,214,341 cpu-cycles
1,984,998 l2_rqsts.all_rfo
61,238,129 offcore_requests_outstanding.cycles_with_demand_rfo
3,161,504 offcore_requests.demand_rfo
0.169413413 seconds time elapsed
0.044371000 seconds user
0.125045000 seconds sys
- TODO = NON_TEMPORAL
142,689,742 cpu-cycles
3,106 l2_rqsts.all_rfo
4,581 offcore_requests_outstanding.cycles_with_demand_rfo
30 offcore_requests.demand_rfo
0.166300952 seconds time elapsed
0.032462000 seconds user
0.133907000 seconds sys
- TODO = REP_MOVSB
150,630,752 cpu-cycles
4,194,202 l2_rqsts.all_rfo
54,764,929 offcore_requests_outstanding.cycles_with_demand_rfo
4,194,016 offcore_requests.demand_rfo
0.166844489 seconds time elapsed
0.036620000 seconds user
0.130205000 seconds sys
- TODO = NONE_OF_THE_ABOVE
89,611,571 cpu-cycles
321 l2_rqsts.all_rfo
3,936 offcore_requests_outstanding.cycles_with_demand_rfo
19 offcore_requests.demand_rfo
0.142347046 seconds time elapsed
0.016264000 seconds user
0.126046000 seconds sys
Edit3: This may have to do with hiding RFO events triggered by the L2 prefetcher.
I used the perf recipe @BeeOnRope made that includes RFO events started by the L2 prefetcher:
perf stat --all-user -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/ ./rfo_test
And the equivalent perf recipe without L2 prefetch events:
perf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/ ./rfo_test
And got more reasonable results:
Tl;dr: with prefetch-triggered events counted, we see fewer RFO requests with `rep movsb`. But it does not appear that `rep movsb` actually avoids RFO requests; rather, it just touches fewer cache lines.
Data With and Without Prefetch Triggered Events Included
| TODO = | Perf Event | w/ Prefetching | w/o Prefetching | Difference |
|---|---|---|---|---|
| TEMPORAL | l2_rqsts_references | 16812993 | 4358692 | 12454301 |
| TEMPORAL | l2_rqsts_all_rfo | 14443392 | 1981560 | 12461832 |
| TEMPORAL | l2_rqsts_rfo_hit | 1297932 | 1038243 | 259689 |
| TEMPORAL | l2_rqsts_rfo_miss | 13145460 | 943317 | 12202143 |
| NON_TEMPORAL | l2_rqsts_references | 8820287 | 1946591 | 6873696 |
| NON_TEMPORAL | l2_rqsts_all_rfo | 6852605 | 346 | 6852259 |
| NON_TEMPORAL | l2_rqsts_rfo_hit | 66845 | 317 | 66528 |
| NON_TEMPORAL | l2_rqsts_rfo_miss | 6785760 | 29 | 6785731 |
| REP_MOVSB | l2_rqsts_references | 11856549 | 7400277 | 4456272 |
| REP_MOVSB | l2_rqsts_all_rfo | 8633330 | 4194510 | 4438820 |
| REP_MOVSB | l2_rqsts_rfo_hit | 1394372 | 546 | 1393826 |
| REP_MOVSB | l2_rqsts_rfo_miss | 7238958 | 4193964 | 3044994 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_references | 6058269 | 619924 | 5438345 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_all_rfo | 5103905 | 337 | 5103568 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_rfo_hit | 438518 | 311 | 438207 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_rfo_miss | 4665387 | 26 | 4665361 |
| STORE_ONLY_TEMPORAL | l2_rqsts_references | 8069068 | 837616 | 7231452 |
| STORE_ONLY_TEMPORAL | l2_rqsts_all_rfo | 8033854 | 802969 | 7230885 |
| STORE_ONLY_TEMPORAL | l2_rqsts_rfo_hit | 585938 | 576955 | 8983 |
| STORE_ONLY_TEMPORAL | l2_rqsts_rfo_miss | 7447916 | 226014 | 7221902 |
| STORE_ONLY_REP_STOSB | l2_rqsts_references | 4296169 | 4228643 | 67526 |
| STORE_ONLY_REP_STOSB | l2_rqsts_all_rfo | 4261756 | 4194548 | 67208 |
| STORE_ONLY_REP_STOSB | l2_rqsts_rfo_hit | 17337 | 309 | 17028 |
| STORE_ONLY_REP_STOSB | l2_rqsts_rfo_miss | 4244419 | 4194239 | 50180 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_references | 99713 | 36112 | 63601 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_all_rfo | 64148 | 427 | 63721 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_hit | 17091 | 398 | 16693 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_miss | 47057 | 29 | 47028 |
| NONE_OF_THE_ABOVE | l2_rqsts_references | 74074 | 27656 | 46418 |
| NONE_OF_THE_ABOVE | l2_rqsts_all_rfo | 46833 | 375 | 46458 |
| NONE_OF_THE_ABOVE | l2_rqsts_rfo_hit | 16366 | 344 | 16022 |
| NONE_OF_THE_ABOVE | l2_rqsts_rfo_miss | 30467 | 31 | 30436 |
It seems most of the RFO differences boil down to prefetching. From Enhanced REP MOVSB for memcpy:
> Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Stores
It all appears to come down to `rep movsb` not prefetching store addresses, causing fewer lines to require an RFO request. With `STORE_ONLY_REP_STOSB` we can get a better idea of where the RFO requests are saved with `rep movsb` (assuming the two are implemented similarly). With prefetching events NOT counted, we see `rep movsb` having about the exact same number of RFO requests as `rep stosb` (and the same breakdown of hits/misses). It has about ~2.5 million extra L2 references, which are fair to attribute to the loads.
What's especially interesting about the `STORE_ONLY_REP_STOSB` numbers is that they barely change between the prefetch and non-prefetch data. This makes me think that `rep stosb`, at the very least, is NOT prefetching the store address. This also corresponds with the fact that we see almost no `RFO_HITS` and almost entirely `RFO_MISSES`. The Temporal Store `memcpy`, on the other hand, IS prefetching the store address, so the original numbers were skewed in that they didn't count the store RFO requests from `vmovdqa` but counted all of them from `rep movsb`.
Another point of interest is that `STORE_ONLY_REP_STOSB` still has many RFO requests compared with `STORE_ONLY_NON_TEMPORAL`. This makes me think `rep movsb`/`rep stosb` is only saving RFO requests on stores because it is not making extra prefetches, but it is still using a temporal store that goes through the cache. One thing I am having a hard time reconciling is that the stores from `rep movsb`/`rep stosb` neither prefetch nor use non-temporal stores, yet still incur an RFO, so I am unsure how it has comparable performance.
Loads
I think `rep movsb` is prefetching loads, and it is doing a better job of it than the standard `vmovdqa` loop. If you look at the diff between `rep movsb` w/ and w/o prefetch and the diff for `LOAD_ONLY_TEMPORAL`, you see about the same pattern, with the numbers for `LOAD_ONLY_TEMPORAL` being about 20% higher for references but lower for hits. This would indicate that the `vmovdqa` loop is doing extra prefetches past the tail and prefetching less effectively. So `rep movsb` does a better job prefetching the load address (thus fewer total references and a higher hit rate).
Results
The following is what I am thinking based on the data:
- `rep movsb` does NOT optimize out RFO requests for a given load/store.
  - Maybe it is a different type of RFO request that does not require data to be sent, but I have been unable to find a counter to test this.
- `rep movsb` does not prefetch stores and does not use non-temporal stores. It thus makes fewer RFO requests for stores because it doesn't pull in unnecessary lines with prefetching.
  - Possibly it is expecting the store buffer to hide the latency of getting the lines into cache, as it knows there is never a dependency on the stored value.
  - Possibly the heuristic is that falsely invalidating another core's data is too expensive, so it doesn't want to prefetch lines into E/M state.
  - I have a hard time reconciling this with "good performance".
- `rep movsb` prefetches loads and does so better than a normal temporal load loop.
Edit4:
Using a new perf recipe to measure uncore reads/writes:
perf stat -a -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" ./rfo_test
The idea is that if `rep stosb` is sending RFO-ND then it should have about the same numbers as `movntdq`. This seems to be the case.
- TODO = STORE_ONLY_REP_STOSB
24,251,861 data_reads
52,130,870 data_writes
- TODO = STORE_ONLY_TEMPORAL
- Note: this is done with `vmovdqa ymm, (%reg)`. This is not a 64-byte store, so an RFO with data should be necessary. I did test this with `vmovdqa32 zmm, (%reg)` and saw about the same numbers. That means either 1) `zmm` stores are not optimized to skip the RFO in favor of an ItoM, or 2) these events are not indicative of what I think they are. Beware.
39,785,140 data_reads
35,225,418 data_writes
- TODO = STORE_ONLY_NON_TEMPORAL
22,680,373 data_reads
51,057,807 data_writes
One thing that is strange is that while reads are lower for `STORE_ONLY_NON_TEMPORAL` and `STORE_ONLY_REP_STOSB`, writes are higher for both of them.
There is a real name for RFO-ND: ItoM.
- RFO: for writes to part of a cache line. If the line is in 'I' state, data needs to be forwarded to the core.
- ItoM: for writes to a full cache line. If the line is in 'I' state, data does NOT need to be forwarded.
It is aggregated with RFO in `OFFCORE_REQUESTS.DEMAND_RFO`. Intel has a performance tool that seems to sample its value from an MSR, but it doesn't have support for ICL, and so far I am having trouble finding documentation for ICL. Need to investigate more into how to isolate it.
Edit5: The reason for fewer writes with `STORE_ONLY_TEMPORAL` earlier was zero-store elimination.
One of the issues with my measurement method is that the `uncore_imc` events aren't supported with the `--all-user` option. I changed up the perf recipe a bit to try and mitigate this:
perf stat -D 1000 -C 0 -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" taskset -c 0 ./rfo_test
I pin `rfo_test` to core 0 and only collect stats on core 0. As well, I only start collecting stats after the first second, and `usleep` in the benchmark until the 1-second mark after setup has completed. There is still some noise, so I included `NONE_OF_THE_ABOVE`, which is just the perf numbers from the setup/teardown of the benchmark.
- TODO = STORE_ONLY_REP_STOSB
2,951,318 data_reads
18,034,260 data_writes
- TODO = STORE_ONLY_TEMPORAL
20,021,299 data_reads
18,048,681 data_writes
- TODO = STORE_ONLY_NON_TEMPORAL
2,876,755 data_reads
18,030,816 data_writes
- TODO = NONE_OF_THE_ABOVE
2,942,999 data_reads
1,274,211 data_writes