See Edit3.

I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That said, AFAIK I am only seeing a reduction in RFO requests with rep movsb as compared to Temporal Store memcpy because of better prefetching on loads and no prefetching on stores, NOT because RFO requests are optimized out for full-cache-line stores. This makes some sense, as we don't see RFO requests optimized out for vmovdqa with a zmm register, which we would expect if that were really the case for full-cache-line stores. That said, the lack of prefetching on stores and the lack of non-temporal writes make it hard to see how rep movsb has reasonable performance.

Edit: It is possible that the RFO requests from rep movsb differ from those for vmovdqa in that for rep movsb they might not request data, just take the line in exclusive state. This could also be the case for stores with a zmm register. I don't see any perf metrics to test this, however. Does anyone know of any?

Questions

  1. Why am I not seeing a reduction in RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?
  2. Why am I seeing more RFO requests when I use rep movsb for memcpy as compared to a memcpy implemented with vmovdqa?

Two separate questions because I believe I should be seeing a reduction in RFO requests with rep movsb; but if that is not the case, should I really be seeing an increase?

Background

CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz

I was trying to test out the number of RFO requests when using different methods of memcpy including:

  • Temporal Stores -> vmovdqa
  • Non-Temporal Stores -> vmovntdq
  • Enhanced REP MOVSB -> rep movsb

And I have been unable to see a reduction in RFO requests using rep movsb. In fact, I have been seeing more RFO requests with rep movsb than with Temporal Stores. This is counter-intuitive given that the consensus understanding seems to be that for Ivy Bridge and newer, rep movsb is able to avoid RFO requests and in turn save memory bandwidth:

When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:

  • Avoiding the RFO request when it knows the entire cache line will be overwritten.

Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not

I wrote a simple test program to verify this but was unable to do so.

Test Program

#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))


#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif

#define load(x)     _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);
    return p;
}
void BENCH_ATTR
bench() {
    uint64_t len = 64UL * (1UL << 22);

    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);

    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }

    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;



    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovntdq or vmovdqa
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif

    asm volatile("lfence\n\tmfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}

  • Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
  • Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test

Test Data

Note: Data with less noise in edit2

  • TODO = TEMPORAL
       583,912,867      cpu-cycles
         9,352,817      l2_rqsts.all_rfo
       188,343,479      offcore_requests_outstanding.cycles_with_demand_rfo
        11,560,370      offcore_requests.demand_rfo

       0.166557783 seconds time elapsed

       0.044670000 seconds user
       0.121828000 seconds sys
  • TODO = NON_TEMPORAL
       560,933,296      cpu-cycles
         7,428,210      l2_rqsts.all_rfo
       123,174,665      offcore_requests_outstanding.cycles_with_demand_rfo
         8,402,627      offcore_requests.demand_rfo

       0.156790873 seconds time elapsed

       0.032157000 seconds user
       0.124608000 seconds sys
  • TODO = REP_MOVSB
       566,898,220      cpu-cycles
        11,626,162      l2_rqsts.all_rfo
       178,043,659      offcore_requests_outstanding.cycles_with_demand_rfo
        12,611,324      offcore_requests.demand_rfo

       0.163038739 seconds time elapsed

       0.040749000 seconds user
       0.122248000 seconds sys
  • TODO = NONE_OF_THE_ABOVE
       521,061,304      cpu-cycles
         7,527,122      l2_rqsts.all_rfo
       123,132,321      offcore_requests_outstanding.cycles_with_demand_rfo
         8,426,613      offcore_requests.demand_rfo

       0.139873929 seconds time elapsed

       0.007991000 seconds user
       0.131854000 seconds sys

Test Results

The baseline, with just the setup but without the memcpy, is TODO = NONE_OF_THE_ABOVE at 7,527,122 RFO requests.

With TODO = TEMPORAL (using vmovdqa) we can see 9,352,817 RFO requests. This is lower than with TODO = REP_MOVSB (using rep movsb) which has 11,626,162 RFO requests. ~2 million more RFO requests with rep movsb than with Temporal Stores. The only case I was able to see RFO requests avoided was the TODO = NON_TEMPORAL (using vmovntdq) which has 7,428,210 RFO requests, about the same as the baseline indicating none from the memcpy itself.

I played around with different sizes for memcpy thinking I might need to decrease / increase the size for rep movsb to make that optimization but I have been seeing the same general results. For all sizes I tested I see the number of RFO requests in the following order NON_TEMPORAL < TEMPORAL < REP_MOVSB.

Theories

  • [Unlikely] Something new on Icelake?

Edit: @PeterCordes was able to reproduce the results on Skylake

I don't think this is an Icelake specific thing as the only changes I could find in the Intel Manual on rep movsb for Icelake are:

Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID.(EAX=07H, ECX=0H):EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.

Which should not be playing a factor in the test program I am using given that len is well above 128.

  • [Likelier] My test program is broken

I don't see any issues, but this is a very surprising result. At the very least, I verified that the compiler is not optimizing out the tests here.

Edit: Fixed build instructions to use G++ instead of GCC and the file suffix from .c to .cc

Edit2:

Back to C and GCC.

  • Better Perf Recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test

Numbers with better perf recipe (same trend but less noise):

  • TODO = TEMPORAL
       161,214,341      cpu-cycles                                                  
         1,984,998      l2_rqsts.all_rfo                                            
        61,238,129      offcore_requests_outstanding.cycles_with_demand_rfo                                   
         3,161,504      offcore_requests.demand_rfo                                   

       0.169413413 seconds time elapsed

       0.044371000 seconds user
       0.125045000 seconds sys
  • TODO = NON_TEMPORAL
       142,689,742      cpu-cycles                                                  
             3,106      l2_rqsts.all_rfo                                            
             4,581      offcore_requests_outstanding.cycles_with_demand_rfo                                   
                30      offcore_requests.demand_rfo                                   

       0.166300952 seconds time elapsed

       0.032462000 seconds user
       0.133907000 seconds sys
  • TODO = REP_MOVSB
       150,630,752      cpu-cycles                                                  
         4,194,202      l2_rqsts.all_rfo                                            
        54,764,929      offcore_requests_outstanding.cycles_with_demand_rfo                                   
         4,194,016      offcore_requests.demand_rfo                                   

       0.166844489 seconds time elapsed

       0.036620000 seconds user
       0.130205000 seconds sys
  • TODO = NONE_OF_THE_ABOVE
        89,611,571      cpu-cycles                                                  
               321      l2_rqsts.all_rfo                                            
             3,936      offcore_requests_outstanding.cycles_with_demand_rfo                                   
                19      offcore_requests.demand_rfo                                   

       0.142347046 seconds time elapsed

       0.016264000 seconds user
       0.126046000 seconds sys

Edit3: This may have to do with hiding RFO events triggered by the L2 Prefetcher

I used the perf recipe @BeeOnRope made that includes RFO events started by the L2 Prefetcher:

perf stat --all-user -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/ ./rfo_test

And the equivalent perf recipe without L2 Prefetch events:

perf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/ ./rfo_test

And got more reasonable results:

TL;DR: with prefetch-triggered events included, we see fewer RFO requests with rep movsb. But it does not appear that rep movsb actually avoids RFO requests; rather, it just touches fewer cache lines.

Data With and Without Prefetch Triggered Events Included

TODO                      Perf Event              w/ Prefetching  w/o Prefetching       Difference
------------------------  ---------------------  ---------------  ---------------  ---------------
TEMPORAL                  l2_rqsts_references         16,812,993        4,358,692       12,454,301
TEMPORAL                  l2_rqsts_all_rfo            14,443,392        1,981,560       12,461,832
TEMPORAL                  l2_rqsts_rfo_hit             1,297,932        1,038,243          259,689
TEMPORAL                  l2_rqsts_rfo_miss           13,145,460          943,317       12,202,143
------------------------  ---------------------  ---------------  ---------------  ---------------
NON_TEMPORAL              l2_rqsts_references          8,820,287        1,946,591        6,873,696
NON_TEMPORAL              l2_rqsts_all_rfo             6,852,605              346        6,852,259
NON_TEMPORAL              l2_rqsts_rfo_hit                66,845              317           66,528
NON_TEMPORAL              l2_rqsts_rfo_miss            6,785,760               29        6,785,731
------------------------  ---------------------  ---------------  ---------------  ---------------
REP_MOVSB                 l2_rqsts_references         11,856,549        7,400,277        4,456,272
REP_MOVSB                 l2_rqsts_all_rfo             8,633,330        4,194,510        4,438,820
REP_MOVSB                 l2_rqsts_rfo_hit             1,394,372              546        1,393,826
REP_MOVSB                 l2_rqsts_rfo_miss            7,238,958        4,193,964        3,044,994
------------------------  ---------------------  ---------------  ---------------  ---------------
LOAD_ONLY_TEMPORAL        l2_rqsts_references          6,058,269          619,924        5,438,345
LOAD_ONLY_TEMPORAL        l2_rqsts_all_rfo             5,103,905              337        5,103,568
LOAD_ONLY_TEMPORAL        l2_rqsts_rfo_hit               438,518              311          438,207
LOAD_ONLY_TEMPORAL        l2_rqsts_rfo_miss            4,665,387               26        4,665,361
------------------------  ---------------------  ---------------  ---------------  ---------------
STORE_ONLY_TEMPORAL       l2_rqsts_references          8,069,068          837,616        7,231,452
STORE_ONLY_TEMPORAL       l2_rqsts_all_rfo             8,033,854          802,969        7,230,885
STORE_ONLY_TEMPORAL       l2_rqsts_rfo_hit               585,938          576,955            8,983
STORE_ONLY_TEMPORAL       l2_rqsts_rfo_miss            7,447,916          226,014        7,221,902
------------------------  ---------------------  ---------------  ---------------  ---------------
STORE_ONLY_REP_STOSB      l2_rqsts_references          4,296,169        4,228,643           67,526
STORE_ONLY_REP_STOSB      l2_rqsts_all_rfo             4,261,756        4,194,548           67,208
STORE_ONLY_REP_STOSB      l2_rqsts_rfo_hit                17,337              309           17,028
STORE_ONLY_REP_STOSB      l2_rqsts_rfo_miss            4,244,419        4,194,239           50,180
------------------------  ---------------------  ---------------  ---------------  ---------------
STORE_ONLY_NON_TEMPORAL   l2_rqsts_references             99,713           36,112           63,601
STORE_ONLY_NON_TEMPORAL   l2_rqsts_all_rfo                64,148              427           63,721
STORE_ONLY_NON_TEMPORAL   l2_rqsts_rfo_hit                17,091              398           16,693
STORE_ONLY_NON_TEMPORAL   l2_rqsts_rfo_miss               47,057               29           47,028
------------------------  ---------------------  ---------------  ---------------  ---------------
NONE_OF_THE_ABOVE         l2_rqsts_references             74,074           27,656           46,418
NONE_OF_THE_ABOVE         l2_rqsts_all_rfo                46,833              375           46,458
NONE_OF_THE_ABOVE         l2_rqsts_rfo_hit                16,366              344           16,022
NONE_OF_THE_ABOVE         l2_rqsts_rfo_miss               30,467               31           30,436

It seems most of the RFO differences boil down to prefetching. From Enhanced REP MOVSB for memcpy:

Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.

Stores

It all appears to come down to rep movsb not prefetching store addresses, causing fewer lines to require an RFO request. With STORE_ONLY_REP_STOSB we can get a better idea of where the RFO requests are saved with rep movsb (assuming the two are implemented similarly). With prefetching events NOT counted, we see rep movsb having almost exactly the same number of RFO requests as rep stosb (and the same breakdown of HITS/MISSES). It has about ~3 million extra L2 references, which are fair to attribute to the loads.

What's especially interesting about the STORE_ONLY_REP_STOSB numbers is that they barely change between the prefetch and non-prefetch data. This makes me think that rep stosb, at the very least, is NOT prefetching the store address. This also corresponds with the fact that we see almost no RFO_HITs and almost entirely RFO_MISSes. Temporal Store memcpy, on the other hand, IS prefetching the store address, so the original numbers were skewed: they didn't count the store RFO requests from vmovdqa but counted all of them from rep movsb.

Another point of interest is that STORE_ONLY_REP_STOSB still has many RFO requests compared with STORE_ONLY_NON_TEMPORAL. This makes me think rep movsb/rep stosb is only saving RFO requests on stores because it is not making extra prefetches, but it is still using a temporal store that goes through the cache. One thing I am having a hard time reconciling is that the stores from rep movsb/rep stosb seem to neither prefetch nor use non-temporal stores (which would avoid the RFO), so I am unsure how it has comparable performance.

Loads

I think rep movsb is prefetching loads, and it is doing a better job of it than the standard vmovdqa loop. If you look at the diff between rep movsb w/ and w/o prefetch and the diff for LOAD_ONLY_TEMPORAL, you see about the same pattern, with the numbers for LOAD_ONLY_TEMPORAL being about 20% higher for references but lower for hits. This would indicate that the vmovdqa loop is doing extra prefetches past the tail and prefetching less effectively. So rep movsb does a better job prefetching the load address (thus fewer total references and a higher hit rate).

Results

The following is what I am thinking from the data:

  • rep movsb does NOT optimize out RFO requests for a given load/store
    • Maybe it's a different type of RFO request that does not require data to be sent, but I have been unable to find a counter to test this.
  • rep movsb does not prefetch stores and does not use non-temporal stores. It thus issues fewer RFO requests for stores because it doesn't pull in unnecessary lines with prefetching.
    • Possibly it is expecting the store buffer to hide the latency of getting the lines into cache, as it knows that there is never a dependency on the stored value.
    • Possibly the heuristic is that a false invalidation of another core's data is too expensive, so it doesn't want to prefetch lines into E/M state.
    • I have a hard time reconciling this with "good performance"
  • rep movsb is prefetching loads and does so better than a normal temporal load loop.

Edit4:

Using new perf recipe to measure uncore reads / writes:

perf stat -a -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" ./rfo_test

The idea is that if rep stosb sends RFO-ND, then it should have about the same numbers as vmovntdq. This seems to be the case.

  • TODO = STORE_ONLY_REP_STOSB
        24,251,861      data_reads                                                  
        52,130,870      data_writes                                                 
  • TODO = STORE_ONLY_TEMPORAL
    • Note: this is done with vmovdqa ymm, (%reg). This is not a 64-byte store, so an RFO w/ data should be necessary. I did test this with vmovdqa32 zmm, (%reg) and saw about the same numbers. That means either 1) zmm stores are not optimized to skip the RFO in favor of an ItoM, or 2) these events are not indicative of what I think they are. Beware.
        39,785,140      data_reads                                                  
        35,225,418      data_writes                                                 
  • TODO = STORE_ONLY_NON_TEMPORAL
        22,680,373      data_reads                                                  
        51,057,807      data_writes                                                 

One thing that is strange is that while reads are lower for STORE_ONLY_NON_TEMPORAL and STORE_ONLY_REP_STOSB, writes are higher for both of them.

There is a real name for RFO-ND: ItoM.

  • RFO: for writes to part of a cache line. If the line is in 'I' state, data needs to be forwarded to the requester.
  • ItoM: for writes to a full cache line. If the line is in 'I' state, data does NOT need to be forwarded.

It's aggregated with RFO in OFFCORE_REQUESTS.DEMAND_RFO. Intel has a performance tool that seems to sample its value from an MSR, but it doesn't have support for ICL, and so far I am having trouble finding documentation for ICL. Need to investigate more into how to isolate it.

Edit5: The reason for fewer writes with STORE_ONLY_TEMPORAL earlier was zero-store elimination.

One issue with my measurement method is that the uncore_imc events aren't supported with the --all-user option. I changed up the perf recipe a bit to try to mitigate this:

perf stat -D 1000 -C 0 -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" taskset -c 0  ./rfo_test

I pin rfo_test to core 0 and only collect stats on core 0. As well, I only start collecting stats after the first second, and usleep in the benchmark until the 1-second mark after setup has completed. There is still some noise, so I included NONE_OF_THE_ABOVE, which is just the perf numbers from the setup/teardown of the benchmark.

  • TODO = STORE_ONLY_REP_STOSB
         2,951,318      data_reads                                                  
        18,034,260      data_writes
  • TODO = STORE_ONLY_TEMPORAL
        20,021,299      data_reads                                                  
        18,048,681      data_writes
  • TODO = STORE_ONLY_NON_TEMPORAL
         2,876,755      data_reads                                                  
        18,030,816      data_writes
  • TODO = NONE_OF_THE_ABOVE
         2,942,999      data_reads                                                  
         1,274,211      data_writes