Check out Edit3 below.
I was getting the wrong results because I was measuring without including prefetch-triggered events, as discussed here. That being said, AFAIK I only see a reduction in RFO requests with `rep movsb` as compared to a Temporal Store `memcpy` because of better prefetching on loads and no prefetching on stores, NOT due to RFO requests being optimized out for full-cache-line stores. This kind of makes sense, as we don't see RFO requests optimized out for `vmovdqa` with a `zmm` register, which we would expect if that were really the case for full-cache-line stores. That being said, the lack of prefetching on stores and the lack of non-temporal writes makes it hard to see how `rep movsb` achieves reasonable performance.
Edit: It is possible that the RFO requests from `rep movsb` differ from those for `vmovdqa` in that for `rep movsb` it might not request data, just take the line in Exclusive state. This could also be the case for stores with a `zmm` register. I don't see any perf metrics to test this, however. Does anyone know of any?
Questions
- Why am I not seeing a reduction in RFO requests when I use `rep movsb` for `memcpy` as compared to a `memcpy` implemented with `vmovdqa`?
- Why am I seeing more RFO requests when I use `rep movsb` for `memcpy` as compared to a `memcpy` implemented with `vmovdqa`?
These are two separate questions because I believe I should be seeing a reduction in RFO requests with `rep movsb`, but if that is not the case, should I be seeing an increase as well?
Background
CPU - Icelake: Intel(R) Core(TM) i7-1065G7 CPU @ 1.30GHz
I was trying to test the number of RFO requests when using different methods of `memcpy`, including:
- Temporal Stores -> `vmovdqa`
- Non-Temporal Stores -> `vmovntdq`
- Enhanced REP MOVSB -> `rep movsb`
And I have been unable to see a reduction in RFO requests using `rep movsb`. In fact, I have been seeing more RFO requests with `rep movsb` than with Temporal Stores. This is counter-intuitive given that the consensus understanding seems to be that for Ivy Bridge and newer, `rep movsb` is able to avoid RFO requests and in turn save memory bandwidth:
> When a rep movs instruction is issued, the CPU knows that an entire block of a known size is to be transferred. This can help it optimize the operation in a way that it cannot with discrete instructions, for example:
>
> - Avoiding the RFO request when it knows the entire cache line will be overwritten.

> Note that on Ivybridge and Haswell, with buffers too large to fit in MLC you can beat movntdqa using rep movsb; movntdqa incurs an RFO into LLC, rep movsb does not
I wrote a simple test program to verify this but was unable to do so.
Test Program
#include <assert.h>
#include <errno.h>
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define BENCH_ATTR __attribute__((noinline, noclone, aligned(4096)))

#define TEMPORAL          0
#define NON_TEMPORAL      1
#define REP_MOVSB         2
#define NONE_OF_THE_ABOVE 3

#define TODO 1

#if TODO == NON_TEMPORAL
#define store(x, y) _mm256_stream_si256((__m256i *)(x), y)
#else
#define store(x, y) _mm256_store_si256((__m256i *)(x), y)
#endif
#define load(x) _mm256_load_si256((__m256i *)(x))

void *
mmapw(uint64_t sz) {
    void * p = mmap(NULL, sz, PROT_READ | PROT_WRITE,
                    MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
    assert(p != MAP_FAILED);  // mmap returns MAP_FAILED, not NULL, on error
    return p;
}

void BENCH_ATTR
bench() {
    uint64_t len       = 64UL * (1UL << 22);
    uint64_t len_alloc = len;
    char *   dst_alloc = (char *)mmapw(len);
    char *   src_alloc = (char *)mmapw(len);
    for (uint64_t i = 0; i < len; i += 4096) {
        // page in before testing. perf metrics appear to still come through
        dst_alloc[i] = 0;
        src_alloc[i] = 0;
    }
    uint64_t dst     = (uint64_t)dst_alloc;
    uint64_t src     = (uint64_t)src_alloc;
    uint64_t dst_end = dst + len;

    asm volatile("lfence" : : : "memory");
#if TODO == REP_MOVSB
    // test rep movsb
    asm volatile("rep movsb" : "+D"(dst), "+S"(src), "+c"(len) : : "memory");
#elif TODO == TEMPORAL || TODO == NON_TEMPORAL
    // test vmovntdq or vmovdqa
    for (; dst < dst_end;) {
        __m256i lo = load(src);
        __m256i hi = load(src + 32);
        store(dst, lo);
        store(dst + 32, hi);
        dst += 64;
        src += 64;
    }
#endif
    asm volatile("lfence\n\tmfence" : : : "memory");

    assert(!munmap(dst_alloc, len_alloc));
    assert(!munmap(src_alloc, len_alloc));
}

int
main(int argc, char ** argv) {
    bench();
}
- Build (assuming file name is rfo_test.c):
gcc -O3 -march=native -mtune=native rfo_test.c -o rfo_test
- Run (assuming executable is rfo_test):
perf stat -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Test Data
Note: data with less noise can be found in Edit2.
- TODO = TEMPORAL
583,912,867 cpu-cycles
9,352,817 l2_rqsts.all_rfo
188,343,479 offcore_requests_outstanding.cycles_with_demand_rfo
11,560,370 offcore_requests.demand_rfo
0.166557783 seconds time elapsed
0.044670000 seconds user
0.121828000 seconds sys
- TODO = NON_TEMPORAL
560,933,296 cpu-cycles
7,428,210 l2_rqsts.all_rfo
123,174,665 offcore_requests_outstanding.cycles_with_demand_rfo
8,402,627 offcore_requests.demand_rfo
0.156790873 seconds time elapsed
0.032157000 seconds user
0.124608000 seconds sys
- TODO = REP_MOVSB
566,898,220 cpu-cycles
11,626,162 l2_rqsts.all_rfo
178,043,659 offcore_requests_outstanding.cycles_with_demand_rfo
12,611,324 offcore_requests.demand_rfo
0.163038739 seconds time elapsed
0.040749000 seconds user
0.122248000 seconds sys
- TODO = NONE_OF_THE_ABOVE
521,061,304 cpu-cycles
7,527,122 l2_rqsts.all_rfo
123,132,321 offcore_requests_outstanding.cycles_with_demand_rfo
8,426,613 offcore_requests.demand_rfo
0.139873929 seconds time elapsed
0.007991000 seconds user
0.131854000 seconds sys
Test Results
The baseline for RFO requests, with just the setup but without the `memcpy`, is `TODO = NONE_OF_THE_ABOVE` with 7,527,122 RFO requests.
With `TODO = TEMPORAL` (using `vmovdqa`) we see 9,352,817 RFO requests. This is lower than `TODO = REP_MOVSB` (using `rep movsb`), which has 11,626,162 RFO requests: roughly 2 million more RFO requests with `rep movsb` than with Temporal Stores. The only case where I was able to see RFO requests avoided was `TODO = NON_TEMPORAL` (using `vmovntdq`), which has 7,428,210 RFO requests, about the same as the baseline, indicating none from the `memcpy` itself.
I played around with different sizes for the `memcpy`, thinking I might need to decrease or increase the size for `rep movsb` to trigger that optimization, but I have been seeing the same general results. For all sizes I tested, the number of RFO requests comes in the following order: `NON_TEMPORAL` < `TEMPORAL` < `REP_MOVSB`.
Theories
- [Unlikely] Something new on Icelake?
Edit: @PeterCordes was able to reproduce the results on Skylake.
I don't think this is an Icelake-specific thing, as the only change I could find in the Intel Manual on `rep movsb` for Icelake is:
> Beginning with processors based on Ice Lake Client microarchitecture, REP MOVSB performance of short operations is enhanced. The enhancement applies to string lengths between 1 and 128 bytes long. Support for fast-short REP MOVSB is enumerated by the CPUID feature flag: CPUID [EAX=7H, ECX=0H).EDX.FAST_SHORT_REP_MOVSB[bit 4] = 1. There is no change in the REP STOS performance.
This should not be playing a factor in my test program given that `len` is well above 128.
- [Likelier] My test program is broken
I don't see any issues, but this is a very surprising result. At the very least I verified that the compiler is not optimizing out the tests here.
Edit: Fixed build instructions to use `g++` instead of `gcc` and the file postfix from `.c` to `.cc`.
Edit2:
Back to C and `gcc`.
- Better perf recipe:
perf stat --all-user -e cpu-cycles -e l2_rqsts.all_rfo -e offcore_requests_outstanding.cycles_with_demand_rfo -e offcore_requests.demand_rfo ./rfo_test
Numbers with the better perf recipe (same trend but less noise):
- TODO = TEMPORAL
161,214,341 cpu-cycles
1,984,998 l2_rqsts.all_rfo
61,238,129 offcore_requests_outstanding.cycles_with_demand_rfo
3,161,504 offcore_requests.demand_rfo
0.169413413 seconds time elapsed
0.044371000 seconds user
0.125045000 seconds sys
- TODO = NON_TEMPORAL
142,689,742 cpu-cycles
3,106 l2_rqsts.all_rfo
4,581 offcore_requests_outstanding.cycles_with_demand_rfo
30 offcore_requests.demand_rfo
0.166300952 seconds time elapsed
0.032462000 seconds user
0.133907000 seconds sys
- TODO = REP_MOVSB
150,630,752 cpu-cycles
4,194,202 l2_rqsts.all_rfo
54,764,929 offcore_requests_outstanding.cycles_with_demand_rfo
4,194,016 offcore_requests.demand_rfo
0.166844489 seconds time elapsed
0.036620000 seconds user
0.130205000 seconds sys
- TODO = NONE_OF_THE_ABOVE
89,611,571 cpu-cycles
321 l2_rqsts.all_rfo
3,936 offcore_requests_outstanding.cycles_with_demand_rfo
19 offcore_requests.demand_rfo
0.142347046 seconds time elapsed
0.016264000 seconds user
0.126046000 seconds sys
Edit3: This may have to do with hiding RFO events triggered by the L2 prefetcher.
I used the perf recipe @BeeOnRope made that includes RFO events started by the L2 prefetcher:
perf stat --all-user -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/ ./rfo_test
And the equivalent perf recipe without L2 prefetch events:
perf stat --all-user -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/ ./rfo_test
And got more reasonable results:
Tl;dr: with prefetch-triggered events counted, we see fewer RFO requests with `rep movsb`. But it does not appear that `rep movsb` actually avoids RFO requests; rather, it just touches fewer cache lines.
Data With and Without Prefetch Triggered Events Included
| TODO = | Perf Event | w/ Prefetching | w/o Prefetching | Difference |
|---|---|---|---|---|
| TEMPORAL | l2_rqsts_references | 16812993 | 4358692 | 12454301 |
| TEMPORAL | l2_rqsts_all_rfo | 14443392 | 1981560 | 12461832 |
| TEMPORAL | l2_rqsts_rfo_hit | 1297932 | 1038243 | 259689 |
| TEMPORAL | l2_rqsts_rfo_miss | 13145460 | 943317 | 12202143 |
| NON_TEMPORAL | l2_rqsts_references | 8820287 | 1946591 | 6873696 |
| NON_TEMPORAL | l2_rqsts_all_rfo | 6852605 | 346 | 6852259 |
| NON_TEMPORAL | l2_rqsts_rfo_hit | 66845 | 317 | 66528 |
| NON_TEMPORAL | l2_rqsts_rfo_miss | 6785760 | 29 | 6785731 |
| REP_MOVSB | l2_rqsts_references | 11856549 | 7400277 | 4456272 |
| REP_MOVSB | l2_rqsts_all_rfo | 8633330 | 4194510 | 4438820 |
| REP_MOVSB | l2_rqsts_rfo_hit | 1394372 | 546 | 1393826 |
| REP_MOVSB | l2_rqsts_rfo_miss | 7238958 | 4193964 | 3044994 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_references | 6058269 | 619924 | 5438345 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_all_rfo | 5103905 | 337 | 5103568 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_rfo_hit | 438518 | 311 | 438207 |
| LOAD_ONLY_TEMPORAL | l2_rqsts_rfo_miss | 4665387 | 26 | 4665361 |
| STORE_ONLY_TEMPORAL | l2_rqsts_references | 8069068 | 837616 | 7231452 |
| STORE_ONLY_TEMPORAL | l2_rqsts_all_rfo | 8033854 | 802969 | 7230885 |
| STORE_ONLY_TEMPORAL | l2_rqsts_rfo_hit | 585938 | 576955 | 8983 |
| STORE_ONLY_TEMPORAL | l2_rqsts_rfo_miss | 7447916 | 226014 | 7221902 |
| STORE_ONLY_REP_STOSB | l2_rqsts_references | 4296169 | 4228643 | 67526 |
| STORE_ONLY_REP_STOSB | l2_rqsts_all_rfo | 4261756 | 4194548 | 67208 |
| STORE_ONLY_REP_STOSB | l2_rqsts_rfo_hit | 17337 | 309 | 17028 |
| STORE_ONLY_REP_STOSB | l2_rqsts_rfo_miss | 4244419 | 4194239 | 50180 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_references | 99713 | 36112 | 63601 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_all_rfo | 64148 | 427 | 63721 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_hit | 17091 | 398 | 16693 |
| STORE_ONLY_NON_TEMPORAL | l2_rqsts_rfo_miss | 47057 | 29 | 47028 |
| NONE_OF_THE_ABOVE | l2_rqsts_references | 74074 | 27656 | 46418 |
| NONE_OF_THE_ABOVE | l2_rqsts_all_rfo | 46833 | 375 | 46458 |
| NONE_OF_THE_ABOVE | l2_rqsts_rfo_hit | 16366 | 344 | 16022 |
| NONE_OF_THE_ABOVE | l2_rqsts_rfo_miss | 30467 | 31 | 30436 |
It seems most of the RFO differences boil down to prefetching. From Enhanced REP MOVSB for memcpy:
> Issuing prefetch requests immediately and exactly. Hardware prefetching does a good job at detecting memcpy-like patterns, but it still takes a couple of reads to kick in and will "over-prefetch" many cache lines beyond the end of the copied region. rep movsb knows exactly the region size and can prefetch exactly.
Stores
It all appears to come down to `rep movsb` not prefetching store addresses, causing fewer lines to require an RFO request. With `STORE_ONLY_REP_STOSB` we can get a better idea of where the RFO requests are saved with `rep movsb` (assuming the two are implemented similarly). With prefetching events NOT counted, we see `rep movsb` having about the exact same number of RFO requests as `rep stosb` (and the same breakdown of hits/misses). It has about ~2.5 million extra L2 references, which are fair to attribute to the loads.
What's especially interesting about the `STORE_ONLY_REP_STOSB` numbers is that they barely change between the prefetch and non-prefetch data. This makes me think that `rep stosb`, at the very least, is NOT prefetching the store address. This also corresponds with the fact that we see almost no `RFO_HITS` and almost entirely `RFO_MISSES`. The Temporal Store `memcpy`, on the other hand, IS prefetching the store address, so the original numbers were skewed in that they didn't count the store RFO requests from `vmovdqa` but counted all of them from `rep movsb`.
Another point of interest is that `STORE_ONLY_REP_STOSB` still has many RFO requests compared with `STORE_ONLY_NON_TEMPORAL`. This makes me think `rep movsb`/`rep stosb` is only saving RFO requests on stores because it is not making extra prefetches, but it is still using a temporal store that goes through the cache. One thing I am having a hard time reconciling is that the stores from `rep movsb`/`rep stosb` neither prefetch nor use non-temporal stores, yet still incur an RFO, so I am unsure how it has comparable performance.
Loads
I think `rep movsb` is prefetching loads, and it is doing a better job of it than the standard `vmovdqa` loop. If you look at the diff between `rep movsb` w/ and w/o prefetch and the diff for `LOAD_ONLY_TEMPORAL`, you see about the same pattern, with the numbers for `LOAD_ONLY_TEMPORAL` being about 20% higher for references but lower for hits. This would indicate that the `vmovdqa` loop is doing extra prefetches past the tail and prefetching less effectively. So `rep movsb` does a better job prefetching the load address (thus fewer total references and a higher hit rate).
Results
The following is what I am thinking based on the data:
- `rep movsb` does NOT optimize out RFO requests for a given load/store.
  - Maybe it is a different type of RFO request that does not require data to be sent, but I have been unable to find a counter to test this.
- `rep movsb` does not prefetch stores and does not use non-temporal stores. It thus makes fewer RFO requests for stores because it doesn't pull in unnecessary lines with prefetching.
  - Possibly it is expecting the store buffer to hide the latency of getting the lines into cache, as it knows there is never a dependency on the stored value.
  - Possibly the heuristic is that falsely invalidating another core's data is too expensive, so it doesn't want to prefetch lines into E/M state.
  - I have a hard time reconciling this with "good performance".
- `rep movsb` prefetches loads and does so better than a normal temporal load loop.
Edit4:
Using a new perf recipe to measure uncore reads/writes:
perf stat -a -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" ./rfo_test
The idea is that if `rep stosb` is sending RFO-ND then it should have about the same numbers as `movntdq`. This seems to be the case.
- TODO = STORE_ONLY_REP_STOSB
24,251,861 data_reads
52,130,870 data_writes
- TODO = STORE_ONLY_TEMPORAL
- Note: this is done with `vmovdqa ymm, (%reg)`. This is not a 64-byte store, so an RFO with data should be necessary. I did test this with `vmovdqa32 zmm, (%reg)` and saw about the same numbers. That means either 1) `zmm` stores are not optimized to skip the RFO in favor of an ItoM, or 2) these events are not indicative of what I think they are. Beware.
39,785,140 data_reads
35,225,418 data_writes
- TODO = STORE_ONLY_NON_TEMPORAL
22,680,373 data_reads
51,057,807 data_writes
One thing that is strange is that while reads are lower for `STORE_ONLY_NON_TEMPORAL` and `STORE_ONLY_REP_STOSB`, writes are higher for both of them.
There is a real name for RFO-ND: ItoM.
- RFO: for writes to part of a cache line. If the line is in 'I' state, data needs to be forwarded to the core.
- ItoM: for writes to a full cache line. If the line is in 'I' state, data does NOT need to be forwarded.
It is aggregated with RFO in `OFFCORE_REQUESTS.DEMAND_RFO`. Intel has a performance tool that seems to sample its value from an MSR, but it doesn't have support for ICL, and so far I am having trouble finding documentation for ICL. Need to investigate more into how to isolate it.
Edit5: The reason for fewer writes with `STORE_ONLY_TEMPORAL` earlier was zero-store elimination.
One of the issues with my measurement method is that the `uncore_imc` events aren't supported with the `--all-user` option. I changed up the perf recipe a bit to try and mitigate this:
perf stat -D 1000 -C 0 -e "uncore_imc/event=0x01,name=data_reads/" -e "uncore_imc/event=0x02,name=data_writes/" taskset -c 0 ./rfo_test
I pin `rfo_test` to core 0 and only collect stats on core 0. As well, I only start collecting stats after the first second, and `usleep` in the benchmark until the 1-second mark after setup has completed. There is still some noise, so I included `NONE_OF_THE_ABOVE`, which is just the perf numbers from the setup/teardown of the benchmark.
- TODO = STORE_ONLY_REP_STOSB
2,951,318 data_reads
18,034,260 data_writes
- TODO = STORE_ONLY_TEMPORAL
20,021,299 data_reads
18,048,681 data_writes
- TODO = STORE_ONLY_NON_TEMPORAL
2,876,755 data_reads
18,030,816 data_writes
- TODO = NONE_OF_THE_ABOVE
2,942,999 data_reads
1,274,211 data_writes