
I have an object that is 64 bytes in size (one cache line):

typedef struct _object{
  int value;
  char pad[60];
} object;

In main, I initialize an array of these objects:

volatile object * array;
int arr_size = 1000000;
array = (object *) malloc(arr_size * sizeof(object));

for(int i=0; i < arr_size; i++){
    array[i].value = 1;
    _mm_clflush(&array[i]);
}
_mm_mfence();

Then I loop through each element again. This is the loop I am counting events for:

int tmp;
for(int i=0; i < arr_size-105; i++){
    array[i].value = 2;
    //tmp = array[i].value;
    _mm_mfence();
}

Having mfence does not make any sense here, but I was trying something else and accidentally found that if I have the store operation without mfence, I get half a million RFO requests (measured by the PAPI L2_RQSTS.ALL_RFO event), which means that the other half million were L1 hits, prefetched before the demand access. However, including mfence results in 1 million RFO requests, all of them RFO_HITs, which means that the cache lines are only prefetched into L2, not into the L1 cache anymore.

That is despite the fact that the Intel documentation seems to indicate otherwise: "data can be brought into the caches speculatively just before, during, or after the execution of an MFENCE instruction." I also checked with load operations: without mfence I get up to 2000 L1 hits, whereas with mfence I get up to 1 million L1 hits (measured with the PAPI MEM_LOAD_RETIRED.L1_HIT event). So the cache lines are still prefetched into L1 for load instructions.

So it should not be the case that including mfence blocks prefetching. Both the store and load versions take almost the same time: 5-6 ms without mfence, and 20 ms with mfence. I went through other questions regarding mfence, but they don't mention what the expected behavior is with respect to prefetching, and I don't see a good enough reason or explanation for why mfence would block prefetching into the L1 cache with store operations only. Or am I missing something in the description of mfence?

I am testing on the Skylake microarchitecture, but I also checked on Broadwell and got the same result.

Ana Khorguani
  • You're reading too much into the description. The CPU can prefetch data, but doesn't have to. Possibly the latency involved while `mfence` flushes the written data is long enough that the prefetch mechanism doesn't trigger a prefetch? – 1201ProgramAlarm May 13 '19 at 20:44
  • For the pure load test, the result makes sense; `mfence` slows it down enough for HW prefetch to keep up. On Skylake with up-to-date microcode, `mfence` blocks out-of-order exec entirely, but not on Broadwell ([Does lock xchg have the same behavior as mfence?](//stackoverflow.com/q/40409297)). Maybe interesting to try with an atomic `xchg` or `lock add` instead of `mfence` as your memory barrier. (On a single global variable or on `(%rsp)`; I think throughput = latency for those so no need to use multiple atomic vars). Or maybe not since you already checked on Broadwell. – Peter Cordes May 13 '19 at 21:32
  • What do you need `volatile` for?? – curiousguy May 16 '19 at 07:42
  • @curiousguy For stores I most likely don't, but for load operations, since I am not using the tmp value afterwards, as I understand it the compiler might remove the load entirely so it would not be executed at all. – Ana Khorguani May 16 '19 at 08:58

2 Answers


It's not L1 prefetching that causes the counter values you see: the effect remains even if you disable the L1 prefetchers. In fact, the effect remains if you disable all prefetchers except the L2 streamer:

wrmsr -a 0x1a4 "$((2#1110))"
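(In this MSR, 0x1a4, each set bit disables one prefetcher: bit 0 the L2 hardware prefetcher (streamer), bit 1 the L2 adjacent-cache-line prefetcher, bit 2 the DCU prefetcher, and bit 3 the DCU IP prefetcher, so the value 0b1110 leaves only the L2 streamer enabled.)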

If you do disable the L2 streamer, however, the counts are as you'd expect: you see roughly 1,000,000 for both L2_RQSTS.RFO_MISS and ALL_RFO, even without the mfence.

First, it is important to note that the L2_RQSTS.RFO_* events do not count RFO events originating from the L2 streamer. You can see the details here, but basically the umasks for the 0x24 (L2_RQSTS) RFO events are:

name      umask
RFO_MISS   0x22
RFO_HIT    0x42
ALL_RFO    0xE2

Note that none of the umask values have the 0x10 bit set, which indicates that events originating from the L2 streamer should be tracked.
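(In perf's raw-event notation, where config = (umask << 8) | event, these three are r2224, r4224 and re224; setting the 0x10 bit as well gives r3224, r5224 and rf224.)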

It seems that when the L2 streamer is active, many of the RFOs you might expect to be assigned to one of those events are instead "eaten" by the L2-prefetcher versions of the events. What likely happens is that the L2 prefetcher runs ahead of the request stream, and when the demand RFO comes in from L1, it finds a request already in progress from the L2 prefetcher. This increments only the umask |= 0x10 version of the event (indeed, I get 2,000,000 total references when including that bit), which means that RFO_MISS, RFO_HIT and ALL_RFO will all miss it.

It's somewhat analogous to the "fb_hit" scenario, where L1 loads neither miss nor hit exactly, but instead hit an in-progress load - the complication here being that the load was initiated by the L2 prefetcher.

The mfence just slows everything down enough that the L2 prefetcher almost always has time to bring the line all the way to L2, giving an RFO_HIT count.

I don't think the L1 prefetchers are involved here at all (shown by the fact that this works the same if you turn them off): as far as I know L1 prefetchers don't interact with stores, only loads.

Here are some useful perf commands you can use to see the difference made by including the "L2 streamer origin" bit. Here's without the L2 streamer events:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xef,name=l2_rqsts_references/,cpu/event=0x24,umask=0xe2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xc2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x22,name=l2_rqsts_rfo_miss/

and with them included:

perf stat --delay=1000 -e cpu/event=0x24,umask=0xff,name=l2_rqsts_references/,cpu/event=0x24,umask=0xf2,name=l2_rqsts_all_rfo/,cpu/event=0x24,umask=0xd2,name=l2_rqsts_rfo_hit/,cpu/event=0x24,umask=0x32,name=l2_rqsts_rfo_miss/
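If you'd rather read these counters from inside the program (PAPI's preset events don't expose the L2-streamer umask bit), a minimal perf_event_open sketch along these lines should work, assuming Linux on an Intel core; the raw config 0xF224 encodes event 0x24 with umask 0xF2, and the placement of the measured loop is illustrative:

#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

static long perf_event_open(struct perf_event_attr *attr, pid_t pid,
                            int cpu, int group_fd, unsigned long flags) {
    return syscall(SYS_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(void) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_RAW;
    attr.config = 0xF224;   /* event 0x24, umask 0xF2: all RFOs incl. the L2-streamer bit */
    attr.disabled = 1;
    attr.exclude_kernel = 1;

    int fd = perf_event_open(&attr, 0, -1, -1, 0);  /* this thread, any CPU */
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

    /* ... the store loop being measured goes here ... */

    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t count;
    read(fd, &count, sizeof(count));    /* raw count for the code between the ioctls */
    printf("rf224 count: %llu\n", (unsigned long long)count);
    return 0;
}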

I ran these against this code (with the sleep(1) lining up with the --delay=1000 option passed to perf, to exclude the init code):

#include <time.h>
#include <immintrin.h>
#include <stdio.h>
#include <stdlib.h>  /* for malloc */
#include <unistd.h>

typedef struct _object{
  int value;
  char pad[60];
} object;

int main() {
    volatile object * array;
    int arr_size = 1000000;
    array = (object *) malloc(arr_size * sizeof(object));

    for(int i=0; i < arr_size; i++){
        array[i].value = 1;
        _mm_clflush((const void*)&array[i]);
    }
    _mm_mfence();

    sleep(1);
    // printf("Starting main loop after %zu ms\n", (size_t)clock() * 1000u / CLOCKS_PER_SEC);

    int tmp;
    for(int i=0; i < arr_size-105; i++){
        array[i].value = 2;
        //tmp = array[i].value;
        // _mm_mfence();
    }
}
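(Assuming the file is saved as rfo.c: building with gcc -O2 rfo.c -o rfo and running it under the perf stat invocations above, with ./rfo appended, should reproduce these counts; the volatile qualifier keeps the compiler from optimizing the stores away.)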
BeeOnRope
  • Thank you, I tested with only the L2 hardware prefetcher enabled and got the same results. I understand now that it is similar in idea to FB_HIT. I tried the perf stat commands, and for the second one, including the L2 streamer events, I also get 2 million for l2_rqsts_references and l2_rqsts_all_rfo. Just to be sure I got it right: in this case these events show all incoming requests to the L2, and somehow also the outgoing ones too, right? – Ana Khorguani May 14 '19 at 09:33
  • @Ana - no, I think it only counts incoming requests (as long as you consider requests originating in the L2 streamer "incoming"). The reason for 2x the number of counts isn't that outgoing requests are counted but that prefetching generates ~1 additional request for each demand request. Note that you *can* see the outgoing requests too, with the offcore requests counters. – BeeOnRope May 14 '19 at 14:12
  • Ah OK, that is the case where I have the fence and I see one million ALL_RFO requests. And when I see 1/2 million ALL_RFO, i.e. requests that arrived before the prefetcher was able to start processing the data, should those be counted twice as well? – Ana Khorguani May 14 '19 at 14:17
  • @Ana, I'm not sure what you mean by "this case" but I think the high level counting happens the same in all cases: 1 count for the PF access and 1 count the for the demand access. So _total_ counts will generally be 2m in any case the PF is working. The only question is whether any of those counts show up in the *RFO events, since those don't count PF requests or demand requests that hit a line being PFed, so the total counts there vary and are timing dependent. Does it make sense? – BeeOnRope May 16 '19 at 01:00
  • Yes, this makes sense. One for the prefetcher and one for the demand. Just one question I still have: if I manage to issue the RFO request before the cache line has started to be processed by the prefetcher, it will be counted by the RFO events, and then bringing in this cache line should not be counted as prefetching, right? So in references it will be counted once as a demand RFO, but should not be counted again as prefetching. – Ana Khorguani May 16 '19 at 09:03
  • Yes, in that scenario it will be counted as an RFO, but that does not mean it will not _also_ be counted as prefetching: prefetching, when enabled, makes many requests to the L2 for lines, and it doesn't know which lines are there or not until it checks, and those checks are visible as "hit" type counts in some of the `l2_rqsts` counters. – BeeOnRope May 17 '19 at 20:00
  • ah ok, I see now. Thank you very much. – Ana Khorguani May 18 '19 at 07:15

Regarding the case with store operations, I have run the same loop on a Haswell processor in four different configurations:

  • MFENCE + E: There is an MFENCE instruction after the store. All hardware prefetchers are enabled.
  • E: There is no MFENCE. All hardware prefetchers are enabled.
  • MFENCE + D: There is an MFENCE instruction after the store. All hardware prefetchers are disabled.
  • D: There is no MFENCE. All hardware prefetchers are disabled.
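(For reference, the prefetcher configurations can be toggled through the same MSR 0x1a4 interface shown in the other answer: writing 0xf, e.g. wrmsr -a 0x1a4 0xf, disables all four hardware prefetchers, and writing 0x0 re-enables them.)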

The results are shown below, normalized by the number of stores (each store is to a different cache line). They are very deterministic across multiple runs.

                                 | MFENCE + E |      E     | MFENCE + D |      D     |
    L2_RQSTS.ALL_RFO             |    0.90    |    0.62    |    1.00    |    1.00    |
    L2_RQSTS.RFO_HIT             |    0.80    |    0.12    |    0.00    |    0.00    |
    L2_RQSTS.RFO_MISS            |    0.10    |    0.50    |    1.00    |    1.00    |
    OFFCORE_REQUESTS.DEMAND_RFO  |    0.20    |    0.88    |    1.00    |    1.00    |
    PF_L3_RFO                    |    0.00    |    0.00    |    0.00    |    0.00    |
    PF_RFO                       |    0.80    |    0.16    |    0.00    |    0.00    |
    DMND_RFO                     |    0.19    |    0.84    |    1.00    |    1.00    |

The first four events are core events and the last three events are off-core response events:

  • L2_RQSTS.ALL_RFO: Occurs for each RFO request to the L2. This includes RFO requests from stores, whether retired or not, as well as RFO requests from PREFETCHW. For the cases where the hardware prefetchers are enabled, the event count is less than expected (the normalized count would be one). One can think of two possible reasons for this: (1) somehow some of the RFOs hit in the L1, or (2) the event undercounts. We'll try to figure out which it is by examining the counts of the other events and recalling what we know about the L1D prefetchers.
  • L2_RQSTS.RFO_HIT and L2_RQSTS.RFO_MISS: Occur for an RFO that hits or misses in the L2, respectively. In all configurations, the sum of the counts of these events is exactly equal to L2_RQSTS.ALL_RFO.
  • OFFCORE_REQUESTS.DEMAND_RFO: The documentation of this event suggests that it should be the same as L2_RQSTS.RFO_MISS. However, observe that the sum of OFFCORE_REQUESTS.DEMAND_RFO and L2_RQSTS.RFO_HIT is actually equal to one (from the table: 0.20 + 0.80 in the MFENCE + E case, and 0.88 + 0.12 in the E case). Thus, it's possible that L2_RQSTS.RFO_MISS undercounts (and so L2_RQSTS.ALL_RFO does too). In fact, this is the most likely explanation, because the Intel optimization manual (and other Intel documents) say that only the L2 streamer prefetcher can track stores. The Intel performance counter manual mentions "L1D RFO prefetches" in the description of L2_RQSTS.ALL_RFO. These prefetches probably refer to RFOs from stores that have not retired yet (see the last section of the answer to Why are the user-mode L1 store miss events only counted when there is a store initialization loop?).
  • PF_L3_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L3 only. All counts of this event are zero.
  • PF_RFO: Occurs when an RFO from the L2 streamer prefetcher is triggered and the target cache structure is the L2 and possibly the L3 (if the L3 is inclusive, then the line will also be filled into the L3 as well). The count of this event is close to L2_RQSTS.RFO_HIT. In the MFENCE + E case, it seems that 100% of the RFOs have completed on time (before the demand RFO has reached the L2). In the E case, 25% of prefetches did not complete on time or the wrong lines were prefetched. The reason why the number of RFO hits in the L2 is larger in the MFENCE + E case compared to the E case is that the MFENCE instruction delays later RFOs, thereby keeping most of the L2's super queue entries available for the L2 streamer prefetcher. So MFENCE really enables the L2 streamer prefetcher to perform better. Without it, there would be many in-flight demand RFOs at the L2, leaving a small number of super queue entries for prefetching.
  • DMND_RFO: The same as OFFCORE_REQUESTS.DEMAND_RFO, but it looks like it may undercount a little.

Quoting the question: "I checked with load operations. Without mfence I get up to 2000 L1 hits, whereas with mfence I have up to 1 million L1 hits (measured with the PAPI MEM_LOAD_RETIRED.L1_HIT event). The cache lines are prefetched into L1 for load instructions."

Regarding the case with load operations: in my experience, MFENCE (or any other fence instruction) has no impact on the behavior of the hardware prefetchers. The true count of the MEM_LOAD_RETIRED.L1_HIT event here is actually very small (< 2000). Most of the events being counted come from MFENCE itself, not from the loads. MFENCE (and SFENCE) require sending a fence request all the way to the memory controller to ensure that all pending stores have reached the global observation point. A fence request is not counted as an RFO event, but it may get counted as multiple other events, including L1_HIT. For more information on this and similar observations, see my blog post: An Introduction to the Cache Hit and Miss Performance Monitoring Events.

Hadi Brais
  • Thank you, I understand the answer for the store case fully. However, for the load instruction, I was thinking that including mfence was giving the L1 prefetchers more time to bring data into the cache, as mentioned by Peter Cordes in the comments. The blog says: "At most one L1_HIT event occurs per MFENCE instruction." So if the L1_HITs I get are from the mfence instruction, then including more than one fence should increase the number of L1_HITs, which is not what I see: with 2 or 3 mfences included in the code I still get up to 1 million L1_HITs. – Ana Khorguani May 14 '19 at 09:42
  • @AnaKhorguani Right, I confused Haswell and Skylake. The blog post has a separate section on Skylake where it mentions that MFENCE does not seem to cause L1_HIT events, in contrast to Haswell. But note that adding more MFENCEs does not necessarily cause more L1_HIT events (on Haswell). I was not able to recognize a pattern (as mentioned in the blog): sometimes each MFENCE causes a single L1_HIT event and sometimes each MFENCE causes zero. So it's hard to tell. This issue doesn't seem to exist on Skylake. – Hadi Brais May 14 '19 at 16:57
  • Ok, great. I almost started to worry about all my previous results :) Thank you very much. – Ana Khorguani May 14 '19 at 18:47