
Related to Understanding `_mm_prefetch`.

I understood that `_mm_prefetch()` causes the requested value to be fetched into the processor's cache, and that my code keeps executing while the pre-fetch happens in the background.

However, my VS2017 profiler shows 5.7% of the time spent on the line that accesses my cache and 8.63% on the `_mm_prefetch` line. Is the profiler mistaken? If I am waiting for the data to be fetched anyway, what do I need the prefetch for? I could just as well wait in the next function call, when I actually need the data...

On the other hand, the overall timing shows significant benefit of that prefetch call.

So the question is: is the data being fetched asynchronously?

Additional information.

I have multiple caches for various key widths, up to 32-bit keys (which is what I am currently profiling). The cache access and the pre-fetch are extracted into separate `__declspec(noinline)` functions to isolate them from the surrounding code:

uint8_t* cache[33];

__declspec(noinline)
uint8_t get_cached(uint8_t* address) {
    return *address;
}

__declspec(noinline)
void prefetch(uint8_t* pcache) {
    _mm_prefetch((const char*)pcache, _MM_HINT_T0);
}

int foo(const uint64_t seq64) {
    uint64_t key = seq64 & 0xFFFFFFFF;
    uint8_t* pcache = cache[32];
    int x = get_cached(pcache + key);
    key = (key * 2) & 0xFFFFFFFF;
    pcache += key;
    prefetch(pcache);
    // code that uses x
    return x;
}

The profiler shows 7.22% for the `int x = get_cached(pcache + key);` line and 8.97% for `prefetch(pcache);`, while the surrounding code shows 0.40-0.45% per line.
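For context, here is a minimal, self-contained sketch (hypothetical `table`/`keys` names, not the exact code above) of the prefetch-ahead pattern this question is built around: do the demand load for the current key while issuing the prefetch for the next one, so the memory fetch overlaps the work:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Hypothetical helper: process keys[i] while prefetching the line
// needed for keys[i + 1], overlapping the fetch with the current work.
static uint64_t sum_with_prefetch(const uint8_t* table,
                                  const std::vector<uint32_t>& keys) {
    uint64_t sum = 0;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i + 1 < keys.size())  // start the next fetch early...
            _mm_prefetch(reinterpret_cast<const char*>(table + keys[i + 1]),
                         _MM_HINT_T0);
        sum += table[keys[i]];    // ...while this access (hopefully) hits
    }
    return sum;
}
```

The result is identical with or without the `_mm_prefetch` call; the hint only affects timing, which is what makes the profiler numbers above surprising.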

    `prefetcht0` can retire even if the data hasn't arrived yet. I think they can cause a TLB miss -> page walk, though, which might stall their retirement. Otherwise IDK why a prefetch instruction would be getting a lot of hits for a `cycles` event. Are you sure it's not other address calculation or loads to prepare an address for `prefetcht0`? e.g. loading a pointer from memory, and that load misses in cache? You're talking about C++ source, not asm, but haven't shown what's in that source line. – Peter Cordes Jan 22 '21 at 20:22
  • @PeterCordes - thank you! I have isolated all address calculations and updated question with the code. – Vlad Feinstein Jan 22 '21 at 23:46
  • It looks like you're doing a regular demand-load from `pcache+key` right before the prefetch. Could they often be in the same cache line, or is `key` large enough for that not to happen? – Peter Cordes Jan 23 '21 at 00:55
  • `prefetcht0` might need a load-buffer entry; it might be stalling in the issue/rename/alloc stage (entering the back-end) if all the load buffers are exhausted. That could maybe lead to it getting counts for the cycles event. Although IIRC usually cycles blames the *oldest* un-retired instruction in the ROB. (Out of all the in-flight instructions when it takes a sample). Probably a good idea to investigate this code and see if you can figure out which if any of the ROB / RS / load buffers are full or empty when reaching this code by using other events. – Peter Cordes Jan 23 '21 at 00:58
  • Your CPU isn't an IvyBridge is it? Xeon Exxxx v2, or i7-3xxxx. `prefetch` instructions are slow for no apparent reason on IvB, and should be avoided. – Peter Cordes Jan 23 '21 at 00:59
  • @PeterCordes - my CPU is i7-8850H (Coffee Lake?) https://ark.intel.com/content/www/us/en/ark/products/134899/intel-core-i7-8850h-processor-9m-cache-up-to-4-30-ghz.html My measurements clearly show benefits of using this prefetch... – Vlad Feinstein Jan 23 '21 at 04:43
  • @PeterCordes re: `Could they often be in the same cache line` - not likely, the keys are typically large. How do I know the size of the cache line? – Vlad Feinstein Jan 23 '21 at 04:44
  • Cache lines are 64 bytes in all modern CPUs, everything after the Pentium III. (Including many non-x86 CPUs; it's no coincidence that it's also the max burst size for DDR RAM.) Re: CPU model: I meant that prefetch should be avoided on IvyBridge specifically. It can certainly help on other CPUs, mostly in cases other than sequential reads. But yeah, good that you checked that this particular prefetch is having a net positive effect. – Peter Cordes Jan 23 '21 at 05:48
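Following up on the same-cache-line question in the comments above, a small sketch (assuming the 64-byte line size just mentioned; `same_cache_line` is a hypothetical helper, not part of any standard API) for checking whether two addresses would land in the same line:

```cpp
#include <cassert>
#include <cstdint>

// Assuming 64-byte cache lines (true on current x86 CPUs, per the comment
// above): two addresses share a line iff they agree in all but the low 6 bits.
constexpr std::uintptr_t kLineSize = 64;

inline bool same_cache_line(const void* a, const void* b) {
    return (reinterpret_cast<std::uintptr_t>(a) / kLineSize) ==
           (reinterpret_cast<std::uintptr_t>(b) / kLineSize);
}
```

If the demand load and the prefetch do hit the same line, the prefetch is redundant; with "typically large" key deltas that should be rare here.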

1 Answer


Substantially everything on an out-of-order CPU is "asynchronous" in the sense you describe (really, running in parallel and out of order). In that sense, prefetch isn't really different from regular loads, which can also run out of order or "async" relative to other instructions.

With that understood, the exact behavior of prefetch is hardware-dependent, but my observation is that:

  • On Intel, prefetch instructions can retire before their data arrives, so a prefetch that successfully begins execution won't block the pipeline afterwards. The catch is in "successfully begins execution": a prefetch that misses in L1 still needs a line-fill buffer (MSHR), and on Intel it will wait for that resource if none is available. So if you issue many prefetch misses in parallel, they end up waiting for fill buffers, which makes them behave much like vanilla loads in that scenario.

  • On AMD Zen 2, prefetches do not wait for a fill buffer if none is available; presumably the prefetch is simply dropped. So a large number of prefetch misses behaves quite differently from Intel: the prefetches complete very quickly whether they miss or not, but many of the associated lines will never actually be fetched.
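To make the fill-buffer point concrete, here is a hedged sketch (hypothetical names; the batch size of 8 is a guess chosen to stay below the roughly 10-12 L1 fill buffers on recent Intel cores) of a batched pattern that deliberately leaves several prefetch misses in flight at once before doing the demand loads:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#include <xmmintrin.h>  // _mm_prefetch, _MM_HINT_T0

// Guess: small enough to avoid exhausting L1 fill buffers on a miss burst.
constexpr std::size_t kBatch = 8;

// Issue a batch of fire-and-forget prefetches (they proceed in parallel,
// each occupying a fill buffer on a miss), then do the demand loads, which
// should mostly hit L1 by the time they execute.
uint64_t sum_batched(const uint8_t* table, const std::vector<uint32_t>& keys) {
    uint64_t sum = 0;
    for (std::size_t base = 0; base < keys.size(); base += kBatch) {
        const std::size_t end = std::min(base + kBatch, keys.size());
        for (std::size_t i = base; i < end; ++i)   // hints only, no result
            _mm_prefetch(reinterpret_cast<const char*>(table + keys[i]),
                         _MM_HINT_T0);
        for (std::size_t i = base; i < end; ++i)   // demand loads
            sum += table[keys[i]];
    }
    return sum;
}
```

On the Intel behavior described above, an oversized batch would make the later prefetches in each burst stall for fill buffers; on Zen 2, they would instead complete quickly but some lines would simply never arrive.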

BeeOnRope