How to measure late prefetches and killed prefetches on Haswell micro-architecture?

Question

I am using an Intel Xeon E5-2660 v3 and issuing many software prefetches to exploit memory-level parallelism (MLP) and to reduce stall time. Now I want to profile the application to measure the overall gain from software prefetching.

In the paper "Improving the Effectiveness of Software Prefetching with Adaptive Execution", the authors discuss the hardware performance-counter support related to software prefetching.

Here is the passage from the paper where the authors talk about the performance counters:

Furthermore, the only hardware support required by the best adaptive scheme is a pair of counters: one measuring the number of late prefetches (the ones arriving after the processor has requested the data) and another one measuring the number of prefetches killed as a result of cache conflicts.

I want to profile the application on the Haswell microarchitecture but couldn't find any such performance counters in perf or PAPI. Are there other performance counters that capture these events, and what is the best way to measure them for a small part of the code rather than for the full application?

Paper Link

Peter Cordes · Answer 1 · 2017-12-29T01:58:44.557

ocperf.py is a wrapper for perf with symbolic names for uarch-specific events like load_hit_pre.sw_pf (counts when a demand-load dispatched to a load port hits an L1D fill buffer (FB) allocated for software prefetch). ocperf.py list has descriptions as well as names.

That's probably a useful one to look at, but I haven't used it myself so IDK if it's really going to be exactly what you need. Definitely look through the event list (ocperf.py list | less).

You should also look at L1D miss rate; with successful prefetching that manages to stay ahead of demand loads, the actual load instructions should hit in L1D. (And plain perf can track this with L1-dcache-load-misses.)
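As a rough sketch of that miss-rate check, the snippet below parses the CSV output that `perf stat -x,` produces and divides load misses by loads. The helper names (`parse_perf_csv`, `l1d_miss_rate`) and the sample counts are made up for illustration. The event names `L1-dcache-loads` and `L1-dcache-load-misses` are perf's own, and the column layout (value first, event name third) is an assumption you should check against the output of your perf version.

```python
import csv
import io

def parse_perf_csv(text):
    """Map event name -> count from `perf stat -x,` style output.

    Assumes column 0 is the counter value and column 2 is the event name.
    """
    counts = {}
    for row in csv.reader(io.StringIO(text)):
        if len(row) < 3 or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        try:
            counts[row[2]] = int(row[0])
        except ValueError:
            continue  # e.g. "<not counted>" or "<not supported>"
    return counts

def l1d_miss_rate(counts):
    """Fraction of L1D loads that missed (0.0 if no loads were counted)."""
    loads = counts["L1-dcache-loads"]
    misses = counts["L1-dcache-load-misses"]
    return misses / loads if loads else 0.0

# Invented sample input, NOT real measurements.
SAMPLE = """\
1000000,,L1-dcache-loads,2000000,100.00,,
25000,,L1-dcache-load-misses,2000000,100.00,,
<not supported>,,some-event,0,100.00,,
"""

if __name__ == "__main__":
    print(f"L1D load miss rate: {l1d_miss_rate(parse_perf_csv(SAMPLE)):.3f}")
```

perf stat writes its counts to stderr, so something like `perf stat -x, -e L1-dcache-loads,L1-dcache-load-misses ./your_app 2> counts.csv` (with `./your_app` standing in for your binary) should give you a file to feed into `parse_perf_csv`.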

For measuring lines that were prefetched but evicted before use, there's l2_lines_out.useless_hwpf. "Counts the number of lines that have been hardware prefetched but not used and now evicted by L2 cache". l2_lines_out.useless_pref is an alias for that; it doesn't look like there's a similar event that includes SW prefetch.

You may just need to look at the L1D miss rate; that should tell you where the range of sweet spots for prefetch distance lies. If load_hit_pre.sw_pf works as I hope, then L1D misses with low counts for load_hit_pre.sw_pf mean your prefetch distance is too high. (Or that SW prefetch requests are being dropped for some other reason, but I think only HW prefetch requests get dropped when there's a lot of demand-load utilization.)
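One way to turn that into a procedure is to rebuild the code with several prefetch distances, record the L1D load-miss count for each, and keep every distance that comes close to the best one. The sketch below does only the last step. The function name `sweet_spot`, the 10% tolerance, and the example counts are all hypothetical choices for illustration, not measurements or anything perf provides.

```python
def sweet_spot(misses_by_distance, tolerance=0.10):
    """Return the prefetch distances whose miss count is within
    `tolerance` (relative) of the lowest one, in ascending order."""
    if not misses_by_distance:
        return []
    best = min(misses_by_distance.values())
    limit = best * (1.0 + tolerance)
    return sorted(d for d, m in misses_by_distance.items() if m <= limit)

# Invented counts, NOT real measurements:
# prefetch distance (in loop iterations) -> L1D load misses
EXAMPLE = {4: 90000, 8: 42000, 16: 21000, 32: 20000, 64: 21500, 128: 60000}

if __name__ == "__main__":
    print(sweet_spot(EXAMPLE))
```

Reporting a range rather than a single minimum fits the "range of sweet spots" idea: run-to-run noise in the counters makes small differences between neighbouring distances hard to trust.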

perf-counter hardware events for stores are much more limited than for loads, so if you're trying to prefetch for a write-only stream, it's going to be harder to measure. The HW prefetcher in L1D may not even prefetch for stores at all, so different access patterns for write-only streams can suffer a lot. See also @BeeOnRope's comment on this answer: SW prefetch for stores can help if they hit in L2 but not L1D. prefetchw is ideal, but plain prefetcht0 is still useful. (prefetchw runs as a NOP on Haswell and earlier.)

See also other links in the x86 tag wiki

`l2_lines_out_useless_hwpf` and `l2_lines_out_useless_pref` are two different names for the same event. Note also that recent `perf` versions also have the arch specific events. Since I updated to the 4.10.x kernel, `perf` has pretty much the same events as `ocperf` (i.e., all the Skylake ones) including the two you mention. About stores and the L1 prefetchers, it is my impression that the L1 hardware prefetchers are _never triggered by stores_. You can effectively SW prefetch for stores even without `prefetchw`, however. — BeeOnRope, Dec 28 '17 at 21:30
Hi Peter, if that's not too much of a burden, do you mind clarifying the part about `prefetchw` being a NOP? Thank you! — Margaret Bloom, Dec 29 '17 at 08:39
@MargaretBloom: Before Core2, Intel CPUs #UD on the `prefetchw` 3dNOW instruction. From Core2 to Haswell, it's a NOP. (Windows 8.1 for x86-64 requires that `prefetchw` doesn't fault; maybe Microsoft asked for Intel to run it at least as a NOP.) From Broadwell onward, Intel CPUs set the `prefetchw` CPUID feature bit and run it as an actual prefetch into Exclusive state. — Peter Cordes, Dec 29 '17 at 13:56
