Performance Counters for DRAM Accesses

Question

I want to retrieve the number of DRAM accesses in my application. Specifically, I need to distinguish between data and code accesses. The processor is an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell). Based on the Intel Software Developer's Manual, Volume 3, and perf, I could find and categorize the following memory-access-related events:

(A)
LLC-load-misses                                    [Hardware cache event]
LLC-loads                                          [Hardware cache event]
LLC-store-misses                                   [Hardware cache event]
LLC-stores                                         [Hardware cache event]
=========================================================================
(B)
mem_load_uops_l3_miss_retired.local_dram          
mem_load_uops_retired.l3_miss  
=========================================================================
(C)
offcore_response.all_code_rd.l3_miss.any_response 
offcore_response.all_code_rd.l3_miss.local_dram   
offcore_response.all_data_rd.l3_miss.any_response 
offcore_response.all_data_rd.l3_miss.local_dram   
offcore_response.all_reads.l3_miss.any_response   
offcore_response.all_reads.l3_miss.local_dram     
offcore_response.all_requests.l3_miss.any_response
=========================================================================
(D)
offcore_response.all_rfo.l3_miss.any_response     
offcore_response.all_rfo.l3_miss.local_dram       
=========================================================================
(E)
offcore_response.demand_code_rd.l3_miss.any_response
offcore_response.demand_code_rd.l3_miss.local_dram
offcore_response.demand_data_rd.l3_miss.any_response
offcore_response.demand_data_rd.l3_miss.local_dram
offcore_response.demand_rfo.l3_miss.any_response  
offcore_response.demand_rfo.l3_miss.local_dram    
=========================================================================
(F)
offcore_response.pf_l2_code_rd.l3_miss.any_response
offcore_response.pf_l2_data_rd.l3_miss.any_response
offcore_response.pf_l2_rfo.l3_miss.any_response   
offcore_response.pf_l3_code_rd.l3_miss.any_response
offcore_response.pf_l3_data_rd.l3_miss.any_response
offcore_response.pf_l3_rfo.l3_miss.any_response

My choices are as follows:

It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).
For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).
Simplistically, LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code (As code is read-only).

Are these choices reasonable?

My other questions: (The 2nd one is the most important)

What are local_dram and any_response?
At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events was twice the number of LLC-load-misses.
Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?

Group (D) includes DRAM access events caused by Read for Ownership operations (for cache coherency protocols). It seems irrelevant to my problem.

Group (F) counts DRAM reads caused by the L2 cache prefetcher, which is also irrelevant to my problem.

Note that multiple misses to the same cache line (at the same time) will only trigger one `LLC-load-misses` event, but IIRC each one will count as a `mem_load_uops_retired.l3_miss`. e.g. if you access multiple members of a struct that all come from the same cache line, the load uops will all attach themselves to one LFB to wait for the incoming cache line. — Peter Cordes, Feb 27 '21 at 02:50
@PeterCordes - no, the subsequent misses to the same line should be `mem_load_uops_retired.hit_lfb`. — BeeOnRope, Feb 27 '21 at 09:25
By "DRAM accesses" you mean accesses that originate from cores and miss in the L3 and go to the IMC? Are you looking for a solution that works for the i7-4720HQ or for a larger collection of processors? — Hadi Brais, Feb 28 '21 at 10:15
If you don't ping me like this @HadiBrais, I may forget to follow up on your reply. — Hadi Brais, Mar 01 '21 at 17:37

Hadi Brais · Accepted Answer · 2021-03-02T10:22:09.527

Based on my understanding of the question, I recommend using the following two events on the specified processor:

OFFCORE_RESPONSE.ALL_READS.L3_MISS.LOCAL_DRAM: This includes all cacheable data read and write transactions and all code fetch transactions, whether the transaction is initiated by an instruction (retired or not) or by a prefetch of any type. Each event represents exactly one 64-byte read request to the memory controller.
OFFCORE_RESPONSE.ALL_CODE_RD.L3_MISS.LOCAL_DRAM: This includes all the code fetch accesses to the IMC.

(I think neither of these events occurs for uncacheable code fetch requests, but I've not tested this and the documentation is not clear on this.)

The "data accesses" can be measured separately from the "code accesses" by subtracting the second event from the first. These two events can be counted simultaneously on the same logical core on Haswell without multiplexing.
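As a rough illustration of that subtraction, here is a minimal Python sketch. The helper names (perf_stat_command, split_dram_reads) are hypothetical and not part of perf; the event names are the lowercase perf spellings listed in the question, and the 64-byte figure is the request size stated above. It assumes both counts were collected in the same run.

```python
# Sketch only: helper names are hypothetical, not part of perf.
CACHE_LINE_BYTES = 64  # per the answer, each event is one 64-byte read request

# Lowercase perf spellings of the two recommended events (from the question's list).
ALL_READS = "offcore_response.all_reads.l3_miss.local_dram"
CODE_READS = "offcore_response.all_code_rd.l3_miss.local_dram"


def perf_stat_command(workload):
    """Build a `perf stat` argument list counting both events for `workload`."""
    return ["perf", "stat", "-e", f"{ALL_READS},{CODE_READS}", "--", *workload]


def split_dram_reads(all_reads, code_reads):
    """Split the ALL_READS count into code and data parts by subtraction."""
    data_reads = all_reads - code_reads
    return {
        "code_reads": code_reads,
        "data_reads": data_reads,
        "code_bytes": code_reads * CACHE_LINE_BYTES,
        "data_bytes": data_reads * CACHE_LINE_BYTES,
    }


if __name__ == "__main__":
    # "./my_app" and the counts below are made-up placeholders, not measurements.
    print(" ".join(perf_stat_command(["./my_app"])))
    print(split_dram_reads(all_reads=1000, code_reads=200))
```

The counts passed to split_dram_reads would be read off the perf stat output by hand or by a parser of your choice; the sketch deliberately does not parse perf output.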

There are of course other transactions that do go to the IMC but are not counted by either of the two mentioned events. These include: (1) L3 writebacks, (2) uncacheable partial reads and writes from cores, (3) full WCB evictions, and (4) memory accesses from IO devices. Depending on the workload, it's not unlikely that accesses of types (1), (3), and (4) constitute a significant fraction of total accesses to the IMC.

It seems that the sum of LLC-load-misses and LLC-store-misses will return the whole DRAM accesses (equivalently, I could use LLC-misses in Perf).

Note the following:

The event LLC-load-misses is a perf event mapped to the native event OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE.
The event LLC-store-misses is mapped to OFFCORE_RESPONSE.DEMAND_RFO.L3_MISS.ANY_RESPONSE.

These are not the events you want because:

The ANY_RESPONSE bit indicates that the event can occur for requests that target any unit, not just the IMC.
These events count L1 data prefetches and page walk requests, but not L2 data prefetches. You'd want to count all prefetches that consume memory bandwidth in general.

For data-only accesses, I used mem_load_uops_retired.l3_miss. It does not include stores, but seems to be OK (because stores seem to be much less frequent?!).

There are a number of issues with using mem_load_uops_retired.l3_miss on Haswell:

There are cases where this event is unreliable, so it should be avoided if there are alternatives. Otherwise, the analysis methodology should take into account the potential unreliability of this event count.
The event only occurs for requests from retired loads; it omits speculative loads and all stores, which can be significant.
Doing arithmetic with this event and other events in a meaningful way is not easy. For example, your suggestion of doing "LLC-load-misses - mem_load_uops_retired.l3_miss = DRAM Accesses for Code" is incorrect.

What are local_dram and any_response?

Not all requests that miss in the L3 go to the IMC. A typical example is memory-mapped IO requests. You said you only want the core-originated requests that go to the IMC, so local_dram is the right bit.

At first, it seems that group (C) is a higher-resolution version of the load events of group (A). But my tests show that the events in the former group are much more frequent than those in the latter. For example, in a simple benchmark, the number of offcore_response.all_reads.l3_miss.any_response events was twice the number of LLC-load-misses.

This is normal because offcore_response.all_reads.l3_miss.any_response is inclusive of LLC-load-misses and can easily be significantly larger.

Group (E) pertains to demand reads (i.e., all non-prefetched reads). Does this mean that, e.g.: offcore_response.all_data_rd.l3_miss.any_response - offcore_response.demand_data_rd.l3_miss.any_response = DRAM read accesses caused by prefetching?

No, because:

the any_response bit as explained above,
this subtraction results in only the L2 data load prefetches, not all data load hardware and software prefetches.

Thanks @HadiBrais! **1.** Do you know the difference between `local_dram` and `any_response`? **2.** Why should I use **subtraction** to find data reads? Isn't `all_data_rd` better? **3.** It seems that **unlike** `offcore events`, `LLC_load_misses` and `mem_load_uops_l3_miss_retired.local_dram` do not include **prefetch operations**. Then, why is `LLC_load_misses` usually **much more frequent** than `mem_load_uops_l3_miss_retired.local_dram` (almost **twice** as frequent in a *simple* benchmark)? — TheAhmad, Mar 01 '21 at 20:32
`LLC_load_misses` (a.k.a, `OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE`) is **always** almost **twice** as frequent as `mem_load_uops_l3_miss_retired.local_dram` . It seems to be **the rule** rather than the exception. Do I **miss** something? How can I **confirm** that? — TheAhmad, Mar 02 '21 at 11:28
@TheAhmad You didn't provide enough info for me to comment on that. It depends on the code and execution environment. You can post another question if it's important to you. — Hadi Brais, Mar 02 '21 at 11:58
The documentation says events with **unknown data sources** are **excluded** from `mem_load_uops_l3_miss_retired.local_dram`. Can this be the reason for such a large difference? — TheAhmad, Mar 02 '21 at 12:09
@TheAhmad I'd rather not guess without going over all the details. But it's normal for `OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE` to be much larger. — Hadi Brais, Mar 02 '21 at 12:20
Sorry! I mistakenly said `OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.ANY_RESPONSE`. `OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.LOCAL_DRAM` is much larger than `mem_load_uops_l3_miss_retired.local_dram`. But **not much** difference! I will post another question with some simple benchmarks! — TheAhmad, Mar 02 '21 at 13:04
Can `OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_MISS.LOCAL_DRAM` be used to measure *per-process* stats (it seems to be **shared** between cores!), or should I use something like `mem_load_uops_l3_miss_retired.local_dram`? — TheAhmad, Mar 08 '21 at 08:06
@TheAhmad Sure. What makes you think it's shared between cores? — Hadi Brais, Mar 08 '21 at 08:34
Thanks! Well, my knowledge is **not** deep! I thought *shared* counter is a **result** of *shared* resource. Based on your answer, for `OFFCORE` events (similar to `CORE` events) there should be a **dedicated** counter for **each** (*logical?*) core. — TheAhmad, Mar 08 '21 at 08:54
@TheAhmad Yes, this event counts per logical core on Haswell, no problem. — Hadi Brais, Mar 08 '21 at 09:06
But the last line in the caption for figure 1 in the following document seems to contradict this assumption (i.e., reporting per core stats): https://hal.inria.fr/hal-01285522/document — TheAhmad, Mar 09 '21 at 16:09
@TheAhmad The OFFCORE_RESPONSE events are NOT uncore events. The former are programmed on core counters while the latter are programmed on uncore counters. The uncore events are discussed in documents titled something like "uncore monitoring guide." — Hadi Brais, Mar 09 '21 at 16:14
