
I have an Intel(R) Core(TM) i7-4720HQ CPU @ 2.60GHz (Haswell) processor. AFAIK, `mem_load_uops_retired.l3_miss` counts the number of demand (i.e., non-prefetch) data read accesses served from DRAM. `offcore_response.demand_data_rd.l3_miss.local_dram`, as its name suggests, counts the number of demand data reads that miss the L3 and are serviced by local DRAM. These two events therefore seem to be equivalent, or at least nearly so. But based on the following benchmarks, the former event is much less frequent than the latter:

1) Initializing a 1000-Element Global Array in a Loop in C:
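
A minimal sketch of the kind of loop I mean (the element type and stored values are illustrative, not my exact code), measured with `perf stat -e mem_load_uops_retired.l3_miss,offcore_response.demand_data_rd.l3_miss.local_dram ./loop`:

```c
/* loop.c -- sketch: initialize a 1000-element global array in a loop.
   The element type and stored values are assumptions for illustration. */
#include <stddef.h>

#define N 1000

int array[N]; /* global, so it sits in .bss and is faulted in on first touch */

int main(void)
{
    for (size_t i = 0; i < N; i++)
        array[i] = (int)i;
    return 0;
}
```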

Performance counter stats for '/home/ahmad/Simple Progs/loop':

         1,363      mem_load_uops_retired.l3_miss                                   
         1,543      offcore_response.demand_data_rd.l3_miss.local_dram                                   

   0.000749574 seconds time elapsed

   0.000778000 seconds user
   0.000000000 seconds sys

2) Opening a PDF Document in Evince:

Performance counter stats for '/opt/evince-3.28.4/bin/evince':

       936,152      mem_load_uops_retired.l3_miss                                   
     1,853,998      offcore_response.demand_data_rd.l3_miss.local_dram                                   

   4.346408203 seconds time elapsed

   1.644826000 seconds user
   0.103411000 seconds sys

3) Running Wireshark for 5 seconds:

Performance counter stats for 'wireshark':

     5,161,671      mem_load_uops_retired.l3_miss                                   
     8,126,526      offcore_response.demand_data_rd.l3_miss.local_dram                                   

  15.713828395 seconds time elapsed

   0.904280000 seconds user
   0.693906000 seconds sys

4) Running Blur Filter on an Image in Inkscape:

Performance counter stats for 'inkscape':

    13,852,121      mem_load_uops_retired.l3_miss                                   
    23,475,970      offcore_response.demand_data_rd.l3_miss.local_dram                                   

  25.355643897 seconds time elapsed

   7.244404000 seconds user
   1.019895000 seconds sys

In all four benchmarks, `offcore_response.demand_data_rd.l3_miss.local_dram` is nearly twice as frequent as `mem_load_uops_retired.l3_miss`. Is this reasonable? Why? Please tell me if the benchmarks are too complicated and coarse-grained!


1 Answer


The following table shows the differences between these two events on Haswell to the best of my (current) knowledge:

| | `mem_load_uops_retired.l3_miss` | `offcore_response.demand_data_rd.l3_miss.local_dram` |
| --- | --- | --- |
| Cacheable Retired Load Uops | Per uop per line | Y |
| Cacheable Non-Retired Load Uops | N | Y |
| Uncacheable WC Retired Load Uops | One event per line | N |
| Uncacheable UC Retired Load Uops | May occur | N |
| Uncacheable WC or UC Non-Retired Load Uops | N | N |
| Locked Loads of any type to any memory type | May occur | I don't know |
| Legacy IO requests | May occur | N |
| L1D Prefetches | N | Y |
| L2 Prefetches into L2 or L3 | N | N |
| Software prefetches with no intention for write | N | Y |
| Page Walk Loads | N | Y |
| Servicing Unit | Any | Local DRAM |
| Reliability | May not be reliable | Reliable |

It should be clear now that these two events are, in general, not equivalent at all. Comparing their counts to deduce something meaningful is therefore not an easy task.
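
To make one row of the table concrete: according to the table, software prefetches with no intention to write are counted by the offcore event but not by the retired-load event, because no load uop retires for them. Here is a minimal sketch of that case (buffer size and prefetch hint are illustrative; real counts also pick up page walks, hardware prefetches, and startup code):

```c
/* Sketch: PREFETCHNTA requests show up as offcore demand data reads on
   Haswell (per the table above), but no load uop retires for them, so
   mem_load_uops_retired.l3_miss should not count them. */
#include <stdlib.h>
#include <string.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_NTA */

#define SIZE (64u * 1024 * 1024)   /* much larger than the L3 */

int main(void)
{
    char *buf = malloc(SIZE);
    if (!buf) return 1;
    memset(buf, 1, SIZE);   /* populate the pages first; prefetches to
                               not-present pages are silently dropped */
    for (size_t i = 0; i < SIZE; i += 64)   /* one prefetch per cache line */
        _mm_prefetch(buf + i, _MM_HINT_NTA);
    free(buf);
    return 0;
}
```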

In all of the examples you presented, the `offcore_response.demand_data_rd.l3_miss.local_dram` count is larger than the `mem_load_uops_retired.l3_miss` count. However, it's not hard to come up with real examples where the opposite is true.
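
One mechanism, per the first row of the table: the retired-load event counts per uop per line, while the offcore event counts one request per line. So a loop that reads a large array with several load uops per 64-byte line can, in principle, record several retired-uop misses against a single local-DRAM request. A hedged sketch of that pattern (whether the retired count actually ends up larger depends on fill-buffer timing and on how much the hardware prefetchers absorb):

```c
/* Sketch: eight 8-byte loads per 64-byte cache line. Loads that arrive
   while the line is still in flight can each be flagged as an L3 miss,
   while the offcore side sees one DRAM request for the whole line. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define SIZE (64u * 1024 * 1024)   /* much larger than the L3 */

int main(void)
{
    uint64_t *buf = malloc(SIZE);
    if (!buf) return 1;
    memset(buf, 1, SIZE);          /* populate the pages with real data */
    uint64_t sum = 0;
    for (size_t i = 0; i < SIZE / sizeof(uint64_t); i++)
        sum += buf[i];             /* sequential: 8 load uops per line */
    return (int)(sum & 1);         /* keep the loads from being optimized out */
}
```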

> In all four benchmarks, `offcore_response.demand_data_rd.l3_miss.local_dram` is nearly twice as frequent as `mem_load_uops_retired.l3_miss`. Is this reasonable?

I think the description "nearly twice" really only applies to the second example, not the others. I can't comment further on the numbers you've shown without seeing the exact code and execution-environment information.

  • Thanks! Can these differences lead to such a large difference in event occurrences? – TheAhmad Mar 03 '21 at 21:40
  • *Opening and Closing `Gedit`*: `817,014` vs. `1,264,826` / *Opening and Closing `Libreoffice`*: `2,795,660` vs. `3,970,107`. Here, around `1.5x`. The `offcore` event is **often** the **more frequent** one. Something must be **missing**! – TheAhmad Mar 03 '21 at 21:44
  • @TheAhmad: Note in Hadi's table that L1d prefetches count as offcore "demand" requests. If a good fraction of the loads are actually initiated by the L1d prefetcher, not actual load uops, that could explain the observation. (IDK how far ahead the L1d prefetcher looks, so it might not really be helping hide as much latency as one would like. Also, some small fraction of it could be prefetch or speculative exec of real loads past the end of an array, where the loop actually stops looping.) The fact that actual demand misses on mem_load uops are lower hopefully is a sign prefetch is working. – Peter Cordes Mar 04 '21 at 08:44
  • @TheAhmad Branch mispredictions and other causes of pipeline flushes or replays can play a partial, but very significant role in why `demand_data_rd.l3_miss.local_dram` may be larger than `mem_load_uops_retired.l3_miss`. Example 1 seems to be a simple program where most loads will actually end up retiring, and the event counts are fairly close there. Note, though, that you're counting both kernel and user mode events, so these counts are perturbed by system calls and interrupt handlers. Ultimately, it depends on the exact code being profiled. – Hadi Brais Mar 04 '21 at 16:05
  • Firstly, Thanks! I think I should accept this *informative* answer, until, possibly, a *more thorough* answer is provided. – TheAhmad Mar 04 '21 at 16:39
  • A big difference is that the "retired" events count what happened only for uops that retire. It's possible for many or even most accesses to be on a speculative path that never retires, and those are invisible to the "retired" counters. – BeeOnRope Mar 07 '21 at 05:34