
I had posted a question in a similar context over here.

After figuring out a few issues, I have brought down the jitter.

I will describe my scenario.

My kernel boot parameters look like:

nmi_watchdog=0 intel_idle.max_cstate=0 processor.max_cstate=0 nohz_full=7-11 isolcpus=7-11 mce=off rcu_nocbs=7-11 nosoftlockup cpuidle.off=1 powersave=off nonmi_ipi nowatchdog
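Not part of the original setup, but a quick way to sanity-check that these parameters actually took effect after boot (the sysfs files below exist on reasonably recent kernels):

# What the kernel actually booted with
cat /proc/cmdline

# Which CPUs ended up isolated and in nohz_full mode
cat /sys/devices/system/cpu/isolated
cat /sys/devices/system/cpu/nohz_full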

I have a kernel module which is responsible for sending packets at a given interval (here, one packet every 1 ms).

  • I have a packet generator pinned to CPU 9
  • I have a kernel module (or kthread) pinned to CPU 8
  • I have set the IRQ affinity of my RX queue to CPU 10 (roughly as sketched below)
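For reference, the user-space side of this setup looks roughly like the sketch below (the kthread itself is pinned from inside the module, e.g. via kthread_bind()). The interface name eth0 and IRQ number 123 are placeholders; the real values come from /proc/interrupts on the machine in question:

# Find the IRQ number of the NIC rx queue (interface name is a placeholder)
grep eth0 /proc/interrupts

# Steer that IRQ to CPU 10 only (123 is a placeholder IRQ number)
echo 10 | sudo tee /proc/irq/123/smp_affinity_list

# Run the packet generator pinned to CPU 9
# (test.sh as in the perf command below, assuming that is what drives the generator)
taskset -c 9 ./test.sh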

Thus, I executed the following command to get perf stats

sudo ./perf stat -a -d -I 1000 --cpu=8 taskset -c 9 ./test.sh

With the above command I am trying to profile the events on CPU core 8; an excerpt of the output I got is posted further below.
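To spell out what that command is counting (these are standard perf stat options):

# -a        : count system-wide (all tasks)...
# --cpu=8   : ...but restrict the counters to CPU 8
# -d        : add the "detailed" cache events (L1-dcache-*, LLC-*)
# -I 1000   : print the counter deltas every 1000 ms
# the measured window lasts as long as the workload, which itself is
# launched pinned to CPU 9 via taskset
sudo ./perf stat -a -d -I 1000 --cpu=8 taskset -c 9 ./test.sh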

With the CPUs partitioned this way, these components should not interfere with each other.

     5.002780500        1000.296809      cpu-clock (msec)          #    1.000 CPUs utilized
     5.002780500                  0      context-switches          #    0.000 K/sec
     5.002780500                  0      cpu-migrations            #    0.000 K/sec
     5.002780500                  0      page-faults               #    0.000 K/sec
     5.002780500             88,531      cycles                    #    0.000 GHz
     5.002780500             29,738      instructions              #    0.33  insn per cycle
     5.002780500              6,639      branches                  #    0.007 M/sec
     5.002780500                118      branch-misses             #    1.72% of all branches
     5.002780500              7,677      L1-dcache-loads           #    0.008 M/sec
     5.002780500                318      L1-dcache-load-misses     #    4.04% of all L1-dcache hits
     5.002780500                196      LLC-loads                 #    0.196 K/sec
     5.002780500                169      LLC-load-misses           #   84.08% of all LL-cache hits
Round 0
     6.330091222        1327.302728      cpu-clock (msec)          #    1.327 CPUs utilized
     6.330091222                  1      context-switches          #    0.001 K/sec
     6.330091222                  1      cpu-migrations            #    0.001 K/sec
     6.330091222                  0      page-faults               #    0.000 K/sec
     6.330091222      2,401,268,484      cycles                    #    2.276 GHz
     6.330091222      1,700,438,285      instructions              #    4.25  insn per cycle
     6.330091222        400,075,413      branches                  #  379.216 M/sec
     6.330091222              9,587      branch-misses             #    0.01% of all branches
     6.330091222        300,135,708      L1-dcache-loads           #  284.487 M/sec
     6.330091222             12,520      L1-dcache-load-misses     #    0.03% of all L1-dcache hits
     6.330091222              6,865      LLC-loads                 #    0.007 M/sec
     6.330091222              5,177      LLC-load-misses           #  394.69% of all LL-cache hits
Round 1
     7.343309295        1013.219838      cpu-clock (msec)          #    1.013 CPUs utilized
     7.343309295                  2      context-switches          #    0.002 K/sec
     7.343309295                  1      cpu-migrations            #    0.001 K/sec
     7.343309295                  0      page-faults               #    0.000 K/sec
     7.343309295      2,401,313,050      cycles                    #    2.289 GHz
     7.343309295      1,700,446,830      instructions              #    2.48  insn per cycle
     7.343309295        400,076,590      branches                  #  381.375 M/sec
     7.343309295              9,722      branch-misses             #    0.01% of all branches
     7.343309295        300,137,590      L1-dcache-loads           #  286.108 M/sec
     7.343309295             12,429      L1-dcache-load-misses     #    0.01% of all L1-dcache hits
     7.343309295              6,787      LLC-loads                 #    0.006 M/sec
     7.343309295              5,167      LLC-load-misses           #  246.77% of all LL-cache hits

The line 'Round x' means that we are sending 1 packet every ms, i.e. 1000 packets every second and thus per round.

What I have not been able to understand from the above dump is the LLC-load-misses figure. To be precise, I have not found a way to dig into the source of these misses.

Any inputs on this issue would be very helpful.

Regards, Kushal.

  • You mean that `LLC-load-misses` can be > 100% of `LL-cache hits`? I think there was a similar question recently about what exactly those events count on different CPUs. – Peter Cordes Sep 02 '18 at 16:46
  • yes. for instance, at the end of Round 0, the value of LLC-load-misses is 394.69%. I wanted to dig deeper into the cause of this, but I am not sure where exactly I should look. As far as I understand, this is the metric for events where the CPU fails to find the data in the cache, right? – cooshal Sep 02 '18 at 16:51
  • I couldn't find a duplicate; the one I was thinking of was actually about confusing iTLB events: [how to interpret perf iTLB-loads,iTLB-load-misses](https://stackoverflow.com/q/49933319). But anyway, like [Cache misses in an infinite loop with no memory references?](https://stackoverflow.com/q/36256010), the absolute number of LLC load misses is very low. The vast majority of loads are hitting in L1d, and half the rest are hitting in L2. Only the random poor-locality loads are missing in L2, so L3 isn't helping here. I don't know where that `246.77%` is calculated from. – Peter Cordes Sep 02 '18 at 18:52
  • 3
  • It is not necessarily weird that this number is > 100%. It is a _ratio_ of cache misses to cache hits, not the "hit rate", so if you miss 2 out of 3 accesses you have a ratio of 2 misses to 1 hit, and hence a value in perf of 200%. I'm not sure why perf chooses to display it this way, but you can calculate the other value easily since you have the raw data for `LLC-load-misses` and `LLC-loads`: for your 394% example they are 5,177 / 6,865 = ~75% miss rate (25% hit rate). That doesn't seem like an unreasonable value for a process with poor locality. – BeeOnRope Sep 02 '18 at 19:49
  • Thank you, Peter and BeeOnRope. @BeeOnRope, I was very eager after reading one of your lines "That doesn't seem like an unreasonable value for a process with poor locality." I am sure there has to be a problem in my implementation. Is there a way that this thing can be improved or tracked down? Thank you again. – cooshal Sep 02 '18 at 21:12
  • indeed !!! :D thank you for your inputs and explanation. – cooshal Sep 02 '18 at 21:16
  • 2
    I mean "poor locality" doesn't necessarily mean a bad design or that it can be improved. If you process is doing very little (e.g,. sending a packet here and these), but what it does do involves communicating across cores, you are _at best_ going to be missing to L3 all the time, since that's the sharing level for most Intel CPUs. If you are receiving packets, you might miss to memory depending on how the network card works (i.e., if it dumps packets into memory or if it has the feature to put the packets in some caching level). – BeeOnRope Sep 02 '18 at 21:20

1 Answer


The number of LLC-load-misses should be interpreted as the number of loads that miss in the last level cache (typically the L3 for modern Intel chips) over the interval measured.

At the level this is measured, I believe loads going to the same cache line have already been "combined" by the line fill buffers: if you access several values all in the same cache line, and that line isn't present in the LLC, they all "miss" from the point of view of your process (the use of any of those values will wait for the full miss duration), but I believe this is counted as only one miss for the LLC-load-misses counter.
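As a quick sanity check on the scary-looking percentages, you can compute a plain miss rate from the raw counters yourself; using the Round 0 numbers from the question:

# LLC-load-misses / LLC-loads for the Round 0 interval
awk 'BEGIN { misses = 5177; loads = 6865;
             printf "LLC load miss rate: %.1f%%\n", 100 * misses / loads }'
# prints: LLC load miss rate: 75.4%   (i.e. roughly a 25% hit rate in L3)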

  • Hi! thank you for your answer. I have one more question here. Is it possible to track down the cause of these events? Those numbers are giving me significantly high and inconsistent jitters. Any pointers on this would be great. – cooshal Sep 02 '18 at 21:08
  • 1
  • Sure, you can use `perf record`, which can capture the location that various events occurred at (sometimes with skid). There is also `perf mem`, which is specifically for tracking down memory accesses that miss in various levels of the cache, based on special support for "sampling" such accesses in recent Intel CPUs (a starting-point sketch follows below). @Cooshal – BeeOnRope Sep 02 '18 at 21:16
  • thanks... I have tried perf record, stat and script. But perf stat showed me that "scary" `LLC-load-misses` value and I was a bit concerned. probably, something else is also going on. I will try to figure that out. thanks again @BeeOnRope – cooshal Sep 02 '18 at 22:04
  • 1
  • @Cooshal - no problem. The most important thing is to have a good mental model of how you expect your program to behave, in terms of total memory accesses, cache misses, branch predictions, etc - and then you can eyeball some numbers and understand if they are reasonable. – BeeOnRope Sep 03 '18 at 00:22
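A concrete starting point for the perf record / perf mem drill-down mentioned in the comments above (a sketch, not from the original thread; the event and subcommand names are standard perf, and the 10-second window is arbitrary):

# Sample where LLC load misses are happening on the isolated CPU (CPU 8 here)
# while the traffic is running, then inspect the hot spots
sudo perf record -e LLC-load-misses -C 8 -- sleep 10
sudo perf report

# perf mem uses the CPU's memory-access sampling (PEBS on recent Intel chips)
# to attribute individual loads to the cache level or memory they were served from
sudo perf mem record -a sleep 10
sudo perf mem report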