I'm looking for a way to estimate the number of L3 cache misses using the 'IA32_PERFEVTSELx' and 'IA32_PMCx' MSR pair on my Linux PC with an Intel CPU (Intel i7 6700, Skylake). To do that, I installed a timer in the kernel that reports the value of a PMC periodically (every 1 second). In the code, I make sure that "0x41412E" is written to the IA32_PERFEVTSEL1 MSR (address 0x187) and then read the value of the IA32_PMC1 MSR (address 0xC2). In that value, the Event Select field is 0x2E, the UMask field is 0x41, bit 16 is the User (USR) bit, and bit 22 is the Enable (EN) bit:

uint64_t val = 0x41412E; // UMask: 0x41 + Event Select: 0x2E + USR bit (16) + EN bit (22)
uint64_t ret = 0x0;

// 0x187 is the address of the IA32_PERFEVTSEL1 MSR.
// rdmsrl_safe()/wrmsrl_safe() are the 64-bit variants; rdmsr_safe() takes separate low/high words.
if ( rdmsrl_safe(0x187, &ret) || ret != val ) {
    if ( wrmsrl_safe(0x187, val) ) {
        TEMP_DEBUG("failed to write msr!!!");
    }
}

if ( rdmsrl_safe(0xC2, &ret) ) { // 0xC2 is the address of the IA32_PMC1 MSR
    TEMP_DEBUG("failed to read msr!!!");
} else {
    TEMP_DEBUG("rdmsr: %llu", (unsigned long long)ret);
}

I expected the value to be the number of L3 cache misses, but it looks strange. It seems far too high, so I suspect it is not the number of L3 cache misses, and I could not find what it means in the manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide). The values I observed are below:

rdmsr: 0 at start_shscan(56) in mcsched.c
rdmsr: 0 at start_shscan(56) in mcsched.c
rdmsr: 8595908 at start_shscan(56) in mcsched.c
rdmsr: 17274482 at start_shscan(56) in mcsched.c
rdmsr: 21449216 at start_shscan(56) in mcsched.c
rdmsr: 26305745 at start_shscan(56) in mcsched.c
rdmsr: 26511242 at start_shscan(56) in mcsched.c
rdmsr: 33316291 at start_shscan(56) in mcsched.c
rdmsr: 34736360 at start_shscan(56) in mcsched.c
rdmsr: 35151932 at start_shscan(56) in mcsched.c
rdmsr: 43806356 at start_shscan(56) in mcsched.c
rdmsr: 51132302 at start_shscan(56) in mcsched.c
rdmsr: 59797757 at start_shscan(56) in mcsched.c
rdmsr: 0 at start_shscan(56) in mcsched.c
rdmsr: 0 at start_shscan(56) in mcsched.c
rdmsr: 6820029 at start_shscan(56) in mcsched.c
rdmsr: 8322078 at start_shscan(56) in mcsched.c
rdmsr: 63313471 at start_shscan(56) in mcsched.c
rdmsr: 397962 at start_shscan(56) in mcsched.c
rdmsr: 9429026 at start_shscan(56) in mcsched.c
rdmsr: 18124455 at start_shscan(56) in mcsched.c
rdmsr: 23706367 at start_shscan(56) in mcsched.c
rdmsr: 27087960 at start_shscan(56) in mcsched.c
rdmsr: 68769660 at start_shscan(56) in mcsched.c
rdmsr: 69110424 at start_shscan(56) in mcsched.c
rdmsr: 78216541 at start_shscan(56) in mcsched.c
rdmsr: 87385467 at start_shscan(56) in mcsched.c
rdmsr: 95083478 at start_shscan(56) in mcsched.c
rdmsr: 101347654 at start_shscan(56) in mcsched.c
rdmsr: 8327692 at start_shscan(56) in mcsched.c
rdmsr: 27377092 at start_shscan(56) in mcsched.c
rdmsr: 36316258 at start_shscan(56) in mcsched.c
rdmsr: 45323291 at start_shscan(56) in mcsched.c
rdmsr: 54366010 at start_shscan(56) in mcsched.c
rdmsr: 63135801 at start_shscan(56) in mcsched.c
rdmsr: 72037000 at start_shscan(56) in mcsched.c
rdmsr: 81032798 at start_shscan(56) in mcsched.c
rdmsr: 89975340 at start_shscan(56) in mcsched.c
rdmsr: 98661287 at start_shscan(56) in mcsched.c
rdmsr: 107482921 at start_shscan(56) in mcsched.c
rdmsr: 116290561 at start_shscan(56) in mcsched.c
rdmsr: 125135979 at start_shscan(56) in mcsched.c
rdmsr: 133920103 at start_shscan(56) in mcsched.c
rdmsr: 142695638 at start_shscan(56) in mcsched.c
rdmsr: 151456156 at start_shscan(56) in mcsched.c
rdmsr: 160171239 at start_shscan(56) in mcsched.c
rdmsr: 168879495 at start_shscan(56) in mcsched.c
rdmsr: 177788861 at start_shscan(56) in mcsched.c
rdmsr: 186589920 at start_shscan(56) in mcsched.c
rdmsr: 195331675 at start_shscan(56) in mcsched.c
rdmsr: 204166715 at start_shscan(56) in mcsched.c
rdmsr: 213045449 at start_shscan(56) in mcsched.c
rdmsr: 221942627 at start_shscan(56) in mcsched.c
rdmsr: 231073520 at start_shscan(56) in mcsched.c

Did I make a mistake in the code? If not, could you give me some advice on how to interpret these values?

======================= Added contents below ==========================

@Peter Cordes, I referred to the Intel manual (Intel® 64 and IA-32 Architectures Software Developer’s Manual Volume 3B: System Programming Guide) and intended to use 'LLC Misses', which is one of the pre-defined architectural performance events in the table below:

Table 18-1. UMask and Event Select Encodings for Pre-Defined Architectural Performance Events in the Intel manual

I think an example with perf will make this clearer: I can run "perf stat -e r412e ls" to count L3 cache misses during the "ls" command. "r412e" breaks down into 'r' + '41' + '2e', where 'r' stands for 'Raw hardware event descriptor', 41 is the UMask (0x41), and 2e is the Event Select (0x2e). You can see this format in the output of 'perf list'.

nickeys
  • What name is the event that you think you're measuring? `mem_load_retired.l3_miss`, or one of the other ones? What is running on the CPU core that you're recording counts for? Are the numbers anywhere near what you get for `perf stat --per-core -C $core_number -e LLC-load-misses` for that core? (That command may not quite be right, but you get the idea.) – Peter Cordes Mar 10 '18 at 10:42
  • @PeterCordes I saw your comment and added the details above. Please check it again. Thanks. – nickeys Mar 11 '18 at 10:25
  • Have you tried measuring with `perf stat -e LLC-load-misses` to see if it gives counts as high as what you're seeing with your own measurement code? For this or any other events? Reference cycles is easy to verify, because it has an exact correspondence to wall-clock time, so you know exactly how many cycles there should be. (especially if you use `rdtsc`). – Peter Cordes Mar 11 '18 at 10:38
  • @PeterCordes I didn't compare the two sets of counts. However, when measuring system-wide L3 cache misses, is such a high value normal? The values I listed above are sampled every second, and I find it hard to believe that so many L3 cache misses can occur in just one second. So I'm curious what the values are and when the PMC is updated. – nickeys Mar 11 '18 at 11:17
  • You haven't said anything about what's running on your computer. On my mostly-idle desktop, just running chromium (with a ton of tabs open) on a minimal KDE desktop, no 3D effects or crap going on, I get about 5k LLC-load-misses per second. With a youtube video playing (720p, Intel graphics upscaling to 1080p), I get about 350k LLC-load-misses per second according to `sudo perf stat --per-core --all -e LLC-load-misses -I 1000`. – Peter Cordes Mar 11 '18 at 11:25
  • @PeterCordes Ah, I'm sorry. I was running a microbenchmark that creates 4 threads (matching the number of physical cores), pins each thread to a physical core, and has each thread access its own data (a huge int array). The benchmark is designed to thrash the L3 cache. Combining your estimate with my results above, my measured values seem to roughly match the number of L3 cache misses. Thanks a lot! – nickeys Mar 11 '18 at 12:24
  • Yeah, 69 million L3 misses per second seems totally reasonable for a workload that thrashes the cache. That's about 1 per 58 clock cycles (with a 4GHz CPU). Memory operations are pipelined, with each core being able to keep maybe 10 outstanding requests for different cache lines in flight. (So throughput = max_concurrency / latency if you don't hit a DRAM throughput bottleneck.) See https://stackoverflow.com/questions/39260020/why-is-skylake-so-much-better-than-broadwell-e-for-single-threaded-memory-throug. – Peter Cordes Mar 11 '18 at 12:59
