I am trying to measure the time needed to read an element, to tell whether the read was a cache hit or a cache miss. To keep the reads in order I use the _mm_lfence() function. I got unexpected results, and after checking I saw that the overhead of lfence is not deterministic. So I run a program that measures this overhead in a loop of, for example, 100 000 iterations. I get results of more than 1000 clock cycles for one iteration, and the next time it's 200. What can be the reason for such a difference between lfence overheads, and if it is so unreliable, how can I judge the latency of cache hits and cache misses correctly? I was trying to use the same approach as in this post: Memory latency measurement with time stamp counter

The code that gives unreliable results is this:

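for (int i = 0; i < arr_size; i++) {
    _mm_mfence();        // wait for earlier loads and stores to complete
    _mm_lfence();        // keep rdtsc from executing early
    t1 = __rdtsc();      // start timestamp
    _mm_lfence();
    // (empty timed region: only the fence overhead is measured)
    _mm_lfence();
    t2 = __rdtsc();      // end timestamp
    _mm_lfence();

    arr[i] = t2 - t1;    // elapsed TSC ticks for this iteration
}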

The values in arr fall into widely different ranges; arr_size is 100 000.

Peter Cordes
Ana Khorguani
  • Is your code the only thing running on your system? Interrupts and task swaps can all cause unexpected differences in the clock – Tom Tanner Feb 04 '19 at 17:30
  • No it's not but I am counting ru_nvcsw, and ru_nivcsw that tell how often did the context switch happen, it's equal to 4. Should I also check interrupts for this process? Or how can I do that? – Ana Khorguani Feb 04 '19 at 17:48
  • If you are looking for other code samples for doing this, check github for spectre/meltdown/l1tf PoCs since access time measurement is a necessary component for those. – ruthafjord Feb 04 '19 at 21:19

1 Answer

I get results of more than 1000 clock cycle for one iteration and next time it's 200.

Sounds like your CPU ramped up from idle to normal clock speed after the first few iterations.

Remember that RDTSC counts reference cycles (fixed frequency, equal or close to the max non-turbo frequency of the CPU), not core clock cycles, which vary with idle and turbo states. Older CPUs had RDTSC count core clock cycles, but for years now CPU vendors have used a fixed RDTSC frequency, which makes it useful for clock_gettime(), and have advertised this fact with the invariant_tsc CPUID feature bit. See also "Get CPU cycle count?"
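
As a rough illustration of checking that feature bit, here is a minimal sketch. It assumes GCC or Clang on x86 (for the `<cpuid.h>` helper), and the function name `has_invariant_tsc` is made up for this example. The invariant TSC flag is reported in CPUID leaf 0x80000007, EDX bit 8.

```c
#include <cpuid.h>   /* GCC/Clang wrapper around the CPUID instruction */

/* Hypothetical helper: returns 1 if the CPU advertises an invariant TSC,
 * 0 if it does not, and -1 if the extended CPUID leaf is unavailable. */
int has_invariant_tsc(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* Leaf 0x80000007 carries the advanced power management flags.
     * __get_cpuid returns 0 when the leaf is not supported. */
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return -1;

    /* EDX bit 8 is the invariant TSC flag. */
    return (int)((edx >> 8) & 1u);
}
```

If this returns 1, differences between two `__rdtsc()` values are proportional to wall-clock time, so they only track core clock cycles while the core runs at the TSC's reference frequency.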

If you really want to use RDTSC instead of performance counters, disable turbo and use a warm-up loop to get your CPU to its max frequency.
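
A warm-up followed by the same fence pattern as in the question could look roughly like the sketch below. It assumes GCC or Clang on x86, and the function names and the tick budget are invented for the example. Keeping the minimum over many samples is an extra step not discussed above: it is one common way to filter out samples inflated by interrupts or frequency changes.

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, _mm_lfence, _mm_mfence (GCC/Clang) */

/* Busy-spin for about `ticks` TSC ticks so the core has time to leave
 * its idle clock speed before any measurement starts. */
void warm_up(uint64_t ticks)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks)
        ;   /* deliberately empty: the spinning itself is the warm-up */
}

/* One sample of the empty timed region, in TSC reference ticks. */
static inline uint64_t sample_overhead(void)
{
    _mm_mfence();
    _mm_lfence();
    uint64_t t1 = __rdtsc();
    _mm_lfence();
    /* the load under test would go here */
    _mm_lfence();
    uint64_t t2 = __rdtsc();
    _mm_lfence();
    return t2 - t1;
}

/* Smallest overhead seen over `samples` runs (assumption: the minimum
 * is the least disturbed sample). */
uint64_t min_overhead(int samples)
{
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < samples; i++) {
        uint64_t d = sample_overhead();
        if (d < best)
            best = d;
    }
    return best;
}
```

A caller would run something like `warm_up(1000000000)` once, then subtract `min_overhead(100000)` from later measurements of a real load. How long the warm-up needs to be depends on the machine and its power settings.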


There are libraries that let you program the HW performance counters, and set permissions so you can run rdpmc in user-space. This actually has lower overhead than rdtsc. See "What will be the exact code to get count of last level cache misses on Intel Kaby Lake architecture" for a summary of ways to access perf counters in user-space.

I also found a paper about adding user-space rdpmc support to Linux perf (PAPI): ftp://ftp.cs.uoregon.edu/pub/malony/ESPT/Papers/espt-paper-1.pdf. IDK if that made it into mainline kernel/perf code or not.
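
For a feel of what counting core clock cycles with a hardware counter looks like on Linux, here is a minimal sketch using the `perf_event_open` system call directly. Note that this is not the user-space `rdpmc` path: it reads the counter with a `read()` system call, so the overhead is far too high to time a single load and it only suits larger regions. The names `busy_work` and `cycles_for` are hypothetical, and the call can fail (returning -1 here) when the kernel restricts counter access, for example through the `kernel.perf_event_paranoid` setting or inside a virtual machine.

```c
#define _GNU_SOURCE
#include <linux/perf_event.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical workload; replace with the code under test. */
void busy_work(void)
{
    volatile unsigned int sink = 0;
    for (unsigned int i = 0; i < 100000; i++)
        sink += i;
}

/* Returns the user-space core clock cycles spent in fn(),
 * or -1 if the counter could not be opened or read. */
long long cycles_for(void (*fn)(void))
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_CPU_CYCLES;  /* core cycles, not TSC ticks */
    attr.disabled = 1;                       /* start stopped */
    attr.exclude_kernel = 1;                 /* count user space only */
    attr.exclude_hv = 1;

    /* pid = 0, cpu = -1: this thread on any CPU; no group, no flags. */
    int fd = (int)syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0UL);
    if (fd < 0)
        return -1;

    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    fn();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count = 0;
    ssize_t n = read(fd, &count, sizeof(count));
    close(fd);
    return n == (ssize_t)sizeof(count) ? (long long)count : -1;
}
```

Calling `cycles_for(busy_work)` gives a count that does not depend on the current clock frequency, unlike a TSC difference.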

Peter Cordes
  • Hello, so I read about RDTSC and I think I got the point of this: "It ticks at the CPU's rated frequency". So the problem is that it ticks at the advertised frequency, independent of the CPU's actual frequency at that moment. Then this should mean that the problem is not in the lfence function, which takes the same time, but that the time is not measured correctly. – Ana Khorguani Feb 06 '19 at 13:16
  • Well honestly I don't have strong preferences to one concrete way of doing things, I just want the result to be correct :) about "Performance Monitoring Events" if I am not wrong, with it I can read number of cache loads for example as I saw in the referenced link. But this number will give me the entire count for the process, maybe not exactly number of concrete cache miss and cache hit I am interested in. Or can I configure it so I can count the cache miss or hit, for a specific read? – Ana Khorguani Feb 06 '19 at 13:21
  • @AnaKhorguani: you can put `rdpmc` before and after a block or single instruction, and subtract. (You might need `lfence` to prevent reordering, I'm not sure. I don't think it's serializing on the instruction stream.) And BTW, `rdtsc` does measure time correctly. It's just that time isn't what you want, it's core clock cycles. (Because the `lfence` + `rdtsc` measurement overhead is a constant amount of cycles, not nanoseconds.) – Peter Cordes Feb 06 '19 at 13:47
  • perf definitely supports user-space rdpmc based counter queries. This is the easiest way to get rdpmc working on Linux. – BeeOnRope Feb 07 '19 at 03:27
  • I am trying to test PAPI library which I hope I correctly installed. In a separate file I have my .c program and I include this #include header. I think it should not be a problem, as I am able to define PAPI_event_info_t type variable. However when I try to compile code with statement: PAPI_library_init I get error: (.text+0x41): undefined reference to `PAPI_library_init'. I am trying to figure out what the problem is and googling a lot but any chance you could suggest how can I compile program with PAPI library functions? – Ana Khorguani Feb 07 '19 at 19:39
  • should I compile with additional flags? – Ana Khorguani Feb 07 '19 at 20:02
  • @AnaKhorguani: yes, obviously you need the right `-l` library option. I googled, too, but didn't find what library actually contains the perf functions, though. – Peter Cordes Feb 07 '19 at 23:36
  • I found -lpapi, and after adding the library to the path it worked. I am able to compile and run the program with papi functions, but I have another problem now. When I try this: if ((num_hwcntrs = PAPI_num_counters()) <= PAPI_OK) printf("This system has %d available counters. \n", num_hwcntrs); I get that the system has 0 available counters. Not sure what the problem might be. I thought I would post it as a new question – Ana Khorguani Feb 08 '19 at 18:05
  • @PeterCordes after trying things I found that Number Hardware Counters on my laptop is 0. Is there any hope left for me to use papi or that's a dead end? – Ana Khorguani Feb 08 '19 at 22:10
  • @AnaKhorguani: Have you tried `perf stat -d ./a.out` to see if it can use any HW counters? If it works for cycles/instructions/etc, maybe you're using the library wrong. – Peter Cordes Feb 09 '19 at 02:48
  • @PeterCordes Thank you, your comment for the other post helped. Now I can test more with papi and hopefully I will be able to get more precise results for CPU cache misses and hits. – Ana Khorguani Feb 09 '19 at 20:31