I am trying to determine time needed to read an element to make sure it's a cache hit or a cache miss. for reading to be in order I use _mm_lfence() function. I got unexpected results and after checking I saw that lfence function's overhead is not deterministic. So I am executing the program that measures this overhead in a loop of for example 100 000 iteration. I get results of more than 1000 clock cycle for one iteration and next time it's 200. What can be a reason of such difference between lfence function overheads and if it is so unreliable how can I judge latency of cache hits and cache misses correctly? I was trying to use same approach as in this post: Memory latency measurement with time stamp counter
the code that gives unreliable results is this:
for(int i=0; i < arr_size; i++){
_mm_mfence();
_mm_lfence();
t1 = __rdtsc();
_mm_lfence();
_mm_lfence();
t2 = __rdtsc();
_mm_lfence();
arr[i] = t2-t1;
}
the values in arr vary in different ranges, arr_size is 100 000.