I have been trying a few experiments on x86 - namely the effect of mfence
on store/load latencies, etc.
Here is what I have started with:
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#define ARRAY_SIZE 10
#define DUMMY_LOOP_CNT 1000000
int main()
{
char array[ARRAY_SIZE];
for (int i =0; i< ARRAY_SIZE; i++)
array[i] = 'x'; //This is to force the OS to give allocate the array
asm volatile ("mfence\n");
for (int i=0;i<DUMMY_LOOP_CNT;i++); //A dummy loop to just warmup the processor
struct result_tuple{
uint64_t tsp_start;
uint64_t tsp_end;
int offset;
};
struct result_tuple* results = calloc(ARRAY_SIZE , sizeof (struct result_tuple));
for (int i = 0; i< ARRAY_SIZE; i++)
{
uint64_t *tsp_start,*tsp_end;
tsp_start = &results[i].tsp_start;
tsp_end = &results[i].tsp_end;
results[i].offset = i;
asm volatile (
"mfence\n"
"rdtscp\n"
"mov %%rdx,%[arg]\n"
"shl $32,%[arg]\n"
"or %%rax,%[arg]\n"
:[arg]"=&r"(*tsp_start)
::"rax","rdx","rcx","memory"
);
array[i] = 'y'; //A simple store
asm volatile (
"mfence\n"
"rdtscp\n"
"mov %%rdx,%[arg]\n"
"shl $32,%[arg]\n"
"or %%rax,%[arg]\n"
:[arg]"=&r"(*tsp_end)
::"rax","rdx","rcx","memory"
);
}
printf("Offset\tLatency\n");
for (int i=0;i<ARRAY_SIZE;i++)
{
printf("%d\t%lu\n",results[i].offset,results[i].tsp_end - results[i].tsp_start);
}
free (results);
}
I compile quite simply with gcc microbenchmark.c -o microbenchmark
My system configuration is as follows:
CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)
My issue is this:
- In a single run, all the latencies are similar
- When repeating the experiment over and over, I don't get results similar to the previous run!
For instance:
In run 1 I get:
Offset Latency
1 275
2 262
3 262
4 262
5 275
...
252 275
253 275
254 262
255 262
In another run I get:
Offset Latency
1 75
2 75
3 75
4 72
5 72
...
251 72
252 72
253 75
254 75
255 72
This is pretty surprising (The among-run variation is pretty high, whereas there is negligible within-run variation)! I am not sure how to explain this. What is the issue with my microbenchmark?
Note: I do understand that a normal store would be a write allocate store.. Technically making my measurement that of a load (rather than a store). Also, mfence
should flush the store buffer, thereby ensuring that no stores are 'delayed'.