
I have been trying a few experiments on x86 - namely the effect of mfence on store/load latencies, etc.

Here is what I have started with:

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

#define ARRAY_SIZE 256
#define DUMMY_LOOP_CNT 1000000
int main()
{
    char array[ARRAY_SIZE];

    for (int i = 0; i < ARRAY_SIZE; i++)
        array[i] = 'x'; //This is to force the OS to actually allocate the array

    asm volatile ("mfence\n");
    for (int i=0;i<DUMMY_LOOP_CNT;i++); //A dummy loop just to warm up the processor

    struct result_tuple{
        uint64_t tsp_start;
        uint64_t tsp_end;
        int offset;
        };

    struct result_tuple* results = calloc(ARRAY_SIZE , sizeof (struct result_tuple));
    for (int i = 0; i< ARRAY_SIZE; i++)
    {
        uint64_t *tsp_start,*tsp_end;
        tsp_start = &results[i].tsp_start;
        tsp_end = &results[i].tsp_end;
        results[i].offset = i;

        
        asm volatile (
        "mfence\n"
        "rdtscp\n"
        "mov %%rdx,%[arg]\n"
        "shl $32,%[arg]\n"
        "or %%rax,%[arg]\n"
        :[arg]"=&r"(*tsp_start)
        ::"rax","rdx","rcx","memory"
        );

        array[i] = 'y'; //A simple store

        asm volatile (
        "mfence\n"
        "rdtscp\n"
        "mov %%rdx,%[arg]\n"
        "shl $32,%[arg]\n"
        "or %%rax,%[arg]\n"
        :[arg]"=&r"(*tsp_end)
        ::"rax","rdx","rcx","memory"
        );
    }

    
    printf("Offset\tLatency\n");
    for (int i=0;i<ARRAY_SIZE;i++)
    {
        printf("%d\t%lu\n",results[i].offset,results[i].tsp_end - results[i].tsp_start);
    }
    free (results);
}   

I compile quite simply with `gcc microbenchmark.c -o microbenchmark`. My system configuration is as follows:

CPU : Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz
Operating system : GNU/Linux (Linux 5.4.80-2)

My issue is this:

  • In a single run, all the latencies are similar
  • When repeating the experiment over and over, I don't get results similar to the previous run!

For instance:

In run 1 I get:
Offset  Latency
1   275
2   262
3   262
4   262
5   275
...
252 275
253 275
254 262
255 262

In another run I get:
Offset  Latency
1   75
2   75
3   75
4   72
5   72
...
251 72
252 72
253 75
254 75
255 72

This is pretty surprising (the between-run variation is quite high, whereas the within-run variation is negligible)! I am not sure how to explain this. What is the issue with my microbenchmark?

Note: I do understand that a normal store would be a write-allocate store, technically making my measurement that of a load (rather than a store). Also, mfence should flush the store buffer, thereby ensuring that no stores are 'delayed'.

Suraaj K S
  • Your warm-up dummy loop only does 1 million iterations, ~6 million clock cycles in a `-O0` debug build - probably not long enough to get the CPU up to max turbo, on a CPU before Skylake's hardware P-state management. ([Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987)) RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) – Peter Cordes Dec 03 '20 at 15:41

1 Answer


Your warm-up dummy loop only does 1 million iterations, ~6 million clock cycles in a -O0 debug build - probably not long enough to get the CPU up to max turbo, on a CPU before Skylake's hardware P-state management. (Idiomatic way of performance evaluation?)

RDTSCP counts fixed-frequency reference cycles, not core clock cycles. Your runs are so short that all the run-to-run variation is probably explained by the CPU frequency being low or high. See How to get the CPU cycle count in x86_64 from C++?
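
To give a rough sense of scale, one way to warm up is to spin on the TSC for a fixed interval rather than for a fixed iteration count, so the core has time to ramp to max turbo before the timed region. This is only a sketch; the spin length is an arbitrary illustrative value, not something tuned for this machine:

#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() with gcc */

/* Busy-wait for roughly `ticks` reference cycles so the core reaches
   its highest frequency before any timing is done. */
static void warm_up(uint64_t ticks)
{
    uint64_t start = __rdtsc();
    while (__rdtsc() - start < ticks)
        ;  /* spin */
}

/* e.g. warm_up(1000000000ULL);  roughly a quarter second at a ~3.6 GHz TSC */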


Also, this debug (-O0) build will do extra stores and reloads inside your timed region, but "fortunately" the results[i].offset = i; store plus the mfence before the first rdtscp ensures the result array is also hot in cache before entering the timed region.

Your array is tiny, and you're only doing 1-byte stores (so 64 stores are all in the same cache line.) It's very likely still in MESI Modified state from when you initialized it, so I wouldn't expect an RFO on any of the array[i] = 'y' stores. That already happened for the few lines of stack memory involved before your timed loop. If you want to pre-fault the array without also getting it cached, maybe touch one line per 4k page and leave the other lines untouched. But HW prefetch will get ahead of your stores, especially if you only store 1 byte at a time with 2 slow mfences per store, so again the waiting for off-core memory requests will be outside the timed region. You should expect data to already be in L1d cache or at least L2 in Exclusive state, ready to be flipped to Modified on a store.
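
As a sketch of that pre-faulting idea (assuming a much larger array and 4 KiB pages; the page size constant and the fill byte here are just illustrative), you could touch one byte per page so most cache lines stay cold:

#define PAGE_SIZE 4096  /* assumed; sysconf(_SC_PAGESIZE) is the portable way to ask */

/* Touch one byte per page so the kernel backs the whole array with real
   memory, while leaving the other 63 lines of each page untouched. */
for (size_t off = 0; off < ARRAY_SIZE; off += PAGE_SIZE)
    array[off] = 'x';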

BTW, having an offset member seems pointless; it can be implicit from the array index, e.g. print i instead of results[i].offset. It's also not very useful to store both start and stop absolute TSC values. You could just store a 32-bit difference; then you wouldn't need to shift / OR in your inline asm, just declare a clobber on the unused EDX output.
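
For instance, a trimmed-down version of the timing asm along those lines might look like the following (the delta field is hypothetical, standing in for the original tsp_start/tsp_end pair; keeping only the low 32 bits of the TSC is fine for intervals of a few hundred cycles):

uint32_t t0, t1;

asm volatile ("mfence\n\t"
              "rdtscp"
              : "=a"(t0)                 /* low 32 bits of the TSC in EAX */
              :
              : "rdx", "rcx", "memory"); /* high half and IA32_TSC_AUX unused */

array[i] = 'y';                          /* the timed store */

asm volatile ("mfence\n\t"
              "rdtscp"
              : "=a"(t1)
              :
              : "rdx", "rcx", "memory");

results[i].delta = t1 - t0;              /* hypothetical 32-bit difference field */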

Also note that "store latency" typically only matters for performance in real code when mfence is involved. Otherwise the important thing is store->load forwarding, which can happen from the store buffer before the store commits to L1d cache. That's about 6 cycles, or sometimes lower if the reload isn't attempted right away. (It's variable on Sandybridge-family.)

Peter Cordes
  • The reason for my offset member is as follows: in the future, maybe I'd want to access the array in a randomised order... The offset member will help then – Suraaj K S Dec 03 '20 at 17:48
  • Also, how does one know the RDTSC tick frequency? A read from some MSR? – Suraaj K S Dec 03 '20 at 17:50
  • @SuraajKS: Oh, I see, it's the `array[]` offset, not the result index. Strange that it's only an `int`, not `size_t`. re: TSC frequency: unfortunately it's very hard to get the CPU to tell you what the reference frequency is. [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) has a link about that, [Determine TSC frequency on Linux](https://stackoverflow.com/q/51919219). – Peter Cordes Dec 03 '20 at 17:53
  • Also [Getting TSC rate in x86 kernel](https://stackoverflow.com/q/35123379) - the easiest way is to use the value your Linux kernel detected on boot, from `dmesg`. e.g. on my i7-6700k - `tsc: Refined TSC clocksource calibration: 4008.000 MHz` – Peter Cordes Dec 03 '20 at 17:57