
I would like to implement a two-thread model where one thread counts (infinitely increments a value) and the other records the first counter value, does the job, records the second counter value, and measures the time elapsed between the two readings.

Here is what I have done so far:

// global counter, pinned to a register
register unsigned long counter asm("r13");
// unsigned long counter;

void *counter_thread(void *arg){
    // affinity is set to some isolated CPU so the noise will be minimal

    while(1){
        //counter++; // Line 1*
        asm volatile("add $1, %0" : "+r"(counter)); // Line 2*
    }
}

void *measurement_thread(void *arg){
    // affinity is set somewhere over here
    unsigned long meas = 0;
    unsigned long a = 5;
    unsigned long r1, r2;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a *= 3; // dummy operation that I want to measure
        r2 = counter;
        mfence();
        meas = r2 - r1;
        printf("counter:%lu\n", counter);
        break;
    }
    return NULL;
}
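The threads are created and pinned roughly like this (a sketch, assuming pthreads and glibc's pthread_setaffinity_np extension; the core numbers are the isolated ones from EDIT2 below):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

static void pin_to_core(pthread_t t, int core){
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t, sizeof(set), &set); // glibc extension
}

int main(void){
    pthread_t cnt, meas;
    pthread_create(&cnt, NULL, counter_thread, NULL);
    pthread_create(&meas, NULL, measurement_thread, NULL);
    pin_to_core(cnt, 10);  // isolated core (see EDIT2)
    pin_to_core(meas, 11); // neighboring isolated core
    pthread_join(meas, NULL);
    return 0;
}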

Let me explain what I have done so far:

Since I want the counter to be accurate, I am setting the affinity to an isolated CPU. Also, if I use counter++ as in Line 1*, the disassembled function will be:

 d4c:   4c 89 e8                mov    %r13,%rax
 d4f:   48 83 c0 01             add    $0x1,%rax
 d53:   49 89 c5                mov    %rax,%r13
 d56:   eb f4                   jmp    d4c <counter_thread+0x37>

This is not a 1-cycle operation, which is why I used inline assembly to get rid of the two mov instructions. Using the inline assembly:

 d4c:   49 83 c5 01             add    $0x1,%r13
 d50:   eb fa                   jmp    d4c <counter_thread+0x37>

But the thing is, neither implementation works: the other thread never sees the counter being updated. If I make the global counter a normal variable instead of a register, it works, but I lose the precision I want. If I declare the global counter as unsigned long counter, then the disassembled code of the counter thread is:

 d4c:   48 8b 05 ed 12 20 00    mov    0x2012ed(%rip),%rax        # 202040 <counter>
 d53:   48 83 c0 01             add    $0x1,%rax
 d57:   48 89 05 e2 12 20 00    mov    %rax,0x2012e2(%rip)        # 202040 <counter>
 d5e:   eb ec                   jmp    d4c <counter_thread+0x37>

It works but it doesn't give me the granularity that I want.

EDIT:

My environment:

  • CPU: AMD Ryzen 3600
  • kernel: 5.0.0-32-generic
  • OS: Ubuntu 18.04

EDIT2: I have isolated 2 neighboring CPU cores (cores 10 and 11) and I am running the experiment on those cores: the counter is on one core, the measurement on the other. Isolation is done by editing the /etc/default/grub file and adding an isolcpus line, as shown below.
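For reference, the relevant line looks something like this (a sketch; the other flags are whatever the system already had):

# /etc/default/grub, followed by sudo update-grub and a reboot
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash isolcpus=10,11"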

EDIT3: I know that one measurement is not enough. I have run the experiment 10 million times and looked at the results.

Experiment1: Setup:

unsigned long counter = 0; // global counter
// measurements[] is a global histogram of the r2-r1 deltas; MILLION_ITER is the 10 million iterations

void *counter_thread(void *arg){
    mfence();
    while(1)
        counter++;
}
void *measurement_thread(void *arg){
    unsigned long i = 0, r1 = 0, r2 = 0;
    unsigned int a = 0;
    sleep(1.0);
    while(1){
        mfence();
        r1 = counter;
        a += 3;
        r2 = counter;
        mfence();
        measurements[r2-r1]++;
        i++;
        if(i == MILLION_ITER)
            break;
    }
    return NULL;
}

Results1: In 99.99% of the iterations I got 0, which I expect: either the first thread is not running, or the OS or other interrupts disturb the measurement. Discarding the 0s and the very high values gives me 20 cycles of measurement on average. (I was expecting 3-4 because I only do an integer addition.)
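The filtering looks roughly like this (a sketch; CUTOFF is an arbitrary bound I picked for the "very high values"):

// measurements[d] counts how many iterations saw r2-r1 == d
uint64_t total = 0, count = 0;
for(unsigned d = 1; d < CUTOFF; d++){ // skip d == 0 and the outliers above CUTOFF
    total += (uint64_t)d * measurements[d];
    count += measurements[d];
}
printf("average delta, outliers removed: %lu\n",
       (unsigned long)(count ? total / count : 0));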

Experiment2:

Setup: Identical to the above; the one difference is that instead of a plain global counter, I declare the counter as a register global:

register unsigned long counter asm("r13");

Results2: The measurement thread always reads 0. In the disassembled code, I can see that both threads are dealing with the R13 register (counter); however, it is apparently not shared between them.

Experiment3:

Setup: Identical to setup 2, except that in the counter thread, instead of doing counter++, I use inline assembly to make sure the increment is a single 1-cycle instruction. My disassembled file looks like this:

 cd1:   49 83 c5 01             add    $0x1,%r13
 cd5:   eb fa                   jmp    cd1 <counter_thread+0x37>

Results3: Measurement thread reads 0 as above.

tzq71871
  • I wouldn't try to use a register to share data between two threads. Maybe the context switch clears or saves and restores the registers. Maybe the counting thread does not run while your other thread is active. If you want to do time measurements, use a system timer function (systick) or access some timer hardware. – Bodo Nov 11 '19 at 13:39
  • I believe `systick` is for ARM, and my machine is AMD. x86 has `rdtscp` for reading the time stamp counter, however, AMD does not give cycle accuracy. My machine always gives me multiple of 36 cycles. I need to measure 2-4 cycle resolution, so the instruction is useless for me. Another problem is, measurement thread sees counter as zero. I am not talking about the difference(r2-r1). If I use register, meas.thread always sees 0 as the counter value – tzq71871 Nov 11 '19 at 13:45
  • I'm confused. You were expecting the counting CPU's r13 register value to be mirrored to the other CPU's r13 register? – Ian Abbott Nov 11 '19 at 13:53
  • If you want specific answers, write details about your environment (CPU, OS). The term "systick" was meant as a generic term or example. Does your OS guarantee that your counting thread will be running all the time? Even if it is not suspended for the execution of other threads it may not count evenly if interrupts occur. Please tell more details about what you want to measure. If the timer resolution is not good enough, you can measure the time to run if e.g. 100 times. Please [edit] your question and add all information there instead of answering in comments. – Bodo Nov 11 '19 at 13:54
  • `sleep(1.0)` looks fishy. Do you get sleep(0) or sleep(1) from this? – Lundin Nov 11 '19 at 14:18
  • @Lundin: the prototype for [`sleep(unsigned int)`](http://man7.org/linux/man-pages/man3/sleep.3.html) will coerce the double `1.0` to integer `1`. Very fishy and misleading, but not immediately a problem. – Peter Cordes Nov 11 '19 at 14:21
  • @IanAbbott, I want them both to use a register for counting, instead of going to the stack/cache and pick that value from there. – tzq71871 Nov 11 '19 at 14:22
  • @tzq71871 Yes, but each CPU core has its own set of registers. So the r13 that the "counting" CPU writes is not the same as the r13 that the "measurement" CPU reads. – Ian Abbott Nov 11 '19 at 15:33

3 Answers


Each thread has its own registers. Each logical CPU core has its own architectural registers which a thread uses when running on a core. Only signal handlers (or on bare metal, interrupts) can modify the registers of their thread.

Declaring a GNU C asm register-global like your ... asm("r13") in a multi-threaded program effectively gives you thread-local storage, not a truly shared global.

Only memory is shared between threads, not registers. This is how multiple threads can run at the same time without stepping on each other, each using their registers.

Registers that you don't declare as register-global can be used freely by the compiler, so sharing them between cores couldn't work at all. (And there is no mechanism GCC could use to make some registers shared between cores and others private, however you declare them.)

Even apart from that, the register global isn't volatile or atomic, so r1 = counter; and r2 = counter; can be CSEd (common-subexpression eliminated), making r2-r1 a compile-time-constant zero even if your local R13 were changing from a signal handler.
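If a shared counter is really wanted, the memory-based version can at least be made correct with C11 atomics, so the compiler cannot CSE the two reads; a minimal sketch (my addition, not part of the original answer):

#include <stdatomic.h>

_Atomic unsigned long counter;   // lives in memory, visible to both threads

void *counter_thread(void *arg){
    for(;;)  // relaxed is enough: we need the value, not ordering guarantees
        atomic_fetch_add_explicit(&counter, 1, memory_order_relaxed);
}

// in the measurement thread, each load really re-reads memory:
// unsigned long r1 = atomic_load_explicit(&counter, memory_order_relaxed);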


How can I make sure that both of the threads are using registers for read/write operation of the counter value?

You can't do that. There is no shared state between cores that can be read/written with lower latency than cache.

If you want to time something, consider using rdtsc to get reference cycles, or rdpmc to read a performance counter (which you might have set up to be counting core clock cycles).

Using another thread to increment a counter is unnecessary, and not helpful because there's no very-low-overhead way to read something from another core.


The rdtscp instruction in my machine gives 36-72-108... cycle resolution at best. So, I cannot distinguish the difference between 2 cycles and 35 cycles because both of them will give 36 cycles.

Then you're using rdtsc wrong. It's not serializing so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.
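A minimal sketch of "lfence around the timed region", using GCC's x86intrin.h intrinsics:

#include <stdint.h>
#include <x86intrin.h>

static inline uint64_t fenced_rdtsc(void){
    _mm_lfence();               // wait until earlier instructions have executed
    uint64_t t = __rdtsc();
    _mm_lfence();               // keep later instructions from starting before the read
    return t;
}
// usage: t0 = fenced_rdtsc(); /* timed region */ t1 = fenced_rdtsc();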

But more importantly, you can't usefully measure a *=3; in C in terms of a single cost in cycles. First of all, it can compile differently depending on context.

But assuming a normal lea eax, [rax + rax*2], a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/

See my answer on RDTSCP in NASM always returns the same value for more about timing a single instruction. Put it in a loop in different ways to measure throughput vs. latency, and look at perf counters to get uops->ports. Or look at Agner Fog's instruction tables and https://uops.info/, because people have already done those tests.
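For illustration, the two loop shapes (a sketch of mine, assuming the lea form above; total cycles would come from rdtsc or a perf counter around the call):

#include <stdint.h>

// Latency: each lea consumes its previous result, so iterations serialize.
// total_cycles / n approximates the latency of one lea.
static uint64_t lea_latency(uint64_t x, long n){
    for(long i = 0; i < n; i++)
        asm volatile("lea (%0,%0,2), %0" : "+r"(x));
    return x;
}

// Throughput: three independent leas per iteration can run in parallel.
// total_cycles / (3*n) approximates the reciprocal throughput.
static uint64_t lea_throughput(uint64_t x, long n){
    uint64_t a = x, b = x, c = x;
    for(long i = 0; i < n; i++){
        asm volatile("lea (%1,%1,2), %0" : "=r"(a) : "r"(x));
        asm volatile("lea (%1,%1,2), %0" : "=r"(b) : "r"(x));
        asm volatile("lea (%1,%1,2), %0" : "=r"(c) : "r"(x));
    }
    return a + b + c;
}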

Again, these are ways to time a single asm instruction, not a C statement. With optimization enabled, the cost of a C statement can depend on how it optimizes into the surrounding code. (And/or whether the latency of surrounding operations hides its cost, on an out-of-order execution CPU like all modern x86 CPUs.)

Peter Cordes
  • How can I make sure that both of the threads are using registers for read/write operation of the counter value? They don't need to be the same(r13). I just want to decrease the time for going to the cache/memory – tzq71871 Nov 11 '19 at 14:23
  • @tzq71871: You can't. If you want to time something, consider using `rdtsc` to get reference cycles, or `rdpmc` to read a performance counter (which you might have set up to be counting core clock cycles). [How to get the CPU cycle count in x86\_64 from C++?](//stackoverflow.com/q/13772567) – Peter Cordes Nov 11 '19 at 14:26
  • The `rdtscp` instruction in my machine gives 36-72-108... cycle resolution at best. So, I cannot distinguish the difference between 2 cycles and 35 cycles because both of them will give 36 cycles. I will look into performance monitor counters and try to see whether there is a pmc that I can use – tzq71871 Nov 11 '19 at 14:30
  • @tzq71871: `rdpmc` is somewhat lower overhead. But if you carefully use `lfence` around your timed region and around `rdtsc`, you can subtract the constant overhead. But note that you can't usefully measure `a *=3;` in C in terms of a single cost in cycles. First of all, it can compile differently depending on context. But assuming a normal `lea eax, [rax + rax*2]`, a realistic instruction cost model has 3 dimensions: uop count (front end), back-end port pressure, and latency from input(s) to output. https://agner.org/optimize/ – Peter Cordes Nov 11 '19 at 14:39
  • I post an answer below. – tzq71871 Nov 12 '19 at 14:56

Then you're using rdtsc wrong. It's not serializing so you need lfence around the timed region. See my answer on How to get the CPU cycle count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is only somewhat lower overhead.

Ok. I did my homework.

First things first: I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; note the P at the end.

I have checked both Intel and AMD manuals for that.

Correct me if I am wrong, but from what I read, I understand that I don't need fence instructions before and after rdtscp because it is a serializing instruction, right?

Second, I ran some experiments on 3 of my machines. Here are the results.

Ryzen experiments

======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iteration
Using a *=3
Note that almost all sums are divisible by 36, which is my machine's timer resolution.
I also checked the cases where the sums are not divisible by 36:
those are the runs where I don't use fence instructions with rdtsc.
It turns out that the read value is then either 35 or 1,
which I believe means rdtsc cannot read the value correctly.

Mfenced rtdscP reads:
    Sum:            25884432
    Avg:            258
    Sum, removed outliers:  25800120
    Avg, removed outliers:  258
Mfenced rtdsc reads:
    Sum:            17579196
    Avg:            175
    Sum, removed outliers:  17577684
    Avg, removed outliers:  175
Lfenced rtdscP reads:
    Sum:            7511688
    Avg:            75
    Sum, removed outliers:  7501608
    Avg, removed outliers:  75
Lfenced rtdsc reads:
    Sum:            7024428
    Avg:            70
    Sum, removed outliers:  7015248
    Avg, removed outliers:  70
NOT fenced rtdscP reads:
    Sum:            6024888
    Avg:            60
    Sum, removed outliers:  6024888
    Avg, removed outliers:  60
NOT fenced rtdsc reads:
    Sum:            3274866
    Avg:            32
    Sum, removed outliers:  3232913
    Avg, removed outliers:  35

======================================================
Using 3 dependent floating point divisions:

div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;

Mfenced rtdscP reads:
    Sum:            36217404
    Avg:            362
    Sum, removed outliers:  36097164
    Avg, removed outliers:  361
Mfenced rtdsc reads:
    Sum:            22973400
    Avg:            229
    Sum, removed outliers:  22939236
    Avg, removed outliers:  229
Lfenced rtdscP reads:
    Sum:            13178196
    Avg:            131
    Sum, removed outliers:  13177872
    Avg, removed outliers:  131
Lfenced rtdsc reads:
    Sum:            12631932
    Avg:            126
    Sum, removed outliers:  12631932
    Avg, removed outliers:  126
NOT fenced rtdscP reads:
    Sum:            12115548
    Avg:            121
    Sum, removed outliers:  12103236
    Avg, removed outliers:  121
NOT fenced rtdsc reads:
    Sum:            3335997
    Avg:            33
    Sum, removed outliers:  3305333
    Avg, removed outliers:  35

=================== END OF AMD RYZEN EXPERIMENTS =========================

And here is the bulldozer architecture experiments.

======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iteration
Using a *=3;

Mfenced rtdscP reads:
    Sum:            32120355
    Avg:            321
    Sum, removed outliers:  27718117
    Avg, removed outliers:  278
Mfenced rtdsc reads:
    Sum:            23739715
    Avg:            237
    Sum, removed outliers:  23013028
    Avg, removed outliers:  230
Lfenced rtdscP reads:
    Sum:            14274916
    Avg:            142
    Sum, removed outliers:  13026199
    Avg, removed outliers:  131
Lfenced rtdsc reads:
    Sum:            11083963
    Avg:            110
    Sum, removed outliers:  10905271
    Avg, removed outliers:  109
NOT fenced rtdscP reads:
    Sum:            9361738
    Avg:            93
    Sum, removed outliers:  8993886
    Avg, removed outliers:  90
NOT fenced rtdsc reads:
    Sum:            4766349
    Avg:            47
    Sum, removed outliers:  4310312
    Avg, removed outliers:  43


=================================================================
Using 3 dependent floating point divisions:

div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;

Mfenced rtdscP reads:
    Sum:            38748536
    Avg:            387
    Sum, removed outliers:  36719312
    Avg, removed outliers:  368
Mfenced rtdsc reads:
    Sum:            35106459
    Avg:            351
    Sum, removed outliers:  33514331
    Avg, removed outliers:  335
Lfenced rtdscP reads:
    Sum:            23867349
    Avg:            238
    Sum, removed outliers:  23203849
    Avg, removed outliers:  232
Lfenced rtdsc reads:
    Sum:            21991975
    Avg:            219
    Sum, removed outliers:  21394828
    Avg, removed outliers:  215
NOT fenced rtdscP reads:
    Sum:            19790942
    Avg:            197
    Sum, removed outliers:  19701909
    Avg, removed outliers:  197
NOT fenced rtdsc reads:
    Sum:            10841074
    Avg:            108
    Sum, removed outliers:  10583085
    Avg, removed outliers:  106

=================== END OF AMD BULLDOZER EXPERIMENTS =========================

Intel results are:

======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iteration

Using a *=3

Mfenced rtdscP reads:
    Sum:            10914893
    Avg:            109
    Sum, removed outliers:  10820879
    Avg, removed outliers:  108
Mfenced rtdsc reads:
    Sum:            7866322
    Avg:            78
    Sum, removed outliers:  7606613
    Avg, removed outliers:  76
Lfenced rtdscP reads:
    Sum:            4823705
    Avg:            48
    Sum, removed outliers:  4783842
    Avg, removed outliers:  47
Lfenced rtdsc reads:
    Sum:            3634106
    Avg:            36
    Sum, removed outliers:  3463079
    Avg, removed outliers:  34
NOT fenced rtdscP reads:
    Sum:            2216884
    Avg:            22
    Sum, removed outliers:  1435830
    Avg, removed outliers:  17
NOT fenced rtdsc reads:
    Sum:            1736640
    Avg:            17
    Sum, removed outliers:  986250
    Avg, removed outliers:  12

===================================================================
Using 3 dependent floating point divisions:

div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;

Mfenced rtdscP reads:
    Sum:            22008705
    Avg:            220
    Sum, removed outliers:  16097871
    Avg, removed outliers:  177
Mfenced rtdsc reads:
    Sum:            13086713
    Avg:            130
    Sum, removed outliers:  12627094
    Avg, removed outliers:  126
Lfenced rtdscP reads:
    Sum:            9882409
    Avg:            98
    Sum, removed outliers:  9753927
    Avg, removed outliers:  97
Lfenced rtdsc reads:
    Sum:            8854943
    Avg:            88
    Sum, removed outliers:  8435847
    Avg, removed outliers:  84
NOT fenced rtdscP reads:
    Sum:            7302577
    Avg:            73
    Sum, removed outliers:  7190424
    Avg, removed outliers:  71
NOT fenced rtdsc reads:
    Sum:            1726126
    Avg:            17
    Sum, removed outliers:  1029630
    Avg, removed outliers:  12

=================== END OF INTEL EXPERIMENTS =========================

From my point of view, the AMD Ryzen should have executed faster: my Intel CPU is almost 5 years old and the AMD CPU is brand new.

I couldn't find the exact source, but I have read that AMD changed/decreased the resolution of the rdtsc and rdtscp instructions while updating the architecture from Bulldozer to Ryzen. That is why I get multiples of 36 when I try to measure the timing of the code. I don't know why they did it or where I found the information, but that is the case. If you have an AMD Ryzen machine, I would suggest you run the experiments and look at the timer outputs.
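One way to sanity-check that 36-tick granularity (as suggested in the comments below) is to time a long chain of dependent add instructions, which run at about 1 core cycle each, and see how many TSC ticks each one costs; a sketch:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>

int main(void){
    // n dependent adds take ~1 core cycle each, so (t1-t0)/n
    // estimates TSC ticks per core clock cycle.
    uint64_t x = 0, n = 100000000;
    uint64_t t0 = __rdtsc();
    for(uint64_t i = 0; i < n; i++)
        asm volatile("add $1, %0" : "+r"(x));
    uint64_t t1 = __rdtsc();
    printf("TSC ticks per dependent add: %.3f\n", (double)(t1 - t0) / (double)n);
    return 0;
}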

I haven't looked at rdpmc yet; I'll try to update when I have read about it.

EDIT:

Following up on the comments below.

About warming up: all experiments are in a single C program, so even if the CPU is not warmed up during the mfenced rdtscp run (the first experiment), it certainly is warmed up for the later ones.

I am using C and inline assembly mixed. I just use gcc main.c -o main to compile the code, which AFAIK means -O0, i.e. no optimization. gcc is version 7.4.0.

To decrease overhead further, I declared my helpers as #define macros so that there is no function call in the timed region, which means faster execution.

Example code for how I did the experiments:

#define lfence() asm volatile("lfence" ::: "memory") // "memory" keeps the compiler from reordering accesses across the fence
#define mfence() asm volatile("mfence" ::: "memory")
// Reading the low 32 bits is enough for these measurements because the intervals are short.
// For longer measurements I would need to shift and OR in the high half (see the sketch after this code).
#define rdtscp(_readval) asm volatile("rdtscp" : "=a"(_readval) :: "rcx", "rdx")
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
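The shift-and-OR version mentioned in the macro comment would look roughly like this (a sketch; rdtscp64 is my own name for it):

#define rdtscp64(_v) do{ \
    uint32_t _lo, _hi; \
    asm volatile("rdtscp" : "=a"(_lo), "=d"(_hi) :: "rcx"); \
    (_v) = ((uint64_t)_hi << 32) | _lo; \
}while(0)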

EDIT2:

Why are you using mfence?

I wasn't using mfence in the first place. I was just doing rdtscp, the work, then rdtscp again to find the difference.

No idea what you're hoping to learn here by cycle-accurate timing of anti-optimized gcc -O0 output.

I am not using any optimization because I want to measure how many cycles an instruction takes to finish. I will be measuring code blocks that include branches. If I use optimization, the compiler might change a branch into a cmov, and that would defeat the whole point of the measurement.

I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.

Also, below is the disassembled version of the code. During the measurements there is no memory access (except for read1 and read2, which I believe are in the cache) and no call to other functions.

 9fd:   0f ae f0                mfence 
 a00:   0f 01 f9                rdtscp 
 a03:   48 89 05 36 16 20 00    mov    %rax,0x201636(%rip)        # 202040 <read1>
 a0a:   0f ae f0                mfence 
 a0d:   8b 05 15 16 20 00       mov    0x201615(%rip),%eax        # 202028 <a21>
 a13:   83 c0 03                add    $0x3,%eax #Either this or division operations for measurement
 a16:   89 05 0c 16 20 00       mov    %eax,0x20160c(%rip)        # 202028 <a21>
 a1c:   0f ae f0                mfence 
 a1f:   0f 01 f9                rdtscp 
 a22:   48 89 05 0f 16 20 00    mov    %rax,0x20160f(%rip)        # 202038 <read2>
 a29:   0f ae f0                mfence 
 a2c:   48 8b 15 05 16 20 00    mov    0x201605(%rip),%rdx        # 202038 <read2>
 a33:   48 8b 05 06 16 20 00    mov    0x201606(%rip),%rax        # 202040 <read1>
 a3a:   48 29 c2                sub    %rax,%rdx
 a3d:   8b 85 ec ca f3 ff       mov    -0xc3514(%rbp),%eax
tzq71871
  • Yes sorry, I misread `rdtsc` instead of `rdtscp`. But it's *not* a serializing instruction. It's only defined as one-way barrier. (Also, "serializing" means flushing the store buffer, too, but you don't want or need that). In practice `rdtscp` is probably implemented like `lfence; rdtsc`. At the start of a timed region you might still want `lfence; rdtsc; lfence`. – Peter Cordes Nov 12 '19 at 15:01
  • Interesting results on AMD. Are you sure your CPU clocks were "warmed up"? Remember that the TSC only counts at a fixed "reference" frequency, but at idle the CPU clock speed is much lower. On Intel CPUs the TSC is usually approximately the max non-turbo clock speed, but I don't know what AMD does. It's possible they might increment by 36 on your specific CPU; did you try to calibrate your RDTSCP using a known-speed loop like a chain of dependent `add` instructions that will run at 1 cycle per `add` to see how many TSC counts per core clock cycle you see? – Peter Cordes Nov 12 '19 at 15:04
  • Wait, you're still writing this in pure C? How did you compile (compiler version/options)? Is optimization enabled? If not, there might be a store in your timed region, not just a fast `lea`. What asm is actually running for the timed region? Anyway, I don't have an AMD CPU, only Skylake (and some older Intel sitting around.) – Peter Cordes Nov 12 '19 at 15:06
  • Ok yes, 100k iterations should be enough to wam up the CPU. Why are you using `mfence`? That *does* force the store buffer to be flushed to L1d cache. On AMD it's also serializing on the instruction stream, but on some Intel CPUs `mfence` doesn't block out-of-order execution of non-memory instructions. I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen. No idea what you're hoping to learn here by cycle-accurate timing of anti-optimized gcc `-O0` output. – Peter Cordes Nov 12 '19 at 15:27
  • Edited the answer, again. – tzq71871 Nov 12 '19 at 15:41
  • Oh, `calculation_to_measure()` isn't a function after all? Ok that's less terrible, but you're still measuring time to drain the store buffer which is *not* normally part of the real cost. Store forwarding shortcuts that when you reload a store result. `lfence` would make some sense here but `mfence` makes no sense. You claim *During the measurements, there is no memory access* but look at the asm: it's keeping `a21` in memory in a global variable which gets loaded into EAX and then stored after the `add`. You could use a `register int a21` local variable, or enable at least `-Og` or -O1. – Peter Cordes Nov 12 '19 at 15:52
  • `calculation_to_measure()` is outside of the main part. I am just calculating the sums and averages of the array. It is not/should not affecting the measurements because score array is changed during operations and I am not updating it once again in the `calculation_to_measure` function. I'll try do the `register int` and try to calculate the new results. – tzq71871 Nov 12 '19 at 15:59
  • I couldn't edit the comment above. `calculation_to_measure` is also another `#define` function. I misread it to `calculate_sum_avg` – tzq71871 Nov 12 '19 at 16:13
  • Yup, I could tell it was a `#define` of `+=3` on a global or static variable from looking at the un-optimized asm. That's what I was commenting about. `register int foo` will only work if it's a local, though. – Peter Cordes Nov 12 '19 at 16:17
  • I added the different optimization for register values. Again, the point does not change. AMD machine gives 36 cycle resolution during measurements and I want 2-4 cycle accuracy to exactly measure the timing. On the other hand, Intel gives better resolution with the same code. – tzq71871 Nov 12 '19 at 16:29

The code:

register unsigned long a21 asm("r13");

#define calculation_to_measure(){\
    a21 +=3;\
}
#define initvars(){\
    read1 = 0;\
    read2 = 0;\
    a21= 21;\
}
// =========== RDTSCP, double mfence ================
// Reference code, others are similar
void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2-read1;
        initvars();
    }
    calculate_sum_avg(scores);
}

Results: I only ran these on the AMD Ryzen machine.

Using gcc main.c -O0 -o rdtsc, no optimization. It moves r13 to rax, adds, and moves it back.

Disassembled code:

 9ac:   0f ae f0                mfence 
 9af:   0f 01 f9                rdtscp 
 9b2:   48 89 05 7f 16 20 00    mov    %rax,0x20167f(%rip)        # 202038 <read1>
 9b9:   0f ae f0                mfence 
 9bc:   4c 89 e8                mov    %r13,%rax
 9bf:   48 83 c0 03             add    $0x3,%rax
 9c3:   49 89 c5                mov    %rax,%r13
 9c6:   0f ae f0                mfence 
 9c9:   0f 01 f9                rdtscp 
 9cc:   48 89 05 5d 16 20 00    mov    %rax,0x20165d(%rip)        # 202030 <read2>
 9d3:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
    Sum:            32846796
    Avg:            328
    Sum, removed outliers:  32626008
    Avg, removed outliers:  327
Mfenced rtdsc reads:
    Sum:            18235980
    Avg:            182
    Sum, removed outliers:  18108180
    Avg, removed outliers:  181
Lfenced rtdscP reads:
    Sum:            14351508
    Avg:            143
    Sum, removed outliers:  14238432
    Avg, removed outliers:  142
Lfenced rtdsc reads:
    Sum:            11179368
    Avg:            111
    Sum, removed outliers:  10994400
    Avg, removed outliers:  115
NOT fenced rtdscP reads:
    Sum:            6064488
    Avg:            60
    Sum, removed outliers:  6064488
    Avg, removed outliers:  60
NOT fenced rtdsc reads:
    Sum:            3306394
    Avg:            33
    Sum, removed outliers:  3278450
    Avg, removed outliers:  35

Using gcc main.c -Og -o rdtsc_global

Disassembled code:

 934:   0f ae f0                mfence 
 937:   0f 01 f9                rdtscp 
 93a:   48 89 05 f7 16 20 00    mov    %rax,0x2016f7(%rip)        # 202038 <read1>
 941:   0f ae f0                mfence 
 944:   49 83 c5 03             add    $0x3,%r13
 948:   0f ae f0                mfence 
 94b:   0f 01 f9                rdtscp 
 94e:   48 89 05 db 16 20 00    mov    %rax,0x2016db(%rip)        # 202030 <read2>
 955:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
    Sum:            22819428
    Avg:            228
    Sum, removed outliers:  22796064
    Avg, removed outliers:  227
Mfenced rtdsc reads:
    Sum:            20630736
    Avg:            206
    Sum, removed outliers:  19937664
    Avg, removed outliers:  199
Lfenced rtdscP reads:
    Sum:            13375008
    Avg:            133
    Sum, removed outliers:  13374144
    Avg, removed outliers:  133
Lfenced rtdsc reads:
    Sum:            9840312
    Avg:            98
    Sum, removed outliers:  9774036
    Avg, removed outliers:  97
NOT fenced rtdscP reads:
    Sum:            8784684
    Avg:            87
    Sum, removed outliers:  8779932
    Avg, removed outliers:  87
NOT fenced rtdsc reads:
    Sum:            3274209
    Avg:            32
    Sum, removed outliers:  3255480
    Avg, removed outliers:  36

Using -O1 optimization: gcc main.c -O1 -o rdtsc_o1

Disassembled code:

 a89:   0f ae f0                mfence 
 a8c:   0f 31                   rdtsc  
 a8e:   48 89 05 a3 15 20 00    mov    %rax,0x2015a3(%rip)        # 202038 <read1>
 a95:   0f ae f0                mfence 
 a98:   49 83 c5 03             add    $0x3,%r13
 a9c:   0f ae f0                mfence 
 a9f:   0f 31                   rdtsc  
 aa1:   48 89 05 88 15 20 00    mov    %rax,0x201588(%rip)        # 202030 <read2>
 aa8:   0f ae f0                mfence 

Results:

Mfenced rtdscP reads:
    Sum:            28041804
    Avg:            280
    Sum, removed outliers:  27724464
    Avg, removed outliers:  277
Mfenced rtdsc reads:
    Sum:            17936460
    Avg:            179
    Sum, removed outliers:  17931024
    Avg, removed outliers:  179
Lfenced rtdscP reads:
    Sum:            7110144
    Avg:            71
    Sum, removed outliers:  7110144
    Avg, removed outliers:  71
Lfenced rtdsc reads:
    Sum:            6691140
    Avg:            66
    Sum, removed outliers:  6672924
    Avg, removed outliers:  66
NOT fenced rtdscP reads:
    Sum:            5970888
    Avg:            59
    Sum, removed outliers:  5965236
    Avg, removed outliers:  59
NOT fenced rtdsc reads:
    Sum:            3402920
    Avg:            34
    Sum, removed outliers:  3280111
    Avg, removed outliers:  35
tzq71871
  • `read1` and `read2` don't need to be globals. GCC ends up storing the rdtsc result to memory before mfence, inside the timed region. That's horrible and defeats most of the purpose of using `register int a21`. If you used locals, GCC could keep them in a register. BTW, `-Og` and `-O1` made the same asm for the timed region; it's just clutter to include both results separately. (Except you showed different versions of – Peter Cordes Nov 12 '19 at 16:42
  • Ok. I'll try to see the difference if I put reads in the local scope. One more thing that I need to ask. As I am trying to get almost a cycle accurate resolution, where should I store the `counter` value, if I use 2 threads to count? If it is in the global variable, then as you said, there would be no purpose to do measurement. Should I declare it in `main` and send `counter` to the functions as arguments? – tzq71871 Nov 12 '19 at 16:55
  • Your counter could be a local inside `rdtscp_doublemfence`. A register-asm global is ok but unnecessary. `read1` and `read2` should *definitely* be locals. They don't have any long-term meaning. – Peter Cordes Nov 12 '19 at 17:09
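A sketch of that suggested change (read1, read2, and the measured variable as locals, so that with -O1 or higher GCC keeps them in registers and nothing is stored inside the timed region; my illustration, not tested code from the thread):

void rdtscp_doublemfence(void){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rtdscP reads:\n");
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        uint64_t read1, read2;   // locals: no store/reload inside the timed region
        unsigned long a21 = 21;
        mfence();
        rdtscp(read1);
        mfence();
        a21 += 3;                // the measured operation
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        asm volatile("" :: "r"(a21)); // keep the add from being optimized away
    }
    calculate_sum_avg(scores);
}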