Then you're using rdtsc wrong. It's not serializing so you need lfence
around the timed region. See my answer on How to get the CPU cycle
count in x86_64 from C++?. But yes, rdtsc is expensive, and rdpmc is
only somewhat lower overhead.
Ok, I did my homework.
First things first: I knew that rdtscp is a serializing instruction. I am not talking about rdtsc; note the P at the end. I have checked both the Intel and AMD manuals for that. Correct me if I am wrong, but from what I read I understand that I don't need fence instructions before and after rdtscp, because it is a serializing instruction, right?
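For reference, my understanding of the pattern from the linked answer is roughly the sketch below. This is my own intrinsics version, not code copied from that answer: lfence around the plain rdtsc so the timed work cannot start before the first read, and an lfence after the closing rdtscp so later instructions cannot start before the second read.

#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence */

/* Sketch of the fencing pattern from the linked answer, as I read it. */
static inline uint64_t tsc_start(void)
{
    _mm_lfence();                 /* wait for earlier instructions to finish */
    uint64_t t = __rdtsc();
    _mm_lfence();                 /* keep the timed work from starting before the read */
    return t;
}

static inline uint64_t tsc_stop(void)
{
    unsigned int aux;
    uint64_t t = __rdtscp(&aux);  /* waits for earlier instructions, reads TSC and TSC_AUX */
    _mm_lfence();                 /* keep later instructions from starting before the read */
    return t;
}

Usage would then be read1 = tsc_start(); (timed work); read2 = tsc_stop();.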
Second, I ran some experiments on three of my machines. Here are the results.
Ryzen experiments:
======================= AMD RYZEN EXPERIMENTS =========================
RYZEN 3600
100_000 iterations
Workload: a *= 3
Note that almost all the sums are divisible by 36, which is my machine's timer resolution.
I also checked the cases where a sum is not divisible by 36; that only happens when I don't use fence instructions with rdtsc. There, the individual readings come out as 35 or 1, which I believe means rdtsc is not reading the counter correctly.
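The check itself was nothing special; roughly this sketch (not the exact code I used):

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

/* Sketch: count and print the per-iteration readings that are not a
 * multiple of 36 (the timer step on this machine). */
static void check_divisibility(const uint64_t *scores, int n)
{
    int odd = 0;
    for (int i = 0; i < n; i++) {
        if (scores[i] % 36 != 0) {
            odd++;
            printf("scores[%d] = %" PRIu64 "\n", i, scores[i]);
        }
    }
    printf("%d of %d readings are not a multiple of 36\n", odd, n);
}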
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 25884432 | 258 |          25800120 |               258
mfence + rdtsc     | 17579196 | 175 |          17577684 |               175
lfence + rdtscp    |  7511688 |  75 |           7501608 |                75
lfence + rdtsc     |  7024428 |  70 |           7015248 |                70
no fence + rdtscp  |  6024888 |  60 |           6024888 |                60
no fence + rdtsc   |  3274866 |  32 |           3232913 |                35
======================================================
Workload: 3 dependent floating-point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 36217404 | 362 |          36097164 |               361
mfence + rdtsc     | 22973400 | 229 |          22939236 |               229
lfence + rdtscp    | 13178196 | 131 |          13177872 |               131
lfence + rdtsc     | 12631932 | 126 |          12631932 |               126
no fence + rdtscp  | 12115548 | 121 |          12103236 |               121
no fence + rdtsc   |  3335997 |  33 |           3305333 |                35
=================== END OF AMD RYZEN EXPERIMENTS =========================
And here are the Bulldozer architecture experiments.
======================= AMD BULLDOZER EXPERIMENTS =========================
AMD A6-4455M
100_000 iterations
Workload: a *= 3
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 32120355 | 321 |          27718117 |               278
mfence + rdtsc     | 23739715 | 237 |          23013028 |               230
lfence + rdtscp    | 14274916 | 142 |          13026199 |               131
lfence + rdtsc     | 11083963 | 110 |          10905271 |               109
no fence + rdtscp  |  9361738 |  93 |           8993886 |                90
no fence + rdtsc   |  4766349 |  47 |           4310312 |                43
=================================================================
Workload: 3 dependent floating-point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 38748536 | 387 |          36719312 |               368
mfence + rdtsc     | 35106459 | 351 |          33514331 |               335
lfence + rdtscp    | 23867349 | 238 |          23203849 |               232
lfence + rdtsc     | 21991975 | 219 |          21394828 |               215
no fence + rdtscp  | 19790942 | 197 |          19701909 |               197
no fence + rdtsc   | 10841074 | 108 |          10583085 |               106
=================== END OF AMD BULLDOZER EXPERIMENTS =========================
Intel results are:
======================= INTEL EXPERIMENTS =========================
INTEL 4710HQ
100_000 iterations
Workload: a *= 3
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 10914893 | 109 |          10820879 |               108
mfence + rdtsc     |  7866322 |  78 |           7606613 |                76
lfence + rdtscp    |  4823705 |  48 |           4783842 |                47
lfence + rdtsc     |  3634106 |  36 |           3463079 |                34
no fence + rdtscp  |  2216884 |  22 |           1435830 |                17
no fence + rdtsc   |  1736640 |  17 |            986250 |                12
===================================================================
Workload: 3 dependent floating-point divisions:
div1 = div1 / read1;
div2 = div2 / div1;
div3 = div3 / div2;
                   |      Sum | Avg | Sum (no outliers) | Avg (no outliers)
mfence + rdtscp    | 22008705 | 220 |          16097871 |               177
mfence + rdtsc     | 13086713 | 130 |          12627094 |               126
lfence + rdtscp    |  9882409 |  98 |           9753927 |                97
lfence + rdtsc     |  8854943 |  88 |           8435847 |                84
no fence + rdtscp  |  7302577 |  73 |           7190424 |                71
no fence + rdtsc   |  1726126 |  17 |           1029630 |                12
=================== END OF INTEL EXPERIMENTS =========================
From my point of view, the AMD Ryzen should have been faster: my Intel CPU is almost 5 years old and the AMD CPU is brand new.
I couldn't find the exact source, but I have read that AMD changed (decreased) the resolution of the rdtsc and rdtscp instructions when moving from the Bulldozer architecture to Ryzen. That would explain why I get multiples of 36 when I try to time the code. I don't know why they did it or where I read it, but that is what I observe. If you have an AMD Ryzen machine, I would suggest running the experiments and looking at the timer outputs.
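If you want to check the granularity directly without my whole benchmark, a quick sketch like the one below (back-to-back rdtsc reads, smallest nonzero delta) should show the step size; if the claim above is right, on Ryzen it should come out as 36 or a multiple of it.

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>
#include <x86intrin.h>   /* __rdtsc */

/* Take back-to-back TSC readings and report the smallest nonzero delta,
 * i.e. the effective step of the counter as seen from user space. */
int main(void)
{
    uint64_t min_delta = UINT64_MAX;
    for (int i = 0; i < 1000000; i++) {
        uint64_t t1 = __rdtsc();
        uint64_t t2 = __rdtsc();
        uint64_t d  = t2 - t1;
        if (d != 0 && d < min_delta)
            min_delta = d;
    }
    printf("smallest nonzero rdtsc delta: %" PRIu64 "\n", min_delta);
    return 0;
}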
I haven't looked at rdpmc yet; I'll try to update when I have read up on it.
EDIT:
Following up on the comments below.
About warming up: all the experiments are in a single C program, so even if the CPU is not warmed up during the mfenced rdtscp run (the first experiment), it surely is warmed up for the later ones.
I am using C mixed with inline assembly. I compile with just gcc main.c -o main; AFAIK that means it compiles at -O0. gcc is version 7.4.0.
To reduce overhead even further, I declared my helpers as #define macros, so they are not called as functions, which means faster execution.
Example code showing how I did the experiments:
#define lfence() asm volatile("lfence\n");
#define mfence() asm volatile("mfence\n");
// Reading the low 32 bits (EAX) is enough here because the measured regions are short.
// For longer measurements I would need to shift RDX left by 32 and OR it in.
#define rdtscp(_readval) asm volatile("rdtscp\n": "=a"(_readval)::"rcx", "rdx");

void rdtscp_doublemfence(){
    uint64_t scores[MEASUREMENT_ITERATION] = {0};
    printf("Mfenced rdtscP reads:\n");
    initvars();
    for(int i = 0; i < MEASUREMENT_ITERATION; i++){
        mfence();
        rdtscp(read1);
        mfence();
        calculation_to_measure();
        mfence();
        rdtscp(read2);
        mfence();
        scores[i] = read2 - read1;
        initvars();
    }
    calculate_sum_avg(scores);
}
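The snippet above leaves out the includes, globals and helpers, which sit above this function in the real file. For completeness, plausible stand-ins would look something like the following; the real initvars, calculation_to_measure and calculate_sum_avg are not shown here, and the outlier filtering is omitted.

#include <stdint.h>
#include <stdio.h>
#include <inttypes.h>

#define MEASUREMENT_ITERATION 100000

uint64_t read1, read2;               /* globals, as in the disassembly below */
int a21 = 21;                        /* the variable used by the a *= 3 workload */

/* Helpers are #define'd, like the fences, so nothing is called during timing. */
#define initvars()               (a21 = 21)
#define calculation_to_measure() (a21 *= 3)

/* Prints the sum and average of all readings; outlier removal omitted here. */
void calculate_sum_avg(uint64_t *scores)
{
    uint64_t sum = 0;
    for (int i = 0; i < MEASUREMENT_ITERATION; i++)
        sum += scores[i];
    printf("Sum: %" PRIu64 "\n", sum);
    printf("Avg: %" PRIu64 "\n", sum / MEASUREMENT_ITERATION);
}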
EDIT2:
Why are you using mfence?
I wasn't using mfence in the first place. I was just doing rdtscp, then the work, then rdtscp again, and taking the difference.
No idea what you're hoping to learn here by cycle-accurate timing of anti-optimized gcc -O0 output.
I am not using any optimization because I want to measure how many cycles the instructions take to finish. I will be measuring code blocks that contain branches; if I enable optimization, the compiler might turn a branch into a conditional move (cmov), and that would defeat the whole point of the measurement.
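As an illustration (not my real workload), take something like the snippet below: with optimization enabled, a compiler is free to compile that if into a cmov, so there would be no branch left to measure.

/* Illustration only: an optimizing compiler may turn this branch into a
 * conditional move (cmov), removing the very branch I want to measure. */
int clamp_to_limit(int x, int limit)
{
    if (x > limit)     /* this branch is what I want to keep as a real branch */
        x = limit;
    return x;
}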
I wouldn't be surprised if the non-inline function call and other memory access (from disabling optimization, /facepalm) being mfenced is what makes it a multiple of 36 on your Ryzen.
Also, below is the disassembly of the measured region. During the measurement there are no calls to other functions, and the only memory accesses are read1, read2 and the variable being multiplied, which I believe stay in cache.
9fd: 0f ae f0 mfence
a00: 0f 01 f9 rdtscp
a03: 48 89 05 36 16 20 00 mov %rax,0x201636(%rip) # 202040 <read1>
a0a: 0f ae f0 mfence
a0d: 8b 05 15 16 20 00 mov 0x201615(%rip),%eax # 202028 <a21>
a13: 83 c0 03 add $0x3,%eax #Either this or division operations for measurement
a16: 89 05 0c 16 20 00 mov %eax,0x20160c(%rip) # 202028 <a21>
a1c: 0f ae f0 mfence
a1f: 0f 01 f9 rdtscp
a22: 48 89 05 0f 16 20 00 mov %rax,0x20160f(%rip) # 202038 <read2>
a29: 0f ae f0 mfence
a2c: 48 8b 15 05 16 20 00 mov 0x201605(%rip),%rdx # 202038 <read2>
a33: 48 8b 05 06 16 20 00 mov 0x201606(%rip),%rax # 202040 <read1>
a3a: 48 29 c2 sub %rax,%rdx
a3d: 8b 85 ec ca f3 ff mov -0xc3514(%rbp),%eax