Following Intel's white paper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures", I use the code below:
static inline uint64_t bench_start(void)
{
    unsigned cycles_low, cycles_high;
    asm volatile("CPUID\n\t"
                 "RDTSCP\n\t"
                 "mov %%edx, %0\n\t"
                 "mov %%eax, %1\n\t"
                 : "=r" (cycles_high), "=r" (cycles_low)
                 :: "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t) cycles_high << 32) | cycles_low;
}
static inline uint64_t bench_end(void)
{
    unsigned cycles_low, cycles_high;
    asm volatile("RDTSCP\n\t"
                 "mov %%edx, %0\n\t"
                 "mov %%eax, %1\n\t"
                 "CPUID\n\t"
                 : "=r" (cycles_high), "=r" (cycles_low)
                 :: "%rax", "%rbx", "%rcx", "%rdx");
    return ((uint64_t) cycles_high << 32) | cycles_low;
}
However, I have also seen people use the code below:
static inline uint64_t bench_start(void)
{
    unsigned cycles_low, cycles_high;
    /* RDTSCP also writes IA32_TSC_AUX into ECX, so RCX is listed as clobbered */
    asm volatile("RDTSCP\n\t"
                 : "=d" (cycles_high), "=a" (cycles_low)
                 :: "%rcx");
    return ((uint64_t) cycles_high << 32) | cycles_low;
}
static inline uint64_t bench_end(void)
{
    unsigned cycles_low, cycles_high;
    /* RDTSCP also writes IA32_TSC_AUX into ECX, so RCX is listed as clobbered */
    asm volatile("RDTSCP\n\t"
                 : "=d" (cycles_high), "=a" (cycles_low)
                 :: "%rcx");
    return ((uint64_t) cycles_high << 32) | cycles_low;
}
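As you know, RDTSCP is only pseudo-serializing: it waits for all earlier instructions to execute, but later instructions may begin before it reads the counter. So why would someone use the second version? I can guess two reasons:
Maybe in most situations RDTSCP alone is enough to ensure complete in-order execution?
Maybe they just want to avoid CPUID for efficiency?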