How to use RDTSCP correctly?

Question

Following the Intel whitepaper "How to Benchmark Code Execution Times on Intel® IA-32 and IA-64 Instruction Set Architectures", I use the code below:

static inline uint64_t bench_start(void)
{
    unsigned cycles_low, cycles_high;
    asm volatile("CPUID\n\t"
        "RDTSCP\n\t"
        "mov %%edx, %0\n\t"
        "mov %%eax, %1\n\t"
        : "=r" (cycles_high), "=r" (cycles_low)
        ::"%rax", "%rbx", "%rcx", "%rdx");

    return (uint64_t) cycles_high << 32 | cycles_low;
}

static inline uint64_t bench_end(void)
{
     unsigned cycles_low, cycles_high;
     asm volatile("RDTSCP\n\t"
         "mov %%edx, %0\n\t"
         "mov %%eax, %1\n\t"
         "CPUID\n\t"
         : "=r" (cycles_high), "=r" (cycles_low)
         ::"%rax", "%rbx", "%rcx", "%rdx");
     return (uint64_t) cycles_high << 32 | cycles_low;
}

However, I have also seen people use the code below:

static inline uint64_t bench_start(void)
{
   unsigned cycles_low, cycles_high;
   asm volatile("RDTSCP\n\t"
                : "=d" (cycles_high), "=a" (cycles_low));
   return (uint64_t) cycles_high << 32 | cycles_low;
}

static inline uint64_t bench_end(void)
{
   unsigned cycles_low, cycles_high;
   asm volatile("RDTSCP\n\t"
                : "=d" (cycles_high), "=a" (cycles_low));
   return (uint64_t) cycles_high << 32 | cycles_low;
}

As you know, RDTSCP is only pseudo-serializing, so why would someone use the second version? I can guess two reasons:

Maybe in most situations RDTSCP alone is enough to ensure complete "in-order execution"?
Maybe they just want to avoid CPUID for efficiency?

I can't imagine how the second implementation can be justified due to the reason you have mentioned and stated in that Intel whitepaper. Also, the `rdtscp` in your bench_start() is redundant due to the previous cpuid call. You save a byte by just calling `rdtsc`, which is the way recommended in that awesome Intel whitepaper — Gavin Portwood, Jul 04 '17 at 05:29
**The second inline asm for `rdtscp` is unsafe. It clobbers ECX without telling the compiler.** Use a clobber, or better use the intrinsic. [Get CPU cycle count?](//stackoverflow.com/a/51907627). My answer on that Q&A also has some links to serializing before/after `rdtsc` with `lfence`. — Peter Cordes, Aug 18 '18 at 15:27
Possible duplicate of [Get CPU cycle count?](https://stackoverflow.com/questions/13772567/get-cpu-cycle-count) — Peter Cordes, Aug 18 '18 at 15:28
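
The comments above suggest two fixes: declare ECX as clobbered, or use the intrinsic and serialize with `lfence`. A minimal sketch of both, assuming GCC or Clang on x86-64 with `<x86intrin.h>` available. The function name `rdtscp_asm` is hypothetical, and `bench_start`/`bench_end` only reuse the question's names. This is one possible arrangement, not the whitepaper's CPUID-based method.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc, __rdtscp, _mm_lfence (GCC/Clang, x86 only) */

/* Start timestamp. The first lfence keeps rdtsc from executing before
 * earlier instructions have completed; the second keeps the code under
 * test from starting before the counter has been read. */
static inline uint64_t bench_start(void)
{
    _mm_lfence();
    uint64_t t = __rdtsc();
    _mm_lfence();
    return t;
}

/* End timestamp. rdtscp waits for earlier instructions to execute;
 * the trailing lfence keeps later instructions from starting early. */
static inline uint64_t bench_end(void)
{
    unsigned aux;                 /* receives IA32_TSC_AUX, delivered in ECX */
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}

/* Inline-asm form of the short version from the question, with ECX
 * declared as an output so the compiler knows it is overwritten. */
static inline uint64_t rdtscp_asm(void)
{
    unsigned lo, hi, aux;
    __asm__ volatile("rdtscp" : "=a"(lo), "=d"(hi), "=c"(aux));
    (void)aux;                    /* value not needed here */
    return ((uint64_t)hi << 32) | lo;
}
```

Typical use is `uint64_t t0 = bench_start();`, then the code under test, then `uint64_t t1 = bench_end();`, with `t1 - t0` giving elapsed TSC ticks (reference cycles, not core clock cycles). Whether `lfence` orders instruction execution this way on AMD CPUs depends on how the OS has configured the processor, so treat that part as an assumption to verify on your target.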

0 Answers