
What is the difference between using the `t = __rdtsc()` intrinsic directly and using

 asm __volatile__ (
       "rdtsc               \n"
       : "=a" (t));   

I have seen on Stack Overflow that many people recommend against using inline assembly, but I have also seen that the source code accompanying papers at many top conferences uses inline assembly. Is there any difference between the two?

Gerrie
  • People recommend not using GCC inline assembly because it is very easy to screw up things that seem very basic. The code may work for a while and then magically break at a later date. Using a compiler intrinsic has the benefit of not relying on the nuances of inline assembly. For example, the inline assembly you show is actually buggy because the `rdtsc` instruction clobbers EDX but the inline assembly never tells the compiler that. This can lead to unexpected behaviour if the compiler assumes that EDX has the same value before and after the inline assembly. – Michael Petch Nov 30 '20 at 12:20
  • If I use `t = __rdtscp(&junk)` instead of inline assembly, will it behave differently? For example, will the `junk` variable affect the cache? – Gerrie Nov 30 '20 at 12:26
  • The `__rdtscp` intrinsic is a very thin wrapper around the `rdtscp` instruction that doesn't generate fence or serializing instructions. The MS docs say this about the intrinsic: _This instruction waits until all previous instructions have executed and all previous loads are globally visible. However, it isn't a serializing instruction. For more information, see the Intel and AMD manuals_. I should also point out that the `rdtscp` and `rdtsc` instructions are different. – Michael Petch Nov 30 '20 at 12:40
  • @MichaelPetch: Isn't that only the case for 32-bit x86? Or does it also clobber rdx on 64-bit? – R.. GitHub STOP HELPING ICE Nov 30 '20 at 15:58
  • @R..GitHubSTOPHELPINGICE: even in 64-bit code the result of RDTSC is returned in EDX:EAX, even though the result could have been held in a single 64-bit register. The high 32 bits of the registers RDX and RAX are set to 0. That behavior is documented in the instruction set architecture reference: https://www.felixcloutier.com/x86/rdtsc – Michael Petch Nov 30 '20 at 16:01
  • @R..GitHubSTOPHELPINGICE: I think AMD decided to leave instructions like RDTSC and RDPMC that used EDX:EAX unchanged for 64-bit mode so the decoders could decode them the same way, and the microcode could be the same. It is one of many short-sighted / conservative choices by AMD for minor things that made x86-64 less of an improvement over IA-32 than it could have been. (But it also had the short-term benefit of maybe making adoption by compiler / OS devs easier?) I assume AMD didn't want to be stuck spending extra silicon if AMD64 didn't catch on, and it took years for consumer Windows to go 64-bit. – Peter Cordes Nov 30 '20 at 20:58
  • Is using `t = __rdtsc()` the same as the inline assembly, as the question asks? – Gerrie Dec 01 '20 at 00:44
  • If the compiler is halfway decent at optimizing, and if you write the inline asm correctly, they should be the same. But why not look at the generated assembly and see for yourself? – Nate Eldredge Dec 01 '20 at 00:45
  • @NateEldredge: My answer on [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627) shows actual compiler output: there are sometimes missed optimizations if you just want the low 32 bits of the TSC difference, sometimes compilers still waste instructions dealing with the EDX high half. – Peter Cordes Dec 01 '20 at 03:24
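
As a minimal sketch of the fix described in the comments above, the inline assembly can declare both EAX and EDX as outputs, so the compiler knows that `rdtsc` overwrites them. This assumes GCC or Clang targeting x86 or x86-64; the helper names `rdtsc_asm` and `rdtsc_intrinsic` are made up for illustration and are not taken from the question.

```c
#include <stdint.h>
#include <x86intrin.h>  /* __rdtsc() on GCC/Clang; MSVC uses <intrin.h> instead */

/* Hypothetical helper: RDTSC via GNU extended inline asm.
   EAX ("=a") and EDX ("=d") are both listed as outputs, so the
   compiler is told that the instruction writes both registers. */
static inline uint64_t rdtsc_asm(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
    /* Combine the EDX:EAX halves into one 64-bit counter value. */
    return ((uint64_t)hi << 32) | lo;
}

/* Hypothetical helper: the same read through the compiler intrinsic. */
static inline uint64_t rdtsc_intrinsic(void)
{
    return __rdtsc();
}
```

Compiling both helpers with something like `gcc -O2 -S` and comparing the generated assembly, as suggested in the comments, is probably the most direct way to check whether the two forms differ on a given compiler.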
