18

In the Linux world, to get a nanosecond-precision timer/clock ticks one can use:

#include <time.h>   /* clock_gettime() is declared in <time.h>, not <sys/time.h> */

int foo()
{
   struct timespec ts;   /* the "struct" keyword is required in C */

   clock_gettime(CLOCK_REALTIME, &ts);
   //--snip--
}
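
For interval measurements, a minimal complete example might look like this (a sketch; CLOCK_MONOTONIC is usually preferred over CLOCK_REALTIME for deltas, and glibc versions before 2.17 need -lrt at link time):

#include <stdio.h>
#include <time.h>

int main(void)
{
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    /* ... work being timed ... */
    clock_gettime(CLOCK_MONOTONIC, &end);

    long long ns = (end.tv_sec - start.tv_sec) * 1000000000LL
                 + (end.tv_nsec - start.tv_nsec);
    printf("elapsed: %lld ns\n", ns);
    return 0;
}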

This answer suggests an asm approach to query the CPU clock directly with the RDTSC instruction.
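
A minimal sketch of that approach, assuming GCC inline asm on x86-64 (not taken from the linked answer):

#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    /* RDTSC places the 64-bit counter in EDX:EAX */
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}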

In a multi-core, multi-processor architecture, how is this clock-tick/timer value synchronized across multiple cores/processors? My understanding is that there is inherent fencing being done. Is this understanding correct?

Can you suggest some documentation that would explain this in detail? I am interested in Intel Nehalem and Sandy Bridge microarchitectures.

EDIT

Limiting the process to a single core or CPU is not an option, as the process is really huge (in terms of resources consumed) and I would like to optimally utilize all the resources in the machine, including all the cores and processors.

EDIT

Thanks for the confirmation that the TSC is synced across cores and processors. But my original question is: how is this synchronization done? Is it with some kind of fencing? Do you know of any public documentation?

Conclusion

Thanks for all the inputs. Here's the conclusion of this discussion: the TSCs are synchronized at initialization by a RESET that happens across the cores and processors in a multi-processor/multi-core system. After that, every core is on its own. The TSCs are kept invariant with a phase-locked loop that normalizes the frequency variations, and thus the clock variations, within a given core, and that is how the TSCs remain in sync across cores and processors.

Jay D
  • You can't count on clock_gettime() for nanosecond precision, by the way; it's only precise to within about a quarter microsecond. I ran into this when I was trying to get super-precise timings and found that gettime() itself cost more than 250ns. http://stackoverflow.com/questions/7935518/is-clock-gettime-adequate-for-submicrosecond-timing – Crashworks Jun 07 '12 at 23:05
  • if the TSC is used for providing the time stamp, it is supposed to reflect only delta nanoseconds. I am using Linux, and my understanding is that the kernel provides the expected performance; Windows, maybe not. – Jay D Jun 07 '12 at 23:16
  • @Crashworks please read my latest comment on the question link you shared. – Jay D Jun 07 '12 at 23:25
  • @Crashworks I am interested to know whether you see the performance hit with the latest-generation Intel processors and a recent Linux kernel (either 2.6 or 3.0) – Jay D Jun 07 '12 at 23:26
  • It was only Linux where I ran into trouble with gettime(). In my most recent test, it appears to still have the 250ns overhead of a kernel call. On Windows I just used the `rdtsc()` intrinsic which has about 1ns of overhead. – Crashworks Jun 07 '12 at 23:41
  • I do not understand the term "fencing" in this context. But after the /RESET signal no explicit resynchronization will ever occur. Note that a resynchronization would violate the "invariant" property of the "resynced" TSC anyway - what should it do if it needs to step back... But because the TSC runs at exactly the same speed on each core (they all have a common clock source) and they start at exactly the same time (because the /RESET signal is synced), they will always have the same value on each core. – Gunther Piez Jun 10 '12 at 09:35
  • The processor sockets, or even the cores in a given processor socket, may experience different environmental conditions such as temperature, which would result in frequency variations and thus TSC increment variations across cores in a processor and across processor sockets. This is my hypothesis; I couldn't find documentation supporting or opposing it. So also with the thought of "fencing" for keeping them in sync at a fixed interval! I think a resync after a fixed interval is bound to be required. – Jay D Jun 10 '12 at 20:05
  • Though the delta TSC would be a few nanoseconds, maybe it's part of the processor technology itself? You are welcome to prove me wrong with documentation! – Jay D Jun 10 '12 at 20:05
  • @JayD There is no separate clock source in each processor which could drift with environmental changes, there is only a PLL, which is, as the name says, _locked_. The socket clock (which is always fixed, for instance 100 MHz on a SB) comes from an external source (see Intel® Core™ Processor Family Desktop Datasheet Chapter 2.6 "clocking") and is multiplied by a fixed factor to get the TSC clock. – Gunther Piez Jun 11 '12 at 07:23
  • A phase locked loop suffers from phase noise (shift) as it follows its source. If there is a guarantee that there is one source of this particular clock for all destinations, and access to it in all cpu's is guaranteed to be consistent, there should be no problem. That's a lot of assuming, however. – John S Gruber Jun 14 '12 at 01:21
  • @JohnSGruber I am not a main board designer, but I am pretty sure those assumptions hold - I have yet to meet a multi socket main board with more than one clock source (which I think wouldn't really make sense anyway) and I am sure that the phase correction is described somewhere in the Intel docs :-) – Gunther Piez Jun 14 '12 at 09:07
  • It's not about multiple clock sources. It's about a PLL cell in each core essentially generating its own clock, one that not only has short-term period variations compared to all the other ones but also a non-zero long-term drift that differs from all the other cores. A multicore CPU uses one PLL per core; they are all referenced to the single clock source. But a PLL uses that single clock for reference only, and this referencing process introduces errors. – Kuba hasn't forgotten Monica Jun 16 '12 at 07:05
  • **PLL is to normalize the frequency variations within a given Core. THUS IT WON'T normalize the frequency variations and subsequent clock variations across the cores and across processors -- ONCE INITIAL syncing of TSCs is done with a RESET or otherwise.** – Jay D Jun 16 '12 at 09:11
  • I recommend reading http://en.wikipedia.org/wiki/Phase-locked_loop . The job of a PLL is _synchronizing_ two frequencies - the reference clock and the TSC clock in that case. That one is a multiple of the other doesn't really matter, that can be taken care of with a frequency divider. Since all TSC clocks are synchronized to the single reference base clock, they are also in sync with each other. – Gunther Piez Jun 16 '12 at 16:36
  • "THUS IT WON'T normalize the frequency variations and subsequent clock variations" - that is wrong. This exactly what a PLL does. BTW, bold type setting doesn't make it more right in any way ;-) – Gunther Piez Jun 16 '12 at 16:37
  • Drhirsch: you quoted half my statement. I said the PLL would normalize frequency on a given core, NOT across cores. And with the bold type I was trying to summarize the discussion; did I miss something? Regards – Jay D Jun 16 '12 at 19:02
  • Yes, you are right, the PLLs do not sync directly across Cores, and you are wrong because they do sync to a common clock and so do sync the TSCs indirectly. I understood it somehow wrong (I am not a native speaker). Anyway, it doesn't really matter :-) – Gunther Piez Jun 16 '12 at 21:03

4 Answers

26

Straight from Intel, here's an explanation of how recent processors maintain a TSC that ticks at a constant rate, is synchronized between cores and packages on a multi-socket motherboard, and may even continue ticking when the processor goes into a deep-sleep C-state. In particular, see the explanation by Vipin Kumar E K (Intel):

http://software.intel.com/en-us/articles/best-timing-function-for-measuring-ipp-api-timing/

Here's another reference from Intel discussing the synchronization of the TSC across cores. In this case they mention the fact that RDTSCP allows you to read both the TSC and the processor ID atomically. This is important in tracing applications... suppose you want to trace the execution of a thread that might migrate from one core to another; if you do that in two separate (non-atomic) instructions, then you have no certainty of which core the thread was on at the time it read the clock (a sketch follows the link below).

http://software.intel.com/en-us/articles/intel-gpa-tip-cannot-sychronize-cpu-timestamps/
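
As an illustration (my sketch, not from the article), reading the TSC and the processor ID in one shot with GCC's __rdtscp intrinsic could look like this:

#include <stdint.h>
#include <x86intrin.h>   /* __rdtscp() */

/* Returns the TSC; *aux receives IA32_TSC_AUX, which Linux
   initializes so that it encodes the CPU (and node) the read ran on. */
static inline uint64_t read_tsc_and_cpu(unsigned int *aux)
{
    return __rdtscp(aux);
}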

All sockets/packages on a motherboard receive two external common signals:

  1. RESET
  2. Reference CLOCK

All sockets see RESET at the same time when you power the motherboard, and all processor packages receive a reference clock signal from an external crystal oscillator; the internal clocks in the processor are kept in phase with it (although usually at a high multiplier, like 25x) by circuitry called a phase-locked loop (PLL). Recent processors will clock the TSC at the highest frequency (multiplier) for which the processor is rated (so-called constant TSC), regardless of the multiplier that any individual core may be using due to temperature or power-management throttling (so-called invariant TSC). Nehalem processors like the X5570 released in 2008 (and newer Intel processors) support a "non-stop TSC" that will continue ticking even when conserving power in a deep power-down C-state (C6). See this link for more information on the different power-down states:

http://www.anandtech.com/show/2199

Upon further research I came across a patent that Intel filed on 12/22/2009, published on 6/23/2011, entitled "Controlling Time Stamp Counter (TSC) Offsets For Mulitple Cores And Threads":

http://www.freepatentsonline.com/y2011/0154090.html

Google's page for this patent application (with link to USPTO page)

http://www.google.com/patents/US20110154090

From what I gather, there is one TSC in the uncore (the logic in a package surrounding the cores but not part of any core) which is incremented on every external bus clock by the value in the field of the model-specific register specified by Vipin Kumar in the link above (MSR_PLATFORM_INFO[15:8]). The external bus clock runs at 133.33 MHz. In addition, each core has its own TSC register, clocked by a clock domain that is shared by all cores and may be different from the clock of any one core; therefore there must be some kind of buffering when the core TSC is read by the RDTSC (or RDTSCP) instruction running in a core. For example, MSR_PLATFORM_INFO[15:8] may be set to 25 on a package; on every bus clock the uncore TSC increments by 25, and a PLL multiplies the bus clock by 25 and provides this clock to each of the cores to clock their local TSC registers, thereby keeping all TSC registers in sync. So, to map the terminology to actual hardware:

  • Constant TSC is implemented by using the external bus clock running at 133.33 MHz, which is multiplied by a constant multiplier specified in MSR_PLATFORM_INFO[15:8] (a sketch of reading this MSR on Linux follows this list)
  • Invariant TSC is implemented by keeping the TSC in each core on a separate clock domain
  • Non-stop TSC is implemented by having an uncore TSC that is incremented by MSR_PLATFORM_INFO[15:8] ticks on every bus clock; that way a multi-core package can go into deep power-down (C6 state) and shut down the PLL... there is no need to keep a clock at the higher multiplier. When a core resumes from the C6 state, its internal TSC gets initialized to the value of the uncore TSC (the one that didn't go to sleep), with an offset adjustment in case software has written a value to the TSC; the details are in the patent. If software does write to the TSC, then the TSC for that core will be out of phase with the other cores, but at a constant offset (the frequencies of the TSC clocks are all tied to the bus reference clock by a constant multiplier).
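
The MSR_PLATFORM_INFO field mentioned above can be inspected on Linux through the msr kernel module. A rough sketch (assuming the MSR address 0xCE documented for Nehalem/Sandy Bridge, the msr module loaded, and root privileges):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    uint64_t val;
    int fd = open("/dev/cpu/0/msr", O_RDONLY);   /* requires "modprobe msr" and root */
    if (fd < 0) { perror("open"); return 1; }

    /* for the MSR device, the file offset is the MSR address (0xCE = MSR_PLATFORM_INFO) */
    if (pread(fd, &val, sizeof(val), 0xCE) != (ssize_t)sizeof(val)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("TSC multiplier (bits 15:8): %llu\n",
           (unsigned long long)((val >> 8) & 0xFF));
    close(fd);
    return 0;
}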
amdn
  • Thanks for your answer. Your first link talks about a timing wrapper in the Intel IPP library (IPP is an image-processing library). The link merely states the same fact as mentioned above, that TSCs are synchronized across cores in modern-day processors, but it doesn't explain why, which was the original question! – Jay D Jun 16 '12 at 08:58
  • Your second link talks about how the Intel graphics tools report whether the TSCs are out of sync and how they cope with the delta TSCs; the article doesn't really talk about how the TSCs are synchronized. – Jay D Jun 16 '12 at 09:00
  • The third link talks about characteristics of Nehalem, and a phase-locked loop (PLL) would normalize the clock for a given core, NOT ACROSS cores and across processors. – Jay D Jun 16 '12 at 09:10
  • Jay, I found an Intel patent on this subject and will update my answer to include that link. Thanks for the bonus points. – amdn Jun 16 '12 at 19:07
  • I added two links to the patent and my interpretation in my answer above – amdn Jun 16 '12 at 23:58
19

On newer CPUs (i7 Nehalem+, IIRC) the TSC is synchronized across all cores and runs at a constant rate. So for a single processor, or for more than one processor on a single package or mainboard(!), you can rely on a synchronized TSC.

From the Intel System Manual 16.12.1

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward.

On older processors you cannot rely on either a constant rate or synchronization.
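
That CPUID bit can be checked from C; a small sketch using GCC's <cpuid.h> helper:

#include <cpuid.h>
#include <stdio.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* leaf 0x80000007 = Advanced Power Management; EDX[8] = invariant TSC */
    if (__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx) && (edx & (1u << 8)))
        printf("invariant TSC supported\n");
    else
        printf("invariant TSC not supported\n");
    return 0;
}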

Edit: At least on multiple processors in a single package or mainboard the invariant TSC is synchronized. The TSC is reset to zero at a /RESET and then ticks onward at a constant rate on each processor, without drift. The /RESET signal is guaranteed to arrive at each processor at the same time.

Gunther Piez
  • Note that this only applies to Intel processors. It's been a while since I did any testing on AMD (the most recent AMD CPU I tested was, IIRC, the Phenom II), but at the time they didn't even have synchronization across cores on a single die. – Eugene Smith Jun 08 '12 at 07:26
5

RDTSC is not synchronized across CPUs. Thus, you cannot rely on it in a multi-processor system. The only workaround I can think of for Linux would be to actually restrict the process to run on a single CPU by setting its affinity. This can be done externally using the taskset utility or "internally" using the sched_setaffinity or pthread_setaffinity_np functions.
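
A minimal sketch of the "internal" approach, pinning the calling process to CPU 0 (error handling abbreviated):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(0, &set);                              /* allow CPU 0 only */
    if (sched_setaffinity(0, sizeof(set), &set))   /* pid 0 = calling process */
        perror("sched_setaffinity");

    /* from here on, all RDTSC reads come from the same core's TSC */
    return 0;
}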

5

This manual, chapter 17.12, describes the invariant TSC used in the newest processors. Available starting with Nehalem, this timestamp, along with the rdtscp instruction, allows one to read a timestamp (not affected by wait states, etc.) and a processor signature in one atomic operation.

It is said to be suitable for calculating wall-clock time, but it obviously doesn't expect the value to be the same across processors. The stated idea is that you can see whether successive reads are from the same CPU's clock, or adjust for reads from multiple CPUs. "It can also be used to adjust for per-CPU differences in TSC values in a NUMA system."

See also rdtsc accuracy across CPU cores

However, I'm not sure that the final consistency conclusion in the accepted answer follows from the statement that the TSC can be used for wall-clock time. If it was consistent, what reason would there be for atomically determining the CPU source of the time?

N.B. The TSC information has moved from chapter 11 to chapter 17 in that Intel manual.

John S Gruber
  • `If it was consistent, what reason would there be for atomically determining the CPU source of the time.`: that's exactly the question I asked as part of this discussion. – Jay D Jun 14 '12 at 01:02
  • And I'm saying, given the information in the manual, that there's good reason to believe the time is invariant across CPU states, but not that it is across CPUs. That seems to be an inference being drawn, and I believe your caution is justified. Note that the instruction to read the CPU signature is new, too. I'd also suggest that if the TSC value is set by the kernel, its value (phase) won't be the same even if the TSCs are run by the same clock circuit and therefore have locked frequencies. – John S Gruber Jun 14 '12 at 01:14