
Suppose all the cores in my CPU run at the same frequency. Technically, I could synchronize a (system time, time stamp counter) pair for each core every millisecond or so. Then, based on the core I am currently running on, I could take the current rdtsc value and, using the tick delta divided by the core frequency, estimate the time elapsed since I last synchronized that pair and deduce the current system time without the overhead of a system call from my current thread (assuming no locks are needed to retrieve the above data).

This works great in theory, but in practice I find that sometimes I get more ticks than I would expect. That is, if my core frequency is 1 GHz and I took the (system time, time stamp counter) pair 1 millisecond ago, I would expect a tick delta of around 10^6, but actually it can be anywhere between 10^6 and 10^7.

I'm not sure what is wrong. Can anyone share their thoughts on how to calculate system time using rdtsc?

My main objective is to avoid performing a system call every time I want to know the system time, and to be able to perform a calculation in user space that gives me a good estimation of it (currently I define a good estimation as a result that is within 10 microseconds of the real system time).
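
To make the idea concrete, here is a rough sketch of the scheme I have in mind (hypothetical names, per-slot synchronization omitted, and a hard-coded 1 GHz tick rate just for illustration):

#include <stdint.h>
#include <time.h>        // clock_gettime
#include <x86intrin.h>   // __rdtscp

struct CalibPoint {
    int64_t  sys_ns;     // system time at calibration, in nanoseconds
    uint64_t tsc;        // TSC value sampled at (nearly) the same moment
};

static CalibPoint g_calib[4096];      // one slot per logical CPU
static const double g_tsc_hz = 1e9;   // assumed tick rate (1 GHz), as in my example

// Slow path: run this on every core every millisecond or so.
void calibrate_current_core() {
    struct timespec ts;
    unsigned aux;
    clock_gettime(CLOCK_REALTIME, &ts);
    uint64_t tsc = __rdtscp(&aux);    // aux = IA32_TSC_AUX; low bits hold the CPU id on Linux
    unsigned cpu = aux & 0xfff;
    g_calib[cpu].sys_ns = (int64_t)ts.tv_sec * 1000000000LL + ts.tv_nsec;
    g_calib[cpu].tsc    = tsc;
}

// Fast path: estimate the current system time without a system call.
int64_t estimated_sys_ns() {
    unsigned aux;
    uint64_t tsc = __rdtscp(&aux);
    const CalibPoint &c = g_calib[aux & 0xfff];
    uint64_t dticks = tsc - c.tsc;
    return c.sys_ns + (int64_t)(dticks * (1e9 / g_tsc_hz));
}

The problem is that dticks is sometimes about ten times larger than the elapsed wall-clock time would suggest.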

e271p314

2 Answers


The idea is not unsound but it is not suited for user-mode applications, for which, as @Basile suggested, there are better alternatives.

Intel itself suggests to use the TSC as a wall-clock:

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states.
This is the architectural behaviour moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.

However, care must be taken.

The TSC is not always invariant

In older processors the TSC was incremented on every internal clock cycle; it was not a wall clock.
Quoting Intel:

For Pentium M processors (family [06H], models [09H, 0DH]); for Pentium 4 processors, Intel Xeon processors (family [0FH], models [00H, 01H, or 02H]); and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle.

The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel® SpeedStep® technology transitions may also impact the processor clock.

If you only have a variant TSC, the measurements are unreliable for tracking time. There is hope with the invariant TSC, though.
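
You can test for the invariant TSC at run time: CPUID.80000007H:EDX[8] is the invariant-TSC flag. A minimal sketch, assuming GCC/Clang's <cpuid.h>:

#include <cpuid.h>
#include <stdio.h>

// Returns true if CPUID reports an invariant TSC (CPUID.80000007H:EDX[8]).
static bool has_invariant_tsc(void) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x80000007, &eax, &ebx, &ecx, &edx))
        return false;                 // extended leaf not supported
    return (edx >> 8) & 1;
}

int main(void) {
    printf("invariant TSC: %s\n", has_invariant_tsc() ? "yes" : "no");
}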

The TSC is not incremented at the frequency advertised in the brand string

Still quoting Intel

the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency.
On certain processors, the TSC frequency may not be the same as the frequency in the brand string.

You can't simply take the frequency written on the box of the processor.
See below.

rdtsc is not serialising

You need to serialise it from above and below.
See this.
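
A common pattern, sketched with intrinsics (on Intel, LFENCE waits for prior instructions to complete locally and keeps later ones from starting early; on AMD this behaviour depends on an MSR bit that modern kernels set):

#include <stdint.h>
#include <x86intrin.h>   // __rdtsc, _mm_lfence

// Read the TSC with fences on both sides so surrounding instructions
// cannot be reordered across the read.
static inline uint64_t rdtsc_fenced(void) {
    _mm_lfence();                 // prior instructions complete first
    uint64_t t = __rdtsc();
    _mm_lfence();                 // later instructions do not start early
    return t;
}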

The TSC is based on the ART (Always Running Timer) when invariant

The correct formula is

TSC_Value = (ART_Value * CPUID.15H:EBX[31:0]) / CPUID.15H:EAX[31:0] + K

See section 17.15.4 of the Intel manual 3.

Of course, you have to solve for ART_Value since you start from a TSC_Value. You can ignore K, as you are interested in deltas only. From the ART_Value delta you can get the time elapsed once you know the frequency of the ART. This is given as k * B, where k is a constant in the MSR MSR_PLATFORM_INFO and B is 100 MHz or 133⅓ MHz depending on the processor.

As @BeeOnRope pointed out, from Skylake the ART crystal frequency is no longer the bus frequency.
The actual values, maintained by Intel, can be found in the turbostat.c file.

switch(model) 
{
case INTEL_FAM6_SKYLAKE_MOBILE: /* SKL */
case INTEL_FAM6_SKYLAKE_DESKTOP:    /* SKL */
case INTEL_FAM6_KABYLAKE_MOBILE:    /* KBL */
case INTEL_FAM6_KABYLAKE_DESKTOP:   /* KBL */
    crystal_hz = 24000000;  /* 24.0 MHz */
    break;
case INTEL_FAM6_SKYLAKE_X:  /* SKX */
case INTEL_FAM6_ATOM_DENVERTON: /* DNV */
    crystal_hz = 25000000;  /* 25.0 MHz */
    break;
case INTEL_FAM6_ATOM_GOLDMONT:  /* BXT */
    crystal_hz = 19200000;  /* 19.2 MHz */
    break;
default:
    crystal_hz = 0; 
}
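
Putting the pieces together, here is a sketch of deriving the TSC frequency from CPUID.15H plus the crystal frequency (leaf 15H reports the crystal in ECX on some models; when it is zero, fall back to a per-model table like the one above):

#include <cpuid.h>
#include <stdint.h>
#include <stdio.h>

// TSC_hz = crystal_hz * EBX / EAX (from CPUID.15H), falling back to a
// per-model crystal value when ECX does not report it.
static uint64_t tsc_hz_from_cpuid(uint64_t fallback_crystal_hz) {
    unsigned eax = 0, ebx = 0, ecx = 0, edx = 0;
    if (!__get_cpuid(0x15, &eax, &ebx, &ecx, &edx) || eax == 0 || ebx == 0)
        return 0;                                   // ratio not enumerated
    uint64_t crystal_hz = ecx ? ecx : fallback_crystal_hz;
    return crystal_hz * ebx / eax;
}

int main(void) {
    // 24 MHz is the Skylake/Kaby Lake client value from the table above.
    printf("TSC: %llu Hz\n", (unsigned long long)tsc_hz_from_cpuid(24000000));
}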

The TSC is not incremented when the processor enters a deep sleep

This should not be a problem on single-socket machines, but the Linux kernel has some comments about the TSC being reset even in non-deep sleep states.

Context switches will poison the measurements

There is nothing you can do about it.
This actually prevents you from time-keeping with the TSC.

Margaret Bloom
    On Intel, there is a further complication since Skylake: the TSC doesn't run at a multiple of BCLK (100/133 MHz), but at a multiple of the clock crystal, which is (for example) 24 MHz on Skylake, but has different frequencies on other architectures. Rather than fill in all the details, I would check out the [turbostat source](https://github.com/torvalds/linux/blob/master/tools/power/x86/turbostat/turbostat.c#L3436) which is maintained by Intel fairly reliably for new processor versions. The link above shows the way to calculate tsc hz where tsc != M * BCLK. – BeeOnRope Feb 12 '17 at 23:55
  • ... in particular, this means that the TSC frequency no longer even equals the _nominal_ CPU frequency, since usually M * 100 != T * 24, for any integer M, T. For example, with nominal frequency 2600 MHz, you have CPU multiplier 26, but the closest you can get for the TSC is T == 108, for a TSC freq of 2592 MHz. – BeeOnRope Feb 12 '17 at 23:57
  • @BeeOnRope Thanks, that link is useful! I've found different sources about the frequency of the ART crystal (some mentioning 24 MHz) but none being definitive. The TSC_value is a ratio of the ART_value; it is easy to find k1 and k2 such that 2600 = ART_freq * k1 / k2 even when ART_freq is 24. I know that Intel can choose to run the TSC at a different frequency. Is that what you mean in your second comment? – Margaret Bloom Feb 13 '17 at 07:35
  • What I mean about the last comment is that even though you could _choose_ `k1` and `k2` such that the TSC_Value is the same as the nominal frequency, in practice, it works the other way around: the TSC is determined by the hardware implementation, which uses clock multiplication off of some base clock. In Skylake the base clock switched from the BCLK to the 24 MHz/25 MHz/... clock crystal, and the multiplier is usually an integer. On my machine the multiplier is 108, such that 24 MHz * 108 = 2592 MHz, and `k1 == 216` and `k2 == 2`. The nominal CPU frequency is 26 * 100 = 2600 MHz. – BeeOnRope Feb 13 '17 at 17:43
  • So the `CPUID.15H` stuff is just the way Intel _represents_ the actual relationship between some base clock and the full TSC frequency, but the hardware implementation isn't as flexible as the representation. They probably use the `k1 / k2` ratio rather than a simple integer multiplier to reflect the fact that you often have clock multiplication (handled by `k1`) and sometimes you also have clock division (e.g., certain counters only count one out of every 1024 cycles or something), but the values are not arbitrary. – BeeOnRope Feb 13 '17 at 17:55
  • ... and so the upshot is that (nominal) CPU frequency and TSC frequency are somewhat decoupled on Skylake and beyond. This causes various small errors in tools which assume they are the same, and even in tools that don't (neither crystal source is exact, but when everything was on the same base the error usually "cancels out"). For example, the APERF and MPERF MSRs used to have a very exact relationship with the TSC counts, but now they don't, since the former is based on BCLK and the latter on the 24 MHz crystal. There are a few possibly good reasons to make this switch though... – BeeOnRope Feb 13 '17 at 17:59
  • @BeeOnRope When I run turbostat on my Coffee Lake machine (i9-9900K), it prints 3600 MHz as the TSC clock. I tried to find a fixed TSC frequency used for Coffee Lake by looking at the most recent Linux kernel source tree, but cannot find any fixed frequency, and it seems that the TSC frequency has been measured through CPUID on my machine. BTW, why does Intel choose to make the TSC not follow the nominal frequency? I don't have a Skylake machine so cannot test it, but does CPUID also return correct eax_crystal, ebx_tsc, crystal_hz values to calculate the Intel-specified TSC frequency? – ruach Sep 25 '20 at 03:03
  • @JaehyukLee - I didn't understand everything you are asking, but comments are not the right place. Create a separate question or question(s). – BeeOnRope Sep 25 '20 at 03:17
  • @BeeOnRope Oops. I think I misunderstood how the TSC frequency is calculated... sorry for being annoying. When I execute cpuid(0x15) it returns EAX:2 EBX:300, and assuming that Coffee Lake also has crystal_hz = 24000000, then it gives 3.6 GHz as its TSC frequency. I thought that Coffee Lake had its own crystal_hz, not the one assigned for Skylake. – ruach Sep 25 '20 at 03:27
  • @Jaehyuk Coffee lake is probably the same as SKL as it's the same uarch. The turbostat source is a great reference for calculating these things. – BeeOnRope Sep 25 '20 at 12:24

Don't do that, i.e. don't use the RDTSC machine instruction directly yourself (because your OS scheduler could reschedule other threads or processes at arbitrary moments, or slow down the clock). Use a function provided by your library or OS.

My main objective is to avoid the need to perform system call every time I want to know the system time

On Linux, read time(7), then use clock_gettime(2), which is really quick (and does not involve any slow system call) thanks to vdso(7).
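
For example (a minimal sketch, error checking omitted):

#include <stdio.h>
#include <time.h>

int main(void) {
    struct timespec ts;
    clock_gettime(CLOCK_REALTIME, &ts);   // vDSO fast path on Linux, no kernel entry
    printf("%lld.%09ld\n", (long long)ts.tv_sec, ts.tv_nsec);
    return 0;
}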

On a C++11-compliant implementation, simply use the standard <chrono> header. Standard C also has clock(3) (giving microsecond precision). On Linux, both use good-enough time measurement functions underneath (so, indirectly, the vDSO).
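
For instance, with <chrono> (on Linux with libstdc++ this typically ends up in clock_gettime, hence the vDSO):

#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    // ... code being timed ...
    auto t1 = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(t1 - t0);
    std::printf("elapsed: %lld ns\n", (long long)ns.count());
    return 0;
}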

Last time I measured clock_gettime it often took less than 4 nanoseconds per call.

Basile Starynkevitch
  • Would the [seqlock](https://elixir.bootlin.com/linux/latest/source/lib/vdso/gettimeofday.c#L66) within `clock_gettime` incur unwanted overhead if I invoke `clock_gettime` frequently? Specifically, I see about ~60ns of additional overhead when [using `clock_gettime` and `bpf_ktime_get_ns` to measure the CPU mode-switch overhead that a typical system call incurs](https://stackoverflow.com/questions/65753521/using-ebpf-to-measure-cpu-mode-switch-overhead-incured-by-making-system-call). Thanks! – Feng. Ma Jan 24 '21 at 17:01