I am sending network packets from one thread and receiving replies on a second thread that runs on a different CPU core. My process measures the time between send and receive of each packet (similar to ping). I am using rdtsc to get high-resolution, low-overhead timing, which my implementation needs.

All measurements look reliable. Still, I am worried about rdtsc accuracy across cores, since I've been reading some texts implying that the TSC is not synced between cores.

I found the following info about the TSC on Wikipedia:

Constant TSC behavior ensures that the duration of each clock tick is uniform and supports the use of the TSC as a wall clock timer even if the processor core changes frequency. This is the architectural behavior moving forward for all Intel processors.

Still, I am worried about accuracy across cores, and this is my question.

More Info

  • I run my process on an Intel Nehalem machine.
  • Operating System is Linux.
  • The "constant_tsc" cpu flag is set for all the cores.
avner
  • Have you considered using the HPET? – Aaron Klotz Sep 07 '10 at 21:31
  • I was not aware of HPET. I just read about it, and it seems to be a kind of high-precision timer (that is interrupt based) and not a clock. I need the ability to read a high-resolution clock on demand (example: at arrival of a network packet). – avner Sep 12 '10 at 14:45

6 Answers

Check for the X86_FEATURE_CONSTANT_TSC + X86_FEATURE_NONSTOP_TSC bits in CPUID (leaf 0x80000007, EDX bit 8; see the unsynchronized_tsc function of the Linux kernel for more checks).

Intel's Software Developer's Manual Vol. 3B, section 16.11.1 (Invariant TSC), says the following:

"16.11.1 Invariant TSC

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor's support for invariant TSC is indicated by CPUID.80000007H:EDX[8].

The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource."

So, if the TSC can be used as a wall clock, the counters are guaranteed to be in sync.

osgx
  • @avner, it's possible to check TSC variation between CPU cores/CPU packages with a simple 2-thread test, which does "ping-pong" using shared variables and busy-waiting for an event (no mutexes, only reads/writes; also an rdtsc reading). When the threads are pinned to different cores, they will give you tsc0-tsc1. Then set affinity in reverse order to get tsc1-tsc0. If both are equal, you have a synchronous TSC. – osgx Nov 10 '10 at 13:56
  • thanks osgx. your answer sounds very interesting. I found an equivalent answer at http://www.gossamer-threads.com/lists/xen/devel/185419 , from which I understand that constant_tsc + nonstop_tsc are equivalent to invariant (across processors) with a few more assumptions regarding the BIOS/mobo. My old tests showed no drift between CPUs - I am only concerned about future tests and about a guarantee that it works at customer sites. Bottom line: with your answer, I feel more confident than before; hence, I'll accept this answer and hope for good :-) – avner Nov 11 '10 at 07:52
  • @avner, some modern CPU might have no "invariant TSC", this feature must be checked. – osgx Nov 11 '10 at 16:54
  • You should still beware: while the TSC is guaranteed to be consistent across multiple cores with this flag, chances are the system may be equipped with multiple CPUs. – Suma Dec 22 '10 at 14:48
  • @Suma the reasoning in this answer is that the documentation saying you can count on using rdtsc for wall clock time means that you must be able to count on it being synchronized between cores. If that reasoning holds, why doesn't it also apply between CPUs? – Joseph Garvin May 16 '16 at 23:30

On recent processors you can do it between separate cores of the same package (i.e. a system with just one Core iX processor); you just can't do it across separate packages (processors), because they won't share the TSC. You can get away with it via CPU affinity (locking the relevant threads to specific cores), but then again it would depend on the way your application behaves.

On Linux you can check for the constant_tsc flag in /proc/cpuinfo to see whether the processor has a single TSC valid for the entire package. The raw bit is CPUID.80000007H:EDX[8].

What I have read around, but not yet confirmed programmatically, is that AMD CPUs from revision 11h onwards give this CPUID bit the same meaning.


In fact, it seems that cores don't share the TSC; check this thread: http://software.intel.com/en-us/forums/topic/388964

Summarizing: different cores do not share the TSC. Sometimes the TSC can get out of synchronization if a core changes to a specific energy state, but this depends on the kind of CPU, so you need to check the Intel documentation. It seems that most operating systems synchronize the TSC on boot.

I checked the differences between the TSC on different cores, using an exciter-reactor algorithm, on a Linux Debian machine with a Core i5 processor. The exciter process (on one core) wrote its TSC into a shared variable; when the reacting process detected a change in that variable, it compared the value with its own TSC. This is an example output of my test program:

TSC ping-pong test result:
TSC cores (exciter-reactor): 0-1
100 records, avrg: 159, range: 105-269
Dispersion: 13
TSC ping-pong test result:
TSC cores (exciter-reactor): 1-0
100 records, avrg: 167, range: 125-410
Dispersion: 13

The reaction time when the exciter CPU is 0 (159 ticks on average) is almost the same as when the exciter CPU is 1 (167 ticks). This indicates that they are pretty well synchronized (perhaps within a few ticks of difference). On other core pairs, results were very similar.
On the other hand, the rdtscp assembly instruction returns a value indicating the CPU on which the TSC was read. It is not your case, but it can be useful when you want to measure time in a simple code segment and you want to ensure that the process was not moved to another CPU in the middle of the code.

Will
    While this link may answer the question, it is better to include the essential parts of the answer here and provide the link for reference. Link-only answers can become invalid if the linked page changes. – Matthew Green Feb 19 '14 at 22:30
    @MatthewGreen I expanded my answer with some results of my own research. I left the previous url because it can still be useful and is not likely to become invalid. – Will Apr 04 '14 at 18:05

On Linux you can use clock_gettime(3) with CLOCK_MONOTONIC_RAW, which gives you nanosecond resolution and is not subject to NTP adjustments (if any happened).

nir
  • Thanks, still not good. CLOCK_MONOTONIC_RAW is undefined in my environment. I do have CLOCK_MONOTONIC in time.h, which I already tried. It is true that struct timespec has nanosecond resolution; still, when calling clock_gettime with CLOCK_MONOTONIC, the last 3 digits always have the same value; hence, practically it is only microsecond resolution. – avner Aug 30 '10 at 08:42
  • Perhaps your system does not support high resolution timers? What do you get when you run this code: #include <stdio.h> #include <time.h> #include <unistd.h> int main() { while (1){ struct timespec ts; clock_gettime(CLOCK_MONOTONIC, &ts); printf("%ld %ld\n", ts.tv_sec, ts.tv_nsec); sleep(1); } return 0; } – nir Aug 31 '10 at 08:10
  • 1) Below are the first 5 lines output by your code (you can see that nsec always ends in 246): 279595 629885246 279596 630958246 279597 631777246 279598 633596246 2) An additional problem with clock_gettime is its overhead. According to my statistics (taking the clock 1001 times repeatedly without sleep), the average overhead of clock_gettime(CLOCK_MONOTONIC, &ts) is 281 nsec on a strong Nehalem machine, while taking rdtsc consumes only 8 nsec on the same machine. – avner Sep 02 '10 at 13:14
  • This strongly suggests that you do not have high resolution timers enabled. Try your distro's documentation for this. As for the overhead, there is really not much you can do about that... – nir Sep 05 '10 at 11:45
  • My question was about rdtsc, because I need high resolution and low overhead. I only want to make sure of its reliability across cores, since I couldn't find documentation about it; though, my feeling is good. Besides, in order to use a high resolution timer (probably CLOCK_MONOTONIC_HR), I'd need to recompile the kernel. This is not an option, since I can't require that from all my customers. – avner Sep 06 '10 at 09:04
  • Then just make sure to disable cpu throttling and set your affinity to a specific cpu. – nir Sep 16 '10 at 08:52
    I already set cpu affinity :) My question is about 2 threads on 2 different CPU cores. My recv thread polls the NIC for packets without context switches and without delay for sending outbound packets. In my environment microseconds count a lot! – avner Oct 03 '10 at 13:05
  • Also last time I checked, unlike gettimeofday or CLOCK_REALTIME, CLOCK_MONOTONIC_RAW doesn't have a fast syscall implementation. – Yu Zhou Oct 15 '15 at 21:10

You can set thread affinity using the sched_setaffinity() API in order to run your thread on one CPU core.

Dima
  • I already set cpu affinity :( My question is about 2 threads on 2 different CPU cores. My recv thread polls the NIC for packets without context switches and without delay for sending outbound packets. In my environment microseconds count a lot! – avner Oct 17 '10 at 11:49
  • HPET sounds bad for my needs - see my previous comment about HPET. RDTSC looks great and reliable in hundreds of tests I did in a multiple-core environment (even on machines that were up for many weeks). In addition, please read the citation about "TSC as a wall clock timer" in my question. Bottom line: I am only looking for formal approval; practically, RDTSC does seem to do the work. – avner Oct 19 '10 at 13:55
  • Drift between cores may occur (by hundreds of milliseconds). – Dima Oct 26 '10 at 20:04

I recommend that you don't use rdtsc. Not only is it not portable, it's not reliable and generally won't work - on some systems the TSC does not update uniformly (e.g. if you're using SpeedStep). If you want accurate timing information you should set the SO_TIMESTAMP option on the socket and use recvmsg() to get the message with a (microsecond-resolution) timestamp.

Moreover, the timestamp you get with SO_TIMESTAMP actually IS the time the kernel got the packet, not when your task happened to notice.

MarkR
  • Thanks for the answer. Notice that with the constant_tsc flag, rdtsc does update uniformly; see the quote I added to my question. SO_TIMESTAMP is at msec precision, while rdtsc is at nsec precision, and this is the precision I need. I am not interested in the time the packet arrived at the kernel, but in the time the user got it, since this is the part my application accelerates. – avner Aug 03 '10 at 07:26