
I'm confused about whether the value returned by rdtscp increases monotonically in a multi-core environment. According to the documentation for __rdtscp, rdtscp seems to be a per-processor instruction that can also prevent instructions from being reordered around the call. The documentation says:

The processor monotonically increments the time-stamp counter MSR every clock cycle and resets it to 0 whenever the processor is reset.

rdtscp definitely increments monotonically on the same CPU core, but is this rdtscp timestamp guaranteed monotonic across different CPU cores? I believe there is no such absolute guarantee. For example,

Thread on CPU core#0                   Thread on CPU core#1

unsigned int ui;
uint64_t t11 = __rdtscp(&ui); 
uint64_t t12 = __rdtscp(&ui);  
uint64_t t13 = __rdtscp(&ui);         
                                       unsigned int ui;
                                       uint64_t t21 = __rdtscp(&ui);
                                       uint64_t t22 = __rdtscp(&ui);
                                       uint64_t t23 = __rdtscp(&ui);

As I understand it, t13 > t12 > t11 is certain, but t21 > t13 is not guaranteed.

I want to write a test program to check whether my understanding is correct, but I don't know how to construct an example that validates my hypothesis.

// file name: rdtscptest.cpp
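// g++ rdtscptest.cpp -g -pthread -Wall -O0 -o run
#include <chrono>
#include <cstdint>
#include <iostream>
#include <string>
#include <string.h>
#include <thread>
#include <vector>
#include <pthread.h>   // pthread_setaffinity_np
#include <sched.h>     // cpu_set_t, CPU_ZERO, CPU_SET
#include <x86intrin.h> // __rdtscp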

using namespace std;

void test(int tid) {
    std::this_thread::sleep_for (std::chrono::seconds (tid));
    unsigned int ui;
    uint64_t tid_unique_ = __rdtscp(&ui);
    std::cout << "tid: " << tid << ", counter: " << tid_unique_ << ", ui: " << ui << std::endl;
    std::this_thread::sleep_for (std::chrono::seconds (1));
}

int main() {
    size_t trd_cnt = 3 ;
    std::vector<std::thread> threads(trd_cnt);

    for (size_t i=0; i< trd_cnt; i++) {
        // three threads with tid: 0, 1, 2
        // force different threads to run on different cpu cores
        threads[i] = std::thread(test, i);  
        cpu_set_t cpuset;
        CPU_ZERO(&cpuset);
        CPU_SET(i, &cpuset);
        int rc = pthread_setaffinity_np(threads[i].native_handle(),
                                        sizeof(cpu_set_t), &cpuset);
        if (rc != 0) {
            std::cout << "Error calling pthread_setaffinity_np, code: " << rc << "\n";
        }
    }

    for (size_t i=0; i< trd_cnt; i++) {
        threads[i].join() ;
    }

    return 0;
}

So, two questions here:

  1. Is my understanding correct or not?
  2. How to construct an example to validate it?

==========updated, according to comments

__rdtscp will (always?) increment across cores on advanced cpus

Sep Roland
stickers
  • Different cores can change speed (frequency) at different rate. It means I think different cores can have totally different values for RDTSC, one can have even twice more than other, not just a bit. But I'm not sure, needs experimenting or other expert knowledge. – Arty Jan 31 '21 at 04:15
  • @Arty: RDTSC counts fixed-freq reference cycles, not core clock cycles, on all CPUs since at least Core2Duo. Earlier multi-core systems, like multi-socket one-core-per-package SMP systems, might have been different. But yes it's possible for the TSC to be out of sync across cores if the OS reset it, or the HW + firmware didn't get it synced on power-up across sockets. (answer with lots of TSC details: [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/a/51907627)). However, modern single-socket desktops do normally have synced TSCs across all cores. – Peter Cordes Jan 31 '21 at 04:20
  • @Arty same thought, but some people said `rdtsc` is synchronized across multi-cores, not to mention advanced version `rdtscp`, e.g., [CPU TSC fetch operation especially in multicore-multi-processor environment](https://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment) – stickers Jan 31 '21 at 04:23
  • TSC is a 64-bit counter that counts at ~4.2 GHz on some CPUs. It can in theory wrap if the computer has been "up" for over 2^32 seconds (about 136 years), or if the TSC has been manually set to have a big offset. Other than that, yes it's true that it's monotonic on one core. And with `constant_tsc` on a normal motherboard (even multi-socket), yes it will (almost always?) be monotonic across cores, as long as you actually do some thread synchronization to make sure one thread's code runs after the other's. You are *not* doing that. – Peter Cordes Jan 31 '21 at 04:56
  • The accepted answer in the Q/A you linked is mostly incorrect. The other answers there are OK. Another thing, if you want your code to be robust, it's better to check and handle the case of TSC not monotonically increasing on the same core. There are buggy CPUs on which this may happen. – Hadi Brais Jan 31 '21 at 11:59
  • @HadiBrais do you mean the accepted answer in this [link](disable-cross-partition-transactions) is incorrect, right? But the question is other answers are similar to the accepted answer: TSC is synchronized across multi-cores. – stickers Jan 31 '21 at 15:45
  • The one in [this](https://stackoverflow.com/questions/10921210/cpu-tsc-fetch-operation-especially-in-multicore-multi-processor-environment) Q/A. It quotes a part of the manual to imply that an invariant TSC is necessarily synchronized, but that part of the manual doesn't imply that and that statement is not necessarily true. – Hadi Brais Feb 01 '21 at 04:21

1 Answer


On most systems yes, if you create synchronization between threads to make sure that one actually does run after the other (see footnote 1). Otherwise all bets are off; starting one thread before another does not ensure that its code executes first.

Footnote 1: e.g. having one spin-wait until it sees an atomic store done by the other. Or use a mutex and run rdtscp in a critical section, along with a variable to record whether the other thread was already there.
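
The spin-wait idea from the footnote can be sketched roughly as follows. This is only an illustration, assuming Linux, x86-64, and GCC or Clang; `count_tsc_regressions`, `pin_to_cpu`, and `HandoffResult` are hypothetical names, not part of the original question or answer. One thread reads the TSC and publishes the value through an atomic; the other spins until it sees that store and only then takes its own reading, so the second reading really does happen after the first.

```cpp
// Sketch only: assumes Linux, x86-64, and GCC or Clang.
// Build: g++ -O2 -pthread tsc_order.cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <pthread.h>
#include <sched.h>
#include <x86intrin.h>

struct HandoffResult {
    int regressions;  // rounds where the later read was smaller than the earlier one
    bool pinned;      // false if either thread could not be pinned as requested
};

// Pin the calling thread to one logical CPU; returns false if the kernel refuses.
static bool pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}

// Hypothetical helper: hands a TSC value from cpu_a to cpu_b `rounds` times.
HandoffResult count_tsc_regressions(int cpu_a, int cpu_b, int rounds) {
    assert(rounds > 0);
    std::atomic<std::uint64_t> slot{0};  // 0 means "empty"; assumes a real TSC read is never 0
    std::atomic<bool> pinned{true};
    int regressions = 0;                 // written only by the second thread

    std::thread first([&] {
        if (!pin_to_cpu(cpu_a)) pinned.store(false);
        for (int i = 0; i < rounds; ++i) {
            // Wait until the other thread has emptied the slot.
            while (slot.load(std::memory_order_acquire) != 0) _mm_pause();
            unsigned aux;
            std::uint64_t t1 = __rdtscp(&aux);          // earlier read
            slot.store(t1, std::memory_order_release);  // publish it
        }
    });
    std::thread second([&] {
        if (!pin_to_cpu(cpu_b)) pinned.store(false);
        for (int i = 0; i < rounds; ++i) {
            std::uint64_t t1;
            // Spin until the first thread's store becomes visible.
            while ((t1 = slot.load(std::memory_order_acquire)) == 0) _mm_pause();
            unsigned aux;
            std::uint64_t t2 = __rdtscp(&aux);          // taken after seeing the store
            if (t2 < t1) ++regressions;
            slot.store(0, std::memory_order_release);   // hand the slot back
        }
    });
    first.join();
    second.join();
    return {regressions, pinned.load()};
}
```

Calling something like `count_tsc_regressions(0, 1, 1000000)` from `main` and printing `regressions` gives a count that should stay at zero on a system whose TSCs are synced. A nonzero count would point to out-of-sync TSCs, while a zero count is evidence rather than proof. If `pinned` comes back false, the two threads were not necessarily on the requested cores.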


On anything non-ancient (Core2 and newer at least), the TSC ticks at a constant frequency (the "reference" frequency). See this answer for links and details about the constant_tsc / nonstop_tsc CPU features, and the possibility of the TSC not being synced.
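
On Linux, constant_tsc and nonstop_tsc appear as words on the `flags` line of /proc/cpuinfo. A minimal sketch of checking for them is below; `cpuinfo_has_flag` is a hypothetical helper, and it only inspects the first `flags` line it finds. Note that these flags describe how the counter ticks, not whether it is synced across cores.

```cpp
// Sketch only: parses text in the format Linux uses for /proc/cpuinfo on x86.
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Returns true if the first "flags" line in the stream lists `flag`
// as a whole word (so "tsc" does not match "nonstop_tsc").
bool cpuinfo_has_flag(std::istream& in, const std::string& flag) {
    assert(!flag.empty());
    std::string line;
    while (std::getline(in, line)) {
        if (line.compare(0, 5, "flags") != 0) continue;  // not a flags line
        std::size_t colon = line.find(':');
        if (colon == std::string::npos) continue;
        std::istringstream words(line.substr(colon + 1));
        std::string word;
        while (words >> word) {
            if (word == flag) return true;
        }
        return false;  // only the first logical CPU's flags are inspected
    }
    return false;      // no flags line found
}
```

To use it, open /proc/cpuinfo with a `std::ifstream` and pass the stream in. The function consumes the stream, so reopen the file before checking a second flag.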

I think most modern systems do in practice have the TSC synced between cores, thanks to motherboard vendors making sure that even on multi-socket systems the RESET signal is distributed to all cores at the same time, and to firmware and OS software taking care not to screw it up. This is much easier on a single-socket system like a normal desktop with a multicore CPU, where all the "extra" cores are on the same chip.

But this is not guaranteed, and part of why rdtscp exists (with a processor ID output) is this possibility (which I think might have been more common on older systems when RDTSCP was new).
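
As a rough illustration of how the processor ID output can be used: take two readings and only trust the difference if both came from the same logical CPU. The sketch below assumes x86-64 with GCC or Clang. The names `decode_tsc_aux`, `measure`, `CpuNode`, and `Interval` are hypothetical, and the bit layout (CPU number in the low 12 bits, NUMA node above that) is the convention Linux uses when it programs IA32_TSC_AUX; other operating systems may store something else there.

```cpp
// Sketch only: assumes x86-64 with GCC or Clang.
#include <cassert>
#include <cstdint>
#include <x86intrin.h>

struct CpuNode {
    unsigned cpu;   // logical CPU number
    unsigned node;  // NUMA node number
};

// Assumption: Linux convention, IA32_TSC_AUX = (node << 12) | cpu.
CpuNode decode_tsc_aux(unsigned aux) {
    return {aux & 0xfffu, aux >> 12};
}

struct Interval {
    std::uint64_t ticks;  // TSC difference; only meaningful if same_cpu is true
    bool same_cpu;        // false if the thread migrated between the two reads
};

// Times `work` with two rdtscp reads and reports whether both reads
// returned the same processor ID.
template <typename F>
Interval measure(F&& work) {
    unsigned aux_before = 0, aux_after = 0;
    std::uint64_t t0 = __rdtscp(&aux_before);
    work();
    std::uint64_t t1 = __rdtscp(&aux_after);
    return {t1 - t0, aux_before == aux_after};
}
```

A caller would typically discard or retry any measurement where `same_cpu` is false, since the two readings may then come from counters that are not synced.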

There are even CPU features VMs can use to offset and scale the TSC transparently (with HW support), to migrate VMs between physical machines while preserving monotonicity and frequency of the TSC. Using these features indiscriminately can of course produce desynced TSCs or even ones that run at different frequencies on different cores.


The TSC is a 64-bit counter that usually counts at the CPU's rated sticker frequency. That can be over ~4.2 GHz (about 2^32 Hz) on some CPUs, which leaves the high half incrementing about once per second on fast CPUs. The TSC can in theory wrap if the computer has been "up" for over 2^32 seconds (roughly 136 years), or if the TSC has been manually set to have a big offset.
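
The wraparound arithmetic can be checked with a few lines. This is a sketch with hypothetical helper names, assuming a counter that starts at zero and ticks at a fixed rate. At exactly 2^32 Hz the counter wraps after 2^64 / 2^32 = 2^32 seconds, which is about 136 years.

```cpp
// Sketch only: plain arithmetic, no hardware access.
#include <cassert>

// 2^64 and 2^32 as doubles (both are exactly representable).
constexpr double kTwoPow64 = 18446744073709551616.0;
constexpr double kTwoPow32 = 4294967296.0;
constexpr double kSecondsPerYear = 365.25 * 24 * 3600;

// Years until a 64-bit counter that started at zero wraps, ticking at `hz`.
double years_until_wrap(double hz) {
    assert(hz > 0.0);
    return kTwoPow64 / hz / kSecondsPerYear;
}

// Seconds between increments of the upper 32 bits of the counter.
double high_half_period_seconds(double hz) {
    assert(hz > 0.0);
    return kTwoPow32 / hz;
}
```

A slower TSC frequency only lengthens both figures, so wraparound from uptime alone is not a practical concern. A manually applied offset is the realistic way to get near the top of the range.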

Peter Cordes