
Can different processes run RDTSC at the same time? Or is this a resource that only one core can use at a time? The TSC exists in every core (at least you can adjust it separately for every core), so it should be possible. But what about Hyper-Threading?

How can I test this?


1 Answer


Each physical core has its own TSC; the microcode doesn't have to go off-core, so there's no shared resource that the cores compete for. Going off-core at all would make it much slower and would make the implementation more complex. Having a counter physically inside each core is the simpler implementation: each core just counts ticks of a reference-clock signal that's distributed to all cores.

With HyperThreading, the logical cores sharing a physical core always compete for execution resources. From Agner Fog's instruction tables, we know that RDTSC on Skylake is 20 uops for the front-end, and has one per 25 cycles throughput. At 20 uops per 25 cycles, that's only 0.8 uops per clock while executing nothing but RDTSC instructions, so competing for the front-end (which can issue 4 uops per clock on Skylake) is probably not a problem.

Probably most of those uops can run on any execution port, so it's quite possible that both logical threads can run rdtsc with that throughput.

But maybe there's a not-fully-pipelined execution unit that they'd compete for.

You can test it by putting times 20 rdtsc inside a loop that runs a few tens of millions of iterations, running that microbenchmark on a core by itself, and then running two instances pinned to the two logical cores of one physical core.

I got curious and did that myself on Linux with perf on a Skylake i7-6700k, using taskset -c 3 and taskset -c 7. (The way Linux enumerates the cores on this CPU, those numbers are the two logical cores of the 4th physical core; check /proc/cpuinfo to find out on your system.)
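On a typical Linux system you can also ask sysfs directly which logical CPUs share a physical core (assuming the usual topology files are present):

    # which logical CPUs are HT siblings of CPU 3? (prints e.g. "3,7")
    cat /sys/devices/system/cpu/cpu3/topology/thread_siblings_list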

To avoid interleaving the output lines if both runs finish nearly simultaneously, I used bash process substitution, cat <(cmd1) <(cmd2), to run them both simultaneously and get the output printed in a fixed order. Each command was taskset -c N perf stat -etask-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u -r2 ./testloop, counting core clock cycles (not reference cycles) so I don't have to be paranoid about turbo / idle clock frequencies.
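Concretely, the paired invocation looks something like the following ($events is just shorthand for the event list above; the 2>&1 is needed because perf stat prints its report to stderr):

    events=task-clock:u,context-switches,cpu-migrations,page-faults,cycles:u,instructions:u,branches:u,branch-misses:u,uops_issued.any:u,uops_executed.thread:u,cpu_clk_thread_unhalted.one_thread_active:u
    cat <(taskset -c 3 perf stat -e "$events" -r2 ./testloop 2>&1) \
        <(taskset -c 7 perf stat -e "$events" -r2 ./testloop 2>&1)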

testloop is a static executable with a hand-written asm loop containing times 20 rdtsc (NASM repeat operator) and dec ebp/jnz, with the top of the loop aligned by 64 in case that ever matters. Before the loop, mov ebp, 10000000 initializes the counter. (See Can x86's MOV really be "free"? Why can't I reproduce this at all? for details on how I do microbenchmarks this way. Or Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths for another example of a simple NASM program with a loop that uses times to repeat instructions.)
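Putting those pieces together, a minimal sketch of the NASM source could look like this (my reconstruction from the description above, not the exact file; the ending is the usual raw exit_group system call):

    global _start
    section .text
    _start:
        mov     ebp, 10000000     ; loop counter: 10M iterations
        align   64, nop           ; align top of loop by 64, padding with NOPs
    .loop:
        times 20 rdtsc            ; NASM's times prefix repeats the instruction 20 times
        dec     ebp
        jnz     .loop

        mov     eax, 231          ; __NR_exit_group on x86-64 Linux
        xor     edi, edi          ; exit status 0
        syscall

Assemble and link it as a static executable with nasm -felf64 testloop.asm && ld -o testloop testloop.o.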

 Performance counter stats for './testloop' (2 runs):

          1,278.19 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.19% )
                 4      context-switches          #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                  
     5,243,270,118      cycles:u                  #    4.102 GHz                      ( +-  0.01% )  (71.37%)
       219,949,542      instructions:u            #    0.04  insn per cycle           ( +-  0.01% )  (85.68%)
        10,000,692      branches:u                #    7.824 M/sec                    ( +-  0.03% )  (85.68%)
                32      branch-misses:u           #    0.00% of all branches          ( +- 93.65% )  (85.68%)
     4,010,798,914      uops_issued.any:u         # 3137.885 M/sec                    ( +-  0.01% )  (85.68%)
     4,010,969,168      uops_executed.thread:u    # 3138.018 M/sec                    ( +-  0.00% )  (85.78%)
                 0      cpu_clk_thread_unhalted.one_thread_active:u #    0.000 K/sec                    (57.17%)

           1.27854 +- 0.00256 seconds time elapsed  ( +-  0.20% )


 Performance counter stats for './testloop' (2 runs):

          1,278.26 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.18% )
                 6      context-switches          #    0.004 K/sec                    ( +-  9.09% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                    ( +- 20.00% )
     5,245,894,686      cycles:u                  #    4.104 GHz                      ( +-  0.02% )  (71.27%)
       220,011,812      instructions:u            #    0.04  insn per cycle           ( +-  0.02% )  (85.68%)
         9,998,783      branches:u                #    7.822 M/sec                    ( +-  0.01% )  (85.68%)
                23      branch-misses:u           #    0.00% of all branches          ( +- 91.30% )  (85.69%)
     4,010,860,476      uops_issued.any:u         # 3137.746 M/sec                    ( +-  0.01% )  (85.68%)
     4,012,085,938      uops_executed.thread:u    # 3138.704 M/sec                    ( +-  0.02% )  (85.79%)
             4,174      cpu_clk_thread_unhalted.one_thread_active:u #    0.003 M/sec                    ( +-  9.91% )  (57.15%)

           1.27876 +- 0.00265 seconds time elapsed  ( +-  0.21% )

vs. running alone:

 Performance counter stats for './testloop' (2 runs):

          1,223.55 msec task-clock:u              #    1.000 CPUs utilized            ( +-  0.52% )
                 4      context-switches          #    0.004 K/sec                    ( +- 11.11% )
                 0      cpu-migrations            #    0.000 K/sec                  
                 2      page-faults               #    0.002 K/sec                  
     5,003,825,966      cycles:u                  #    4.090 GHz                      ( +-  0.00% )  (71.31%)
       219,905,884      instructions:u            #    0.04  insn per cycle           ( +-  0.04% )  (85.66%)
        10,001,852      branches:u                #    8.174 M/sec                    ( +-  0.04% )  (85.66%)
                17      branch-misses:u           #    0.00% of all branches          ( +- 52.94% )  (85.78%)
     4,012,165,560      uops_issued.any:u         # 3279.113 M/sec                    ( +-  0.03% )  (85.78%)
     4,010,429,819      uops_executed.thread:u    # 3277.694 M/sec                    ( +-  0.01% )  (85.78%)
        28,452,608      cpu_clk_thread_unhalted.one_thread_active:u #   23.254 M/sec                    ( +-  0.20% )  (57.01%)

           1.22396 +- 0.00660 seconds time elapsed  ( +-  0.54% )

(The cpu_clk_thread_unhalted.one_thread_active:u counter apparently only ticks at some fairly slow rate; the system was mostly idle during this test, so the task should have had the core to itself the whole time. I.e. that ~23.2 M counts / sec does represent single-thread mode.)

By contrast, the 0 and near-0 counts for the paired runs show that I succeeded in having these tasks run simultaneously on the same physical core, with hyperthreading, for basically the whole time (~1.2 seconds repeated twice, i.e. 2.4 seconds).

So 5.0038G cycles / 10M iters / 20 rdtsc/iter = 25.019 cycles per RDTSC single-threaded, pretty much what Agner Fog measured.

Averaging across both processes for the HT test, that's about 5.244G cycles / 10M iters / 20 rdtsc/iter = 26.22 cycles per RDTSC on average, only about 5% slower than running alone.

So running RDTSC on both logical cores simultaneously on Skylake gives a nearly linear speedup in aggregate throughput, with very minimal competition. Whatever RDTSC bottlenecks on, it's not something the two threads compete for or slow each other down with.

Having the other logical core busy running high-throughput code (code that could sustain 4 uops per clock if it had the core to itself) would probably hurt an RDTSC thread more than another thread that's also just running RDTSC would. Maybe we could even figure out whether there's one specific port that RDTSC needs more than the others; e.g. port 1 is easy to saturate, because it's the only port that can run integer multiply instructions.
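For example, a hypothetical sibling-thread workload (an untested sketch in the same NASM style, not something measured in this answer) could saturate port 1 with independent integer multiplies, so it's throughput-bound on the port rather than serialized by one dependency chain:

    global _start
    section .text
    _start:
        mov     ebp, 100000000    ; more iterations: this loop runs faster per iteration
        align   64, nop
    .loop:
        %rep 5                    ; 20 imuls per iteration, mirroring the rdtsc loop
        imul    eax, r9d          ; four separate destinations = four short dep chains
        imul    ecx, r9d          ; (15 cycles each per iteration), so the port-1
        imul    edx, r9d          ; throughput limit of 1 imul/clock (20 cycles/iter)
        imul    esi, r9d          ; is the bottleneck
        %endrep
        dec     ebp
        jnz     .loop

        mov     eax, 231          ; exit_group(0)
        xor     edi, edi
        syscall

Pin that to one logical core (e.g. taskset -c 7) while the RDTSC loop runs on its sibling (taskset -c 3), and compare the RDTSC thread's cycles per iteration against the RDTSC-vs-RDTSC numbers above.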

Peter Cordes
  • Can you post the full `testloop` code? And what is the `times` instruction? I can't find anything, probably because of its ambiguous name. – kuga Jun 04 '19 at 10:35
  • `times` is a NASM operator that repeats the instruction that many times. Like I said in my answer, [Can x86's MOV really be "free"? Why can't I reproduce this at all?](//stackoverflow.com/q/44169342) has the full source code, just replace the loop body with `times 20 rdtsc`. – Peter Cordes Jun 04 '19 at 17:32
  • It's amazing that something as superficially simple as `rdtsc` takes 20 μops. Anyone have an idea why this is the case? I would have expected it to just be reading some timestamp register. – Brennan Vincent Jun 05 '19 at 00:53
  • @BrennanVincent: I was surprised, too. Maybe it has something to do with virtualization being able to scale and offset it for guest VMs? (And even if not running inside a VM, it always decodes the same way.) – Peter Cordes Jun 05 '19 at 00:55
  • I don't know how recent the CPU model needs to be, but my laptop's `Intel Pentium T4300` with 2 cores increments `RDTSC` at different rates: sometimes `0.48 ns/cycle` (most of the time), sometimes `0.96 ns/cycle` (rarely, when overheated). I measured this with a C++ program and the `__rdtsc()` intrinsic. I bought my laptop around 2008. – Arty Feb 09 '21 at 13:40
  • Do you happen to know what else besides `RDTSC` I can use to measure precise time? I need an operation that 1) is very fast, taking just a few CPU cycles at most, 2) is very precise, with precision of no more than 5-10 nanoseconds, 3) has the same (unchangeable) frequency on all CPUs (even old ones), and 4) can measure in any unit of time (nanoseconds, cycles, ticks, etc.), as long as I can convert it to nanoseconds. – Arty Feb 09 '21 at 14:43
  • @Arty: First part answered in comments [How to get the CPU cycle count in x86\_64 from C++?](https://stackoverflow.com/posts/comments/116911171) where you mentioned the same thing. Besides RDTSC, there's also RDPMC which can be lower overhead, but the "cycles" event will be core cycles (and thus variable with frequency). RDTSC is the best you can get; most systems never throttle so `constant_tsc` is sufficient for fine-grained offsets from the last timer increment, as long as the CPU doesn't go to sleep. – Peter Cordes Feb 09 '21 at 21:16
  • @Arty: Before constant_tsc, there's no good option (both precise and low overhead); that's *why* CPU vendors repurposed RDTSC to be usable as a time source even with variable CPU frequency, because of software demand for such. Newer CPUs with `invariant_tsc` make it even better, not halting during sleep states. – Peter Cordes Feb 09 '21 at 21:17