
I work on a programming language profiler and I am looking for a timer solution for Windows with better than 100 ns resolution.

  • QueryPerformanceCounter should be the answer, but the frequency returned by QueryPerformanceFrequency is 10 MHz on Windows 10 and even lower on Windows 7 (see the sketch after this list)

  • GetSystemTimePreciseAsFileTime has a 100 ns tick/step

  • RDTSC has resolution better than 1 ns, but it varies with the CPU frequency
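
For reference, here is a minimal sketch (Win32, MSVC) of how the first two figures can be observed; the printed values of course depend on the machine:

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    // QueryPerformanceFrequency reports the QPC tick rate; on many
    // Windows 10 machines this is 10 MHz, i.e. 100 ns per tick.
    LARGE_INTEGER freq;
    QueryPerformanceFrequency(&freq);
    printf("QPC frequency: %lld Hz (%.1f ns per tick)\n",
           freq.QuadPart, 1e9 / freq.QuadPart);

    // GetSystemTimePreciseAsFileTime returns a FILETIME, which is defined
    // in 100 ns units, so its step can never be finer than 100 ns.
    FILETIME ft;
    GetSystemTimePreciseAsFileTime(&ft);
    ULARGE_INTEGER t;
    t.LowPart = ft.dwLowDateTime;
    t.HighPart = ft.dwHighDateTime;
    printf("GetSystemTimePreciseAsFileTime: %llu (100 ns units)\n", t.QuadPart);
}
```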

My target resolution is at least 10 ns.

What is currently the best solution?

How is QueryPerformanceCounter implemented? Can it be easily disassembled and the resolution increased?

Is it somehow possible to use RDTSC directly and track/interrupt on every frequency change?

mvorisek
  • Run the target several thousand/million times and divide by the number of runs? Use an external hardware timer (this is a really good idea if you're timing response times to external stimuli - "how long after this input is received is the output produced?")? – JohnFilleau Aug 01 '20 at 12:41
  • Averaging might help when `RDTSC` is used, but if the frequency scales non-randomly (due to thermal throttling), it will not help. Averaging `QueryPerformanceCounter` will not help significantly either, as the shortest events are about 100 ns. It is a software profiler that should run on general hardware, so a hardware timer is also not an option. – mvorisek Aug 01 '20 at 12:53
  • https://www.intel.com/content/www/us/en/embedded/training/ia-32-ia-64-benchmark-code-execution-paper.html – 0___________ Aug 01 '20 at 13:03
  • Are you interested in profiling the number of clock cycles to complete some work, the wall time (real world end time minus real world start time), or cpu time (time spent doing work on the CPU)? If you're concerned about dynamic frequency scaling, then putting the program under a large load of a million-strong loop might even help you get a more conservative estimate. I don't suggest setting a million timers, adding them up, then dividing. I suggest setting ONE timer, starting it, doing a million runs, and then ending it (sketched below, after these comments). If the extra loop time is not negligible, add more work to fix that. – JohnFilleau Aug 01 '20 at 13:06
  • Also, are you interested in making Real Time Guarantees™ about this profiling? Because if so, there be dragons when using frequency scaling, multitasking, pipelining, branch prediction, and a non-real time operating system, etc. The only sane way I've heard to make RTGs™ is to turn OFF all of the optimization techniques (which may not be available on your processor) but sometimes it's necessary because a task arriving too soon can cause your scheduling algorithm to get confused (especially if you're running harmonic periods at 100% load). – JohnFilleau Aug 01 '20 at 13:10
  • While @P__J__ 's link is targeted at Linux systems, there may be some useful information in there theory wise. +1 for a technical document. – JohnFilleau Aug 01 '20 at 13:18
  • @JohnFilleau same on Windows - it's easier to get the register with `__rdtsc();` Modern x86 uPs keep this clock frequency stable – 0___________ Aug 01 '20 at 13:42
  • @mvorisek `but if the frequency will scale not randomly` - no, that was correct 20 years ago. Nowadays it is a stable frequency. Read the chip documentation for details – 0___________ Aug 01 '20 at 13:46
  • @P__J__ the paper you linked specifically says frequency scaling had to be turned off? "All power optimization, Intel Hyper-Threading technology, frequency scaling and turbo mode functionalities were turned off." – JohnFilleau Aug 01 '20 at 13:48
  • @JohnFilleau read about the uPs with the `TscInvariant` bit set. Then you will learn that this counter does not depend on the core clock, which may vary. Read more about it; there are lots of resources about it. Now you do not have to turn anything off – 0___________ Aug 01 '20 at 14:57
  • https://learn.microsoft.com/en-us/windows/win32/api/realtimeapiset/nf-realtimeapiset-querythreadcycletime – Hans Passant Aug 01 '20 at 15:05
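
The batching approach suggested in the comments, combined with `QueryThreadCycleTime` from the last link, might look roughly like this (a sketch; `work()` and the run count are placeholders, not anything from the discussion above):

```cpp
#include <windows.h>
#include <cstdio>

// Placeholder for the code under measurement.
static void work() { /* ... */ }

int main() {
    const int kRuns = 1'000'000;

    // Wall-clock time via QPC around the whole batch...
    LARGE_INTEGER freq, qpcStart, qpcEnd;
    QueryPerformanceFrequency(&freq);

    // ...and CPU cycles charged to this thread via QueryThreadCycleTime,
    // which excludes time spent while the thread was preempted.
    ULONG64 cyclesStart = 0, cyclesEnd = 0;

    QueryPerformanceCounter(&qpcStart);
    QueryThreadCycleTime(GetCurrentThread(), &cyclesStart);

    for (int i = 0; i < kRuns; ++i)
        work();

    QueryThreadCycleTime(GetCurrentThread(), &cyclesEnd);
    QueryPerformanceCounter(&qpcEnd);

    double wallNs = (qpcEnd.QuadPart - qpcStart.QuadPart) * 1e9 / freq.QuadPart;
    printf("wall time per run:     %.2f ns\n", wallNs / kRuns);
    printf("thread cycles per run: %.1f\n",
           double(cyclesEnd - cyclesStart) / kRuns);
}
```

With one timer around many runs, the ~100 ns QPC granularity and the per-call overhead are amortized over the whole batch.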

1 Answer


How is QueryPerformanceCounter implemented?

The QPC timer has different implementations in the HAL depending on the hardware; it uses the TSC, HPET, RTC, APIC, ACPI, or 8254 timer, whichever is available.

The QPC timer resolution is hardcoded to 100 ns. But it doesn't matter much, because a call to QPC itself takes >100 ns. 100 ns is just a very, very short amount of time in the Windows world.
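
A quick way to gauge that per-call cost on a particular machine is to time a batch of back-to-back QPC calls. A minimal sketch (the figure varies a lot by hardware, as the comments under this answer show):

```cpp
#include <windows.h>
#include <cstdio>

int main() {
    const int kCalls = 1'000'000;

    LARGE_INTEGER freq, start, end, dummy;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    for (int i = 0; i < kCalls; ++i)
        QueryPerformanceCounter(&dummy);   // the call being measured
    QueryPerformanceCounter(&end);

    double ns = (end.QuadPart - start.QuadPart) * 1e9 / freq.QuadPart;
    printf("QueryPerformanceCounter: ~%.0f ns per call\n", ns / kCalls);
}
```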

RDTSC has resolution better than 1 ns, but it varies with the CPU frequency

Not really; the TSC frequency has actually been pretty stable since Nehalem. See the Intel 64 Architecture SDM, vol. 3A, section 17.16 "Invariant TSC":

Processor families increment the time-stamp counter differently:

  • For Pentium M processors (family [06H], models [09H, 0DH]); for Pentium 4 processors, Intel Xeon processors (family [0FH], models [00H, 01H, or 02H]); and for P6 family processors: the time-stamp counter increments with every internal processor clock cycle. The internal processor clock cycle is determined by the current core-clock to bus-clock ratio. Intel SpeedStep technology transitions may also impact the processor clock.

  • For Intel Xeon processors (family [0FH], models [03H and higher]); for Intel Core Solo and Intel Core Duo processors (family [06H], model [0EH]); for the Intel Xeon processor 5100 series and Intel Core 2 Duo processors (family [06H], model [0FH]); for Intel Core 2 and Intel Xeon processors (family [06H], DisplayModel [17H]); for Intel Atom processors (family [06H], DisplayModel [1CH]): the time-stamp counter increments at a constant rate. That rate may be set by the maximum core-clock to bus-clock ratio of the processor or may be set by the maximum resolved frequency at which the processor is booted. The maximum resolved frequency may differ from the processor base frequency, see Section 18.18.2 for more detail. On certain processors, the TSC frequency may not be the same as the frequency in the brand string.

The time stamp counter in newer processors may support an enhancement, referred to as invariant TSC. Processor’s support for invariant TSC is indicated by CPUID.80000007H:EDX[8]. The invariant TSC will run at a constant rate in all ACPI P-, C-, and T-states. This is the architectural behavior moving forward. On processors with invariant TSC support, the OS may use the TSC for wall clock timer services (instead of ACPI or HPET timers). TSC reads are much more efficient and do not incur the overhead associated with a ring transition or access to a platform resource.
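
A minimal sketch of checking the CPUID.80000007H:EDX[8] bit mentioned above, using the MSVC `__cpuid` intrinsic:

```cpp
#include <intrin.h>
#include <cstdio>

int main() {
    int info[4] = {};

    // Check that extended leaf 0x80000007 exists at all.
    __cpuid(info, 0x80000000);
    if (static_cast<unsigned>(info[0]) < 0x80000007u) {
        printf("CPUID leaf 80000007H not supported\n");
        return 1;
    }

    // CPUID.80000007H:EDX[8] is the invariant-TSC bit quoted above.
    __cpuid(info, 0x80000007);
    bool invariantTsc = (info[3] >> 8) & 1;   // info[3] holds EDX
    printf("Invariant TSC: %s\n", invariantTsc ? "yes" : "no");
}
```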

So for quick measurements you should be able to use __rdtsc or __rdtscp. You can calibrate the TSC frequency at startup and verify that it doesn't depend on CPU states. The thread could still be preempted, though, so it's good to repeat the measurement multiple times or use QueryThreadCycleTime (though of course that comes with its own overhead). In practice I find RDTSC not as bad as it is presented in Calculate system time using rdtsc, though the latter is still a good read.
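
A calibrate-then-measure sketch along those lines might look like this (the ~50 ms calibration window and the `work()` placeholder are arbitrary choices for illustration, not anything prescribed above):

```cpp
#include <windows.h>
#include <intrin.h>
#include <cstdio>

// One-off startup calibration: count TSC ticks across a known QPC interval.
// A real profiler would likely use a longer window and/or repeat the
// calibration to reduce the error.
static double CalibrateTscHz() {
    LARGE_INTEGER freq, start, now;
    QueryPerformanceFrequency(&freq);

    QueryPerformanceCounter(&start);
    unsigned __int64 tscStart = __rdtsc();

    // Spin for roughly 50 ms of QPC time.
    do {
        QueryPerformanceCounter(&now);
    } while (now.QuadPart - start.QuadPart < freq.QuadPart / 20);

    unsigned __int64 tscEnd = __rdtsc();
    double seconds = double(now.QuadPart - start.QuadPart) / freq.QuadPart;
    return (tscEnd - tscStart) / seconds;
}

// Placeholder for the code under measurement.
static void work() { /* ... */ }

int main() {
    const double tscHz = CalibrateTscHz();
    printf("TSC frequency: %.0f Hz\n", tscHz);

    unsigned int aux;
    // __rdtscp waits for earlier instructions to finish before reading the TSC.
    unsigned __int64 t0 = __rdtscp(&aux);
    work();
    unsigned __int64 t1 = __rdtscp(&aux);

    printf("work(): %.1f ns\n", (t1 - t0) * 1e9 / tscHz);
}
```

Pinning the measuring thread to a single core (e.g. with SetThreadAffinityMask) is a common extra precaution, and repeating the measurement helps filter out samples distorted by preemption.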

rustyx
  • Thank you for your answer! I will test these areas more. FYI `QueryPerformanceCounter` takes about 15 ns per call – mvorisek Aug 01 '20 at 17:25
  • That's interesting. On the system I tested last time, i7-7700HQ, it was taking 120ns per call. – rustyx Aug 01 '20 at 19:33
  • Nehalem apparently introduced the feature of the TSC not stopping in any low-power C-state ([Can constant non-invariant tsc change frequency across cpu states?](https://stackoverflow.com/q/62492053)). The doc you quoted confirms that even Core 1 and Core 2, and all x86-64 Intel, had fixed TSC frequency (when it is ticking). And yes, TSC measures wall-clock time, not cpu-time for your process, because of possible interrupts and context switches. HW perf counters (`rdpmc`) programmed to only tick in user-space can do better, but that's harder to set up (especially on Windows I assume). – Peter Cordes Aug 01 '20 at 21:34