What happens when QueryPerformanceCounter is called?

Question

I'm looking into the exact implications of using QueryPerformanceCounter in our system and am trying to understand its impact on the application. Running it on my 4-core, single-CPU machine, I can see that a call takes around 230ns. On a 24-core, 4-CPU Xeon it takes around 1.4ms. More interestingly, on my machine, calls made from multiple threads don't affect each other, but on the multi-CPU machine the threads interact in some way that causes them to block each other. I'm wondering whether there is some shared resource on the bus that they all query. What exactly happens when I call QueryPerformanceCounter, and what does it really measure?
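The measurement described above can be sketched roughly as follows. This is only an illustration, not the asker's actual test code: `read_counter` and `avg_call_ns` are hypothetical names, and the non-Windows branch is a stand-in so the sketch builds elsewhere.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>
#include <vector>

#ifdef _WIN32
#include <windows.h>
// Read the counter through the real Win32 API.
static inline std::int64_t read_counter() {
    LARGE_INTEGER t;
    QueryPerformanceCounter(&t);
    return t.QuadPart;
}
#else
// Assumption: stand-in clock so the sketch also compiles off Windows.
static inline std::int64_t read_counter() {
    return std::chrono::steady_clock::now().time_since_epoch().count();
}
#endif

// Wall-clock nanoseconds per read_counter() call when `threads` threads
// each make `calls` calls at the same time (thread startup is included).
double avg_call_ns(int threads, int calls) {
    std::atomic<std::int64_t> sink{0};  // keeps the reads from being optimized away
    auto worker = [&] {
        std::int64_t acc = 0;
        for (int i = 0; i < calls; ++i) acc ^= read_counter();
        sink.fetch_xor(acc);
    };
    auto start = std::chrono::steady_clock::now();
    std::vector<std::thread> pool;
    for (int i = 1; i < threads; ++i) pool.emplace_back(worker);  // extra threads
    worker();                                                     // calling thread
    for (auto& t : pool) t.join();
    auto elapsed = std::chrono::steady_clock::now() - start;
    return std::chrono::duration<double, std::nano>(elapsed).count() / calls;
}
```

If the threads do not interfere, `avg_call_ns(1, n)` and `avg_call_ns(4, n)` should come out close to each other on a machine with at least four cores; a figure that grows with the thread count would match the blocking described in the question.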

omfg, 1.4**ms**! Yep, now that's a good question. According to http://msdn.microsoft.com/en-us/library/windows/desktop/dn553408(v=vs.85).aspx, 800ns is given as an example of a badly behaved machine. 1.4ms would be HYPER bad – v.oddou, Jul 30 '14 at 03:09
I was also bitten by this while writing a profiler, where the timer is queried each time a function starts and returns. Using `QueryPerformanceCounter` slows the program down to a crawl. Using `GetTickCount` doesn't cause a noticeable slowdown, but it's unusable for accurate profiling... – Calmarius, Dec 30 '14 at 23:48

score 10 · Accepted Answer · answered Nov 12 '09 at 17:04

Windows QueryPerformanceCounter() has logic to determine the number of processors and invoke synchronization logic if necessary. It attempts to use the TSC register, but on multiprocessor systems this register is not guaranteed to be synchronized between processors (and, more importantly, can vary greatly due to intelligent downclocking and sleep states).

MSDN says that it doesn't matter which processor this is called on, so the overhead you are seeing may come from extra synchronization code for that situation. Also remember that it can invoke a bus transfer, so you may be seeing bus contention delays.

Try using SetThreadAffinityMask() if possible to bind it to a specific processor. Otherwise you might just have to live with the delay or you could try a different timer (for example take a look at http://en.wikipedia.org/wiki/High_Precision_Event_Timer).
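A minimal sketch of the affinity suggestion, assuming the timing thread can stay on one logical processor. `single_cpu_mask` and `pin_current_thread` are hypothetical helper names; only `SetThreadAffinityMask` and `GetCurrentThread` are real Win32 calls.

```cpp
#include <cstdint>

#ifdef _WIN32
#include <windows.h>
#endif

// Build an affinity mask with only bit `cpu` set.
// Returns 0 for cpu >= 64, which does not fit in a 64-bit mask.
std::uint64_t single_cpu_mask(unsigned cpu) {
    return cpu < 64 ? (std::uint64_t{1} << cpu) : 0;
}

// Pin the calling thread to one logical processor; true on success.
// Assumption: off Windows this is a no-op that returns false.
bool pin_current_thread(unsigned cpu) {
#ifdef _WIN32
    DWORD_PTR mask = static_cast<DWORD_PTR>(single_cpu_mask(cpu));
    // SetThreadAffinityMask returns the previous mask, or 0 on failure.
    return SetThreadAffinityMask(GetCurrentThread(), mask) != 0;
#else
    (void)cpu;
    return false;
#endif
}
```

Calling `pin_current_thread(0)` before the timing loop keeps every counter read on the same processor. Note that the mask only addresses processors within the thread's current processor group, so on machines with more than 64 logical processors it cannot reach all of them.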

istudy0 · Answer 2 · 2010-09-15T19:55:10.500

I know that this thread is a bit old, but I would like to add more info. First, I do agree that QueryPerformanceCounter can take more time on certain machines, but I am not sure that Ron's answer explains it in every case. While researching this issue, I found various web pages that talk about how QueryPerformanceCounter is implemented. For instance, "Precision is not the same as accuracy" tells me that Windows (the HAL, to be more specific) may use different timing devices to obtain the value. This means that if Windows ends up using a slower timing device such as the PIT, it will take more time to obtain the time value. Using the PIT might require a PCI transaction, so that would be one reason.

I also found another article, "How It Works: Timer Outputs in SQL Server 2008 R2 - Invariant TSC", which gives a similar description. In fact, this article describes how SQL Server times transactions in the best way available.

Then I found more information on the VMware site, because I had to deal with customers who use VMs, and I found that there are other issues with time measurement in VMs. For those who are interested, please refer to the VMware paper "Timekeeping in VMware Virtual Machines". This paper also talks about how some versions of Windows synchronize the TSCs. Thus, it would be safe to use QueryPerformanceCounter() in certain situations, and I think we should try something like what "How It Works: Timer Outputs in SQL Server 2008 R2" suggests to find out what might happen when we call QueryPerformanceCounter().
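One rough way to guess which timing device the HAL picked is to look at the value reported by QueryPerformanceFrequency, since the classic hardware timers run at well-known nominal rates. The sketch below is a heuristic under that assumption, not something any API reports; `guess_timer_source` and `qpc_frequency_hz` are hypothetical names, and an unrecognized frequency does not prove that the TSC is in use.

```cpp
#include <cstdint>
#include <string>

#ifdef _WIN32
#include <windows.h>
// Counter frequency in Hz as reported by the real Win32 API.
std::int64_t qpc_frequency_hz() {
    LARGE_INTEGER f;
    QueryPerformanceFrequency(&f);
    return f.QuadPart;
}
#endif

// Heuristic only: map a reported frequency (Hz) to the hardware timer
// that nominally runs at that rate. The labels are guesses.
std::string guess_timer_source(std::int64_t hz) {
    if (hz == 1193182)  return "PIT";            // 1.193182 MHz
    if (hz == 3579545)  return "ACPI PM timer";  // 3.579545 MHz
    if (hz == 14318180) return "HPET";           // commonly 14.31818 MHz
    return "other (possibly TSC-derived)";
}
```

On Windows, `guess_timer_source(qpc_frequency_hz())` gives a first hint about whether a slow legacy device sits behind the counter.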

score 3 · Answer 3 · answered Nov 12 '09 at 16:49

I was under the impression that on x86 QueryPerformanceCounter() just called rdtsc under the covers. I'm surprised that it has any slowdown on multi-core machines (I've never noticed it on my 4-core CPU).

Aaron

I don't know that it's a significant impact in practice and probably not measurable unless you are looking directly for it. On the 4-core cpu there is no slowdown at all anyways :) – Matt Price Nov 12 '09 at 16:56
@Goz while it was particularly true for old Opterons, newer multi-core CPUs have synchronized TSC registers. – v.oddou Sep 01 '14 at 03:54
@v.oddou: Probably true ... but is the timestamp counter not a simple cycle count? Does it take into account throttling? – Goz Sep 01 '14 at 06:38
@Goz yes, I have wondered a lot about this issue. The only thing I can reasonably imagine happening in practice is that all cores scale their speed together. This is reflected in programs like "coretemp" or "CPUz", which display only 1 frequency but 4 temperatures and usage bars. – v.oddou Sep 01 '14 at 07:01

score 2 · Answer 4 · answered Nov 12 '09 at 17:06

It's been a long time since I used this much, but if memory serves there isn't one implementation of this function, as the guts are provided by the various hardware manufacturers.

Here is a small article from MSDN: http://msdn.microsoft.com/ja-jp/library/cc399059.aspx

Also, if you're querying performance across multiple CPUs (as opposed to multiple cores on one CPU), it's going to have to communicate across the bus, which is both slower and could be where you are seeing some blocking.

However, like I said before it's been quite a while.

Mike
