The RDTSC instruction is not "serializing"; see this SO question: Why isn't RDTSC a serializing instruction?
Some background
Modern x86 cores have "out-of-order" (OoO) execution, which means an instruction is dispatched to an execution unit capable of executing it as soon as its operands are ready and the execution unit is available... instructions are not necessarily executed in program order. Instructions do retire in program order, so when an interrupt, exception, or fault occurs you still get the precise contents of registers and memory that in-order execution of the architecture specifies.
What this means is that the CPU is free to dispatch instructions for execution in whatever order it wants, to extract as much concurrency as possible and improve performance, as long as it maintains the illusion that instructions executed in program order.
The RDTSC instruction was designed to execute as fast as possible and be as non-intrusive as possible, with little overhead. It has a latency of about 22 processor cycles, but a lot of other work can be done concurrently with it.
There is a newer variant, RDTSCP, which waits for previous instructions in program order to finish before it reads the counter (it does not, by itself, keep later instructions from starting early; a fence such as LFENCE is needed for that)... this waiting is expensive from a performance point of view.
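If you want a timestamp read that is actually ordered with respect to the surrounding code, the compiler intrinsics get you there without inline assembly. A minimal sketch, assuming GCC or Clang on x86-64 (MSVC exposes the same intrinsics through <intrin.h>); the helper name tsc_ordered is just an illustration:

#include <cstdint>
#include <x86intrin.h>   // __rdtscp, _mm_lfence

// Reads the TSC only after all previous instructions have executed;
// the trailing LFENCE keeps later instructions from starting before
// the read, which RDTSCP alone does not guarantee.
static inline uint64_t tsc_ordered()
{
    unsigned int aux;                // receives IA32_TSC_AUX (typically a CPU id)
    uint64_t t = __rdtscp(&aux);
    _mm_lfence();
    return t;
}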
Back to your question
With that in mind, think about what the compiler generates and what the processor sees... the while(true) is just an unconditional branch; it isn't really "executed" so much as consumed by the front end of the pipeline, the instruction decoder, which fetches as far ahead as it can, cramming instructions into the dispatchers to get as many instructions executing per cycle as possible. So the RDTSC instructions in your loop are dispatched, other instructions keep flowing and executing, and eventually each RDTSC retires and its result is forwarded to the instructions that depend on it (a subtraction in your code). But you haven't precisely timed the inner loop.
Let's look at the following code:
beginTime = rdtsc();
run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());
iterateTime += rdtsc()-beginTime;
The assumption is that run_iter() has finished by the time you call rdtsc() after it returns. But what may really happen is that some load in run_iter() misses in the cache; the processor holds that load while it waits on memory, but it can keep executing independent instructions. It returns from the function (or the function was inlined by the compiler) and sees the RDTSC at the return, so it dispatches it... hey, it doesn't depend on the load that missed in the cache and it isn't serializing, so it's fair game. The RDTSC retires in 22 cycles, which is much faster than a cache miss that goes all the way to DRAM (hundreds of cycles)... and all of a sudden you have under-reported the time it took to execute run_iter().
The outer loop measurement doesn't suffer from this, so it gives you the true overall time in cycles.
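If you want to confirm that this is what is happening, you can pin both reads down so the second one cannot retire before run_iter() has really finished. A hedged rewrite of the same three lines, using something like the tsc_ordered() sketch above (or an equivalent LFENCE + RDTSC pair):

beginTime = tsc_ordered();                    // cannot drift ahead of earlier work
run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());
iterateTime += tsc_ordered() - beginTime;     // waits for run_iter's work to complete

You pay the serialization cost on every measurement, so treat it as a way to validate the numbers rather than as permanent instrumentation.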
Suggested fix
Here's a simple helper struct/class that lets you account for time in various accumulators without "time leaks." Each time you call the split() member function you pass it an accumulator variable, by reference, and it adds the interval since the previous split to that accumulator:
#include <cstdint>   // uint64_t; rdtsc() is your existing TSC-reading helper

struct Timer {
    uint64_t _previous_tsc;

    Timer() : _previous_tsc(rdtsc()) {}

    // Adds the interval since the last split (or construction) to 'accumulator'
    // and starts the next interval, so no cycles fall between accumulators.
    void split( uint64_t & accumulator )
    {
        uint64_t tmp = rdtsc();
        accumulator += tmp - _previous_tsc;
        _previous_tsc = tmp;
    }
};
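Timer assumes a rdtsc() function like the one you are already calling is in scope. If it isn't, here is a minimal sketch using the compiler intrinsic, assuming GCC or Clang on x86-64:

#include <cstdint>
#include <x86intrin.h>   // __rdtsc

// Thin wrapper so Timer doesn't care how the counter is read.
static inline uint64_t rdtsc()
{
    return __rdtsc();
}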
Now you can use one instance to time the "splits" of your inner loop and another for the overall outer loop:
uint64_t flagtime    = 0; // outer loop
uint64_t otherTime   = 0; // inner split
uint64_t iterateTime = 0; // inner split
uint64_t loopTime    = 0; // inner split

Timer tsc_outer;
Timer tsc_inner;

while( !it2->done() ) {
    tsc_inner.split( loopTime );
    bool cont = it2->Next();   // if we have more in the state table, we continue
    if( !cont ) break;
    totalF2 += it2->value2();  // for experiment, recording the sum of v
    updates++;                 // for experiment, recording the number of updates
    tsc_inner.split( otherTime );
    run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());
    tsc_inner.split( iterateTime );
}
tsc_outer.split( flagtime );
This is now "tight": you won't miss any cycles. One caveat, though: it still uses RDTSC instead of RDTSCP, so the reads aren't ordered with respect to the surrounding instructions, which means you might still under-report the time spent in one split (like iterateTime) while over-reporting some other accumulator (like loopTime). Cache misses in run_iter() that aren't charged to iterateTime will show up in loopTime instead.
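If at some point you need the splits themselves to be exact, one option (a sketch only, with the RDTSCP + LFENCE pattern as the assumption, not something taken from your code) is a variant that pays the serialization cost on every split:

#include <cstdint>
#include <x86intrin.h>   // __rdtscp, _mm_lfence

struct SerializingTimer {
    uint64_t _previous_tsc;

    SerializingTimer() : _previous_tsc(read_tsc()) {}

    // Waits for prior instructions before reading the counter and keeps
    // later instructions from starting before the read completes.
    static uint64_t read_tsc()
    {
        unsigned int aux;
        uint64_t t = __rdtscp(&aux);
        _mm_lfence();
        return t;
    }

    void split( uint64_t & accumulator )
    {
        uint64_t tmp = read_tsc();
        accumulator += tmp - _previous_tsc;
        _previous_tsc = tmp;
    }
};

Expect each split to cost noticeably more than the plain RDTSC version; whether that overhead is acceptable depends on how short the intervals you are measuring are.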
Note: the Hypervisor of the Virtual Machine may be trapping RDTSC
One thing to note is that, in a virtual machine, the hypervisor may set a control register that makes the CPU fault when a user-level program attempts to execute RDTSC... that definitely serializes execution, and it is a huge performance bottleneck. In these cases the hypervisor emulates the execution of RDTSC and supplies a virtual timestamp to the application. See this SO question: Weird program latency behavior on VM.
Initially I thought this wasn't the issue you were observing; now I'm wondering if it is. If the virtual machine is in fact trapping RDTSC, then you have to add the overhead of the hardware saving the VM registers, dispatching the kernel/hypervisor, and resuming your application after "fixing" EDX:EAX to emulate RDTSC... 50 billion cycles is a long time; at 3 GHz that's over 16 seconds. That would explain why you had so much time missing... 11 billion cycles... (44 - 33).