
I use the rdtsc() function to measure the execution time of a while-loop, once from outside it and once from inside it, and the two results differ widely. Measured from outside, the total comes to around 44.5 billion cycles; measured from inside, it comes to around 33 billion cycles.

The code segments are shown below:

while(true){
    beginTime = rdtsc();
    typename TypedGlobalTable<K, V, V, D>::Iterator *it2 = a->get_typed_iterator(current_shard(), false);
    getIteratorTime += rdtsc()-beginTime;
    if(it2 == NULL) break;

    uint64_t tmp = rdtsc();
    while(true) {
        beginTime = rdtsc();
        if(it2->done()) break;      
        bool cont = it2->Next();        //if we have more in the state table, we continue
        if(!cont) break;
        totalF2+=it2->value2();         //for experiment, recording the sum of v
        updates++;                      //for experiment, recording the number of updates
        otherTime += rdtsc()-beginTime;
        //cout << "processing " << it2->key() << " " << it2->value1() << " " << it2->value2() << endl;
        beginTime = rdtsc();
        run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());
        iterateTime += rdtsc()-beginTime;
    }
    flagtime += rdtsc()-tmp;
    delete it2;                         //delete the table iterator
}

The while-loop I tested is the inner one.

The rdtsc() function is shown below:

static uint64_t rdtsc() {
  uint32_t hi, lo;
  __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
  return (((uint64_t)hi)<<32) | ((uint64_t)lo);
}

I build and run this program under Ubuntu 10.04 LTS in a virtual machine; the kernel version is "Linux ubuntu 2.6.32-38-generic #83-Ubuntu SMP Wed Jan 4 11:13:04 UTC 2012 i686 GNU/Linux".

  • Can you show the disassembly of this `uint64_t tmp = rdtsc(); while(true) {`? Since I don't have a compiler right here. – Coldsteel48 Mar 25 '14 at 08:22
  • So, you are basically saying that `otherTime` + `iterateTime` is different from the total time of the entire loop? Yes, quite plausible. Also worth bearing in mind that the TSC can vary between CPUs, so if your code is not running on the same CPU core for the entire time, it may cause problems. – Mats Petersson Mar 25 '14 at 08:35
  • Please make your code segment compilable. At least close all the blocks you opened. In addition, you seem to have superfluous initialisations of your time marks... Likely you don't actually compare a timespan over the whole inner loop with comparable added chunks... Please say exactly which vars you compared in the end. – Deduplicator Mar 25 '14 at 08:41

1 Answer


The RDTSC instruction is not "serializing"; see this SO question:

Why isn't RDTSC a serializing instruction?

Some background

Modern x86 cores have "Out-of-Order" (OoO) execution, which means instructions are dispatched to an execution unit capable of executing the instruction as soon as the operands are ready and the execution unit is available... instructions are not necessarily executed in program order. Instructions do retire in program order, however, so when an interrupt, exception, or fault occurs you get the precise contents of registers and memory that in-order execution of the architecture specifies.

What this means is the CPU is free to dispatch instructions for execution in whatever order it wants to get as much concurrency as possible and improve performance, as long as it gives the illusion that instructions executed in-order.

The RDTSC instruction was designed to execute as fast as possible, to be as non-intrusive as possible, with little overhead. It has a latency of about 22 processor cycles, but a lot of other work can complete concurrently with it.

There is a newer variant, called RDTSCP, which is (partially) serializing... the processor waits for all previous instructions in program order to finish before reading the counter (though subsequent instructions may still begin execution early; you need CPUID or LFENCE for a full barrier)... this is expensive from a performance point of view.

Back to your question

With that in mind, think about what the compiler generates and what the processor sees... the while(true) is just an unconditional branch; it doesn't really execute, but is consumed by the front end of the pipeline, the instruction decoder, which fetches as far ahead as it can, cramming instructions into the dispatchers to get as many instructions executing per cycle as possible. So the RDTSC instructions in your loop are dispatched, other instructions keep flowing and executing, eventually the RDTSC retires, and its result is forwarded to the instructions that depend on it (a subtraction in your code). But you haven't really timed the inner loop precisely.

Let's look at the following code:

beginTime = rdtsc();
run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());
iterateTime += rdtsc()-beginTime;

The assumption is that run_iter() will have finished by the time you call rdtsc() after it returns. But what may really happen is that some load from memory in run_iter() misses in the cache; the processor keeps that load waiting on memory while continuing to execute independent instructions. It returns from the function (or the function was inlined by the compiler), sees the RDTSC at the return, and dispatches it... hey, it isn't dependent on the load that missed in the cache and it isn't serializing, so it's fair game. The RDTSC retires in 22 cycles, which is much faster than a cache miss that goes to DRAM (hundreds of cycles)... and all of a sudden you have under-reported the time it took to execute run_iter().

The outer loop measurement doesn't suffer from this, so it gives you the true overall time in cycles.

Suggested fix

Here's a simple helper struct that lets you account for time in various accumulators without "time leaks." Each time you call the split member function you pass it an accumulator variable, by reference, into which it accumulates the interval since the previous split:

struct Timer {
    uint64_t _previous_tsc;
    Timer() : _previous_tsc(rdtsc()) {}
    void split( uint64_t & accumulator )
    {
        uint64_t tmp = rdtsc();
        accumulator += tmp - _previous_tsc;
        _previous_tsc = tmp;
    }
};

Now you can use one instance to time the "splits" of your inner loop and another for the overall outer loop:

uint64_t flagtime    = 0; // outer loop

uint64_t otherTime   = 0; // inner split
uint64_t iterateTime = 0; // inner split
uint64_t loopTime    = 0; // inner split

Timer tsc_outer;
Timer tsc_inner;

while(! it2->done()) {

    tsc_inner.split( loopTime );

    bool cont = it2->Next();        //if we have more in the state table, we continue
    if(!cont) break;
    totalF2+=it2->value2();         //for experiment, recording the sum of v
    updates++;                      //for experiment, recording the number of updates

    tsc_inner.split( otherTime );

    run_iter(it2->key(), it2->value1(), it2->value2(), it2->value3());

    tsc_inner.split( iterateTime );
}
tsc_outer.split( flagtime );

This is now "tight": you won't miss any cycles. One caveat, though: it still uses RDTSC instead of RDTSCP, so it isn't serializing, which means you might still under-report the time spent in one split (like iterateTime) while over-reporting some other accumulator (like loopTime). Cache misses in run_iter() that aren't accounted for in iterateTime will be accounted for in loopTime.

Note: the Hypervisor of the Virtual Machine may be trapping RDTSC

One thing to note is that in a virtual machine it is possible that the hypervisor sets a control register to force the CPU to fault when a user level program attempts to execute RDTSC... that would definitely serialize execution, and would be a huge performance bottleneck. In these cases the hypervisor emulates execution of RDTSC and provides a virtual timestamp to the application. See SO question Weird program latency behavior on VM.

Initially I thought this wasn't the issue you were observing, I'm wondering now if it is. If in fact the virtual machine is trapping the RDTSC then you have to add the overhead of the hardware saving the VM registers, dispatching the kernel/hypervisor, and resuming your application after "fixing" EDX:EAX to emulate the RDTSC... 50 billion cycles is a long time, at 3 GHz that's over 16 seconds. That would explain why you had so much time missing... 11 billion cycles... (44 - 33).

amdn
  • I really appreciate your detailed answer. What method can I use to accurately measure the execution time in this case? – FabulousMagi Mar 25 '14 at 10:35
  • Besides, according to your explanation, it seems that the execution time measured by rdtsc() would be erratic, but the result I get every time is almost the same. Why is that? – FabulousMagi Mar 25 '14 at 10:42
  • I've added a suggestion on how to accumulate time without missing cycles... I haven't actually compiled it, so I may have missed something in the syntax... let me know if you try it out. – amdn Mar 25 '14 at 10:46
  • 1
    I have tried your solution, It really works. I removed the "loopTime" part, and I've got otherTime = 18.2873 billion cycles, iterateTime = 31.7971 billion cycles, flagtime = 50.0909 billion cycles. The sum of otherTime and iterateTime is almost equal to the flagtime. But it seems that the previous problem is related to the otherTime instead of iterateTime. – FabulousMagi Mar 25 '14 at 11:21
  • The previous problem is that there were cycles that weren't being accumulated anywhere, because it seemed unnecessary to time branching back to the top of the loop... from "iterateTime += rdtsc()-beginTime;" to the top of the inner loop "beginTime = rdtsc();" are cycles that aren't accumulated anywhere... if `RDTSC` were serializing then those cycles would be negligible, just the time to do an unconditional branch (never mispredicted)... but because of potentially lingering cache misses eventually the CPU stalls... and you never counted that stall time. – amdn Mar 25 '14 at 11:30
  • I'm wondering now if in fact the virtual machine is trapping the `RDTSC`... 50 billion cycles is a long time, at 3 GHz that's over 16 seconds. That would explain why you had so much time missing... 11 billion cycles... (44 - 33). How many times do you iterate - in other words, what's the value of the `updates` counter? See http://stackoverflow.com/questions/22579864/weird-program-latency-behavior-on-vm – amdn Mar 25 '14 at 11:40
  • So some part of the cycles that should belong to run_iter() function are accumulated onto otherTime, I suppose? The updates counter is 4926150. – FabulousMagi Mar 25 '14 at 11:51
  • Yes, if you only access the `RDTSC` through the Timer class, then all cycles get accounted for, but because of lack of serialization they may get accounted to the "wrong" accumulator. It is likely the `RDTSC` is being trapped... 11 billion missing cycles divided by 5 million loop iterations means you have 2,200 "missing" CPU cycles per loop iteration... that sounds right for a round-trip to the hypervisor and back to your app. – amdn Mar 25 '14 at 12:07
  • Can I fix my problem by using RDTSCP instead of RDTSC? As you mentioned above. – FabulousMagi Mar 25 '14 at 12:28
  • The RDTSCP instruction is supported in newer processors. On Redhat distribution you can check whether your processor supports it like this cat /proc/cpuinfo | tr ' ' '\n'|grep rdtscp This will fix the serialization issue (it will further slow down your application). See this SO question on how to use it http://stackoverflow.com/questions/14783782/which-inline-assembly-code-is-correct-for-rdtscp But I suspect the VM is trapping your use of RDTSC (and therefore also RDTSCP). That would be a bigger issue, you'd have to see if the Hypervisor can be configured to give you native access. – amdn Mar 25 '14 at 12:32
  • I have tried replacing RDTSC with RDTSCP; the result shows no change. – FabulousMagi Mar 25 '14 at 12:39
  • Then most certainly your issue is overhead of the VM `virtualizing` the RDTSC/RDTSCP family of instructions. It sets bit #2 of Control Register #4 (CR4.TSD) "Time Stamp Disable" to disallow you from reading it directly, then it spends 2,200 cycles to provide your virtual machine with a virtual value... you may be able to change that behavior, see http://xenbits.xen.org/docs/4.2-testing/misc/tscmode.txt – amdn Mar 25 '14 at 12:43
  • I installed Ubuntu 13.10 (64-bit) on a real PC and ran my program on it; the previous problem still exists. – FabulousMagi Mar 26 '14 at 14:59
  • great answer! love the detail. – Justin Grant Feb 04 '16 at 01:00