I am writing code to compare the execution times of different versions of a bigint add function, on an AMD FX(tm)-8350 Eight-Core Processor at 4.00 GHz. I need help making sense of the machine behaviour that I’m observing.
The code that I’m timing performs purely mathematical calculations and is CPU bound: the functions only access data that is embedded in the executable, so no disk or other external resources are involved.
I time the functions like this (this is on Windows):
mfence              ; order all earlier loads/stores before reading the TSC
rdtscp              ; EDX:EAX = time-stamp counter, ECX = TSC_AUX
lfence              ; intended to keep later instructions from starting before rdtscp
shl rdx,32          ; move the high 32 bits of the TSC into position
lea r15,[rdx+rax]   ; r15 = ticks before the function is entered
mov rcx,rbp         ; the only argument to the function
call [r13]          ; call the function being timed
mfence              ; order the loads/stores issued by the timed function
rdtscp              ; EDX:EAX = time-stamp counter, ECX = TSC_AUX
lfence              ; intended to keep later instructions from starting before rdtscp
shl rdx,32          ; move the high 32 bits of the TSC into position
lea rbx,[rdx+rax]   ; rbx = ticks right after the timed function
sub rbx,r15         ; rbx = ticks delta
mov [r12+r14*4],ebx ; store ticks delta
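For readers who prefer C, this is roughly the same measurement expressed with compiler intrinsics (just a sketch; bigint_add_fn and arg are placeholders for the function under test and its single argument):

    #include <stdint.h>
    #ifdef _MSC_VER
    #include <intrin.h>        /* __rdtscp, _mm_mfence, _mm_lfence */
    #else
    #include <x86intrin.h>
    #endif

    typedef void (*bigint_add_fn)(void *arg);   /* placeholder signature */

    static uint32_t time_one_call(bigint_add_fn fn, void *arg)
    {
        unsigned int aux;                /* TSC_AUX written by rdtscp, unused */
        uint64_t t0, t1;

        _mm_mfence();
        t0 = __rdtscp(&aux);             /* ticks before the function is entered */
        _mm_lfence();

        fn(arg);                         /* the function being timed */

        _mm_mfence();
        t1 = __rdtscp(&aux);             /* ticks right after the timed function */
        _mm_lfence();

        return (uint32_t)(t1 - t0);      /* ticks delta */
    }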
Because I expect the processor not to be up to speed at the beginning, I keep discarding the tick samples as long as they are decreasing in value (which I take to mean that the CPU is still ramping up its clock). When I get a sample that is equal to or larger than the previous one, I assume the CPU has reached its maximum speed; I retain that sample and the previous (smaller) one, and from then on I collect another 18 samples for a total of 20.
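In pseudo-C the sampling procedure looks roughly like this (again a sketch; time_one_call is the helper from the sketch above, and RETAINED is the 20 samples I keep):

    enum { RETAINED = 20 };

    static void collect_samples(bigint_add_fn fn, void *arg, uint32_t out[RETAINED])
    {
        uint32_t prev = time_one_call(fn, arg);
        uint32_t cur  = time_one_call(fn, arg);

        /* CPU presumably still ramping up: samples keep decreasing, discard them */
        while (cur < prev) {
            prev = cur;
            cur  = time_one_call(fn, arg);
        }

        /* first sample that is equal or larger: keep it and the previous one */
        out[0] = prev;
        out[1] = cur;

        /* collect the remaining samples */
        for (int i = 2; i < RETAINED; i++)
            out[i] = time_one_call(fn, arg);
    }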
My first question is: shouldn’t I expect the timings to decrease in value while the CPU is getting up to speed, and from then on to see EQUAL timings? Of course an IRQ could trigger, a context switch could happen (unlikely), some other process could evict my cache lines, and so on. But if any of that happens, shouldn’t I see a sequence of equal timings followed by a large spike? Alas, that is not what I see. Here’s an example:
Discarded 7 initial ticks:
34971 4908 4189 3759 3603 3483 3432
Unordered retained ticks:
3370 3431 3388 3387 3388 3388 3387 3374 3373 3373 3388 3388 3387 3388 3387 3388 3388 3387 3388 3388 3387 3388 3387 3374 3373 3373 3388 3388 3387 3388 3387 3388 3388 3387 3388 3388 3387 3388 3387 3374 3373 3373 3388 3388 3387 3388 3388 3387 3388 3387
I have no problem with values separated by one tick, like 3387 and 3388, or 3373 and 3374; that is normal jitter. But why should 3387/3388 be followed by 3373/3374 and then go back to the higher value? It was certainly not an interrupt coming in to perturb things by a mere 14 ticks.
But the main question comes from the following observation. In the example above, the most common measured value is 3388 (or, equivalently, 3387).
But when I repeat the test many times, these are the most common values I have observed (in the order observed):
3350, 3312, 3310, 3295, 3286, 3334, 3440, 3344, 3300, 3311, 3306, 3331
Between the largest and the smallest of those values there is a difference of more than 4%. Why should the timing of a function be so inaccurate? Why should a function that uses exactly the same data and the same resources vary its execution time like that?
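(To be precise, by “most common value” I mean the mode of a run’s retained samples, computed with something like the following sketch; the tie-breaking toward the smaller value is an arbitrary choice:)

    static uint32_t most_common(const uint32_t samples[], int n)
    {
        uint32_t best = samples[0];
        int best_count = 0;

        for (int i = 0; i < n; i++) {
            int count = 0;
            for (int j = 0; j < n; j++)
                if (samples[j] == samples[i])
                    count++;                 /* how often samples[i] occurs */
            if (count > best_count || (count == best_count && samples[i] < best)) {
                best = samples[i];           /* new mode (or smaller tie) */
                best_count = count;
            }
        }
        return best;
    }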
What puzzles me is not that I measure different times. If I saw a 4% difference between one sample and the next, I could simply accept that the hardware has that kind of resolution.
But why should the great majority of the timings be 3350 in one particular run, and most of them be 3312 in the next?
I’m eager to get some insight into this, to help me decide how best to compare the different execution times.