
I am writing code to compare the execution times of different versions of a bigint add function on an AMD FX-8350 (eight cores, 4.00 GHz). I need help making sense of the machine behaviour I'm observing.

The code that I'm timing performs mathematical calculations and is mostly CPU bound: the functions only access data that is embedded in the executable, so no disk or other external resources are needed.

I time the functions like this (this is on Windows):

  mfence                    ; order earlier memory operations before the timestamp read
  rdtscp                    ; read the TSC into edx:eax
  lfence                    ; keep later instructions from starting before rdtscp completes
  shl      rdx,32
  lea      r15,[rdx+rax]    ; r15 = ticks before the function is entered

  mov      rcx,rbp          ; the only argument to the function
  call     [r13]            ; call the function being timed

  mfence                    ; order the timed function's memory operations before the second read
  rdtscp
  lfence
  shl      rdx,32
  lea      rbx,[rdx+rax]    ; rbx = ticks right after the timed function

  sub      rbx,r15          ; rbx = elapsed ticks
  mov      [r12+r14*4],ebx  ; store ticks delta
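
For reference, roughly the same measurement can be written in C with compiler intrinsics. This is only a sketch: `bigint_add` stands in for whatever `[r13]` points to, and the intrinsics come from `<x86intrin.h>` on GCC/clang (or `<intrin.h>` on MSVC).

    #include <stdint.h>
    #include <x86intrin.h>   /* __rdtscp, _mm_mfence, _mm_lfence; use <intrin.h> on MSVC */

    void bigint_add(void *arg);               /* placeholder for the function being timed */

    static uint64_t time_one_call(void *arg)
    {
        unsigned int aux;                     /* receives IA32_TSC_AUX from rdtscp */

        _mm_mfence();                         /* same fencing pattern as the asm above */
        uint64_t start = __rdtscp(&aux);
        _mm_lfence();

        bigint_add(arg);                      /* the function being timed */

        _mm_mfence();
        uint64_t stop = __rdtscp(&aux);
        _mm_lfence();

        return stop - start;                  /* ticks delta */
    }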

Because I expect that at the beginning the processor is not up to speed, I keep discarding the tick samples as long as they are decreasing in value (which I believe means the CPU is still ramping up its clock speed). When I get a sample that is equal to or larger than the previous one, I assume the CPU has reached its maximum speed. I retain that sample and the previous (smaller) one, and from then on I collect another 18 samples for a total of 20.
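
In C-like pseudocode, that sampling policy looks roughly like this (just a sketch; `time_one_call` is the hypothetical helper from the sketch above, and the names are illustrative, not my actual harness):

    #define RETAINED 20

    /* Discard samples while they keep decreasing (CPU still ramping up), then
     * keep the first non-decreasing sample, its (smaller) predecessor, and 18 more. */
    static void collect_samples(uint64_t samples[RETAINED], void *arg)
    {
        uint64_t prev = time_one_call(arg);
        uint64_t cur  = time_one_call(arg);

        while (cur < prev) {                  /* still decreasing: discard */
            prev = cur;
            cur  = time_one_call(arg);
        }

        samples[0] = prev;                    /* the previous (smaller) sample */
        samples[1] = cur;                     /* first sample >= its predecessor */
        for (int i = 2; i < RETAINED; i++)    /* 18 more, for a total of 20 */
            samples[i] = time_one_call(arg);
    }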

My first question is: shouldn't I expect to see the timings decrease in value while the CPU is getting up to speed, and from then on to see EQUAL timings? Of course, an IRQ could fire, a context switch could happen (unlikely), some other process could evict my cache lines, etc. But if any of that happened, shouldn't I see a sequence of equal timings interrupted by the occasional large spike? Alas, that is not what I see! Here's an example:

Discarded 7 initial ticks:
    34971     4908     4189     3759     3603     3483     3432
Unordered retained ticks:
     3370     3431     3388     3387     3388     3388     3387     3374     3373     3373
     3388     3388     3387     3388     3387     3388     3388     3387     3388     3388
     3387     3388     3387     3374     3373     3373     3388     3388     3387     3388
     3387     3388     3388     3387     3388     3388     3387     3388     3387     3374
     3373     3373     3388     3388     3387     3388     3388     3387     3388     3387

I have no problem with values separated by one tick, like 3387 and 3388, or 3373 and 3374; that is normal jitter. But why should 3387/3388 be followed by 3373/3374 and then go back to the higher value? There was certainly no interrupt coming in to perturb things by a mere 14 ticks.

But the main question comes from this observation: in the above example the most common measured value is 3388 (or, equivalently, 3387).

But when I repeated the test many times, the most common values I observed were (in the order observed):

3350, 3312, 3310, 3295, 3286, 3334, 3440, 3344, 3300, 3311, 3306, 3331

Between the largest and the smallest of those values there is a difference of more than 4%. Why should the timing of a function be so inaccurate? Why should a function that uses exactly the same data and the same resources vary its execution time like that?

What puzzles me is not that I measure different times. If I saw a 4% difference between one sample and the next, I could simply accept that the hardware has that kind of resolution.

But why should the great majority of the timings be 3350 in one run, and most of them be 3312 in the next?

I’m eager to get some insight into this, to help me decide how to best compare different execution times.

  • What does your test function do? Is it pure ALU work, or are there some cache misses and/or branch mispredicts? What CPU are you timing on? If it's AMD, max turbo might be constantly varying. (The TSC counts at a fixed frequency, so even if everything else was perfectly simple, you would expect maybe +-2 TSC counts from starting / stopping the clock.) – Peter Cordes Apr 15 '23 at 20:01
  • Normally if you want to microbenchmark something, you put it in a repeat loop and time multiple repeats. In asm you can trivially be sure there isn't a compiler optimizing across iterations, and it lets out-of-order / superscalar exec do its job. (You have to choose whether to benchmark throughput vs. latency, of course, by creating a data dependency from the output of one call to the input of the next.) [Assembly - How to score a CPU instruction by latency and throughput](https://stackoverflow.com/q/52260625) (also applies to blocks of code the size you're doing; a rough sketch of this repeat-loop idea appears after these comments.) – Peter Cordes Apr 15 '23 at 20:06
  • @Peter Cordes: I'm operating with large-precision numbers, expressed as arrays of several 64-bit words (30 in this particular case, could go to a few hundred). So, two arrays are read sequentially, and the result array is written to. So, yes, there would be cache misses and branch mispredicts, but I'm not trying to measure that. The timing code I show above does run in a loop. That's how I collected the 20 samples I used as example. – user1752563 Apr 15 '23 at 20:10
  • I'm running on an AMD FX(tm)-8350 Eight-Core Processor 4.00 GHz – user1752563 Apr 15 '23 at 20:13
  • You're timing each call separately, though, rather than one time for 20 calls which would let CPUs overlap the dependency chains of two different bigint adds. Here you're serializing execution (via `lfence` and `rdtscp`) between each chunk of work. Since that doesn't happen during real workloads, it's possibly a distortion to optimize for that. (It's still interesting that you don't get perfectly consistent results from it, but it's not how I'd benchmark to see if I was making the code faster. At least not the *only* way.) – Peter Cordes Apr 15 '23 at 20:14
  • In real life, say you needed to calculate the cosine of some 50-word big number: the code I'm timing now would be used to do a mult, then the result would be added to something, then that divided by something else, and so on. Each of these operations involves over a thousand lines of C code. Therefore (though I may be wrong, because I'm quite ignorant of the matter), latency is not relevant here. In that scenario there would be many more cache misses than what I can experience by running the same function in a loop with the same data. – user1752563 Apr 15 '23 at 20:29
  • 50 words will easily fit in L1d cache; you should get mostly cache hits. Even the store buffer has more than 50 entries in modern x86 CPUs, so that many stores can be in flight at once. Out-of-order exec might be hard-pressed to see past a 50x50-word bigint multiply until it was pretty close to the end, but could see past a 50-word bigint add and start using the low words of the result in the next multiply before it had finished executing the final `adc` instructions that propagate the carry to the top word. – Peter Cordes Apr 15 '23 at 20:40
  • But yes, the whole cosine calculation is big enough compared to the ROB / scheduler size that it's not too much of a distortion to just time it by draining the pipeline at the end. And the CPU can probably find enough instruction-level parallelism within it. (Especially an old AMD Bulldozer-family with its weak integer cores that aren't very wide.) – Peter Cordes Apr 15 '23 at 20:42
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253171/discussion-between-user1752563-and-peter-cordes). – user1752563 Apr 15 '23 at 21:16
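
A rough sketch of the repeat-loop idea Peter Cordes suggests above, assuming the same hypothetical `bigint_add` and fencing helpers as in the earlier sketches: one `rdtscp` pair brackets many back-to-back calls and the average per call is reported, instead of draining the pipeline around every single call.

    #define REPEATS 1000

    static double time_repeat_loop(void *arg)
    {
        unsigned int aux;

        _mm_mfence();
        uint64_t start = __rdtscp(&aux);
        _mm_lfence();

        for (int i = 0; i < REPEATS; i++)
            bigint_add(arg);                  /* throughput: iterations may overlap */

        _mm_mfence();
        uint64_t stop = __rdtscp(&aux);
        _mm_lfence();

        return (double)(stop - start) / REPEATS;
    }

To measure latency rather than throughput, each iteration would need a data dependency on the previous one (for example, feeding part of one result into the next call's input), as the linked Q&A describes.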

0 Answers