
My understanding of CPU time is that it should always be the same for every execution on the same machine: it should take an identical number of CPU cycles every time.

But I'm running some tests now, executing a basic `echo "Hello World"`, and it's giving me anywhere from 0.003 to 0.005 seconds of CPU time.
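Roughly the kind of run I'm doing (sketched here with bash's `time` builtin; it reports `user` and `sys` CPU time separately from the wall-clock `real` time):

```bash
# Time the same trivial command several times in a row;
# the reported user + sys (the CPU time) varies from run to run.
for i in 1 2 3 4 5; do
    time echo "Hello World"
done
```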

Is my understanding of CPU time wrong, or is there an issue in my measurement?

Erndob
  • You are comparing code in a high-level programming language to something low-level like a CPU; that's flawed. There are many things going on in between your code and what the CPU actually gets. Also, there is CPU caching, which speeds things up like crazy. If you could force the CPU to get the exact same instructions and could deactivate factors like caching or hardware temperature, then yes, you would get the same cycles and thus approximately the same speed. Also, you are measuring your code on your OS, not in isolation, so other programs are also running and taking CPU time. – Zabuzard Jan 19 '20 at 10:59
  • Extending that, there is also the OS scheduler that might decide to give your program less CPU time. – Zabuzard Jan 19 '20 at 11:03
  • The biggest effects on modern systems are: virtual memory lazily paging in code and data from disk if it's not hot in pagecache, and that **CPU frequency isn't fixed.** (idle / turbo speeds. `grep MHz /proc/cpuinfo`) – Peter Cordes Jan 19 '20 at 21:10

2 Answers


Your understanding is completely wrong. Real-world computers running modern OSes on modern CPUs are not simple, theoretical abstractions. There are all kinds of factors that can affect how much CPU time code requires to execute.

Consider memory bandwidth. On a typical modern machine, all the tasks running on the machine's cores are competing for access to system memory. If your code runs at the same time that code on another core is using lots of memory bandwidth, accesses to RAM may take more clock cycles.

Many other resources are shared as well, such as caches. Say the code is frequently interrupted to let other code run on the core. That will mean that the code will frequently find the cache cold and take lots of cache misses. That will also result in the code taking more clock cycles.

Let's talk about page faults as well. The code itself may or may not already be in memory when it starts running. Even if the code is in memory, you may or may not take soft page faults (which update the operating system's tracking of what memory is being actively used), depending on when that page last took a soft page fault or how long ago it was loaded into RAM.

And your basic hello world program is doing I/O to the terminal. The time that takes can depend on what else is interacting with the terminal at the time.
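If you want to see some of these effects directly, you can count them (a sketch, assuming Linux with `perf` installed; `sh -c 'echo ...'` just stands in for however the command is being launched):

```bash
# Repeat the command 5 times and report task-clock, context switches,
# page faults and cycles; the counts (and thus the CPU time) differ between runs.
perf stat -r 5 -e task-clock,context-switches,page-faults,cycles \
    sh -c 'echo "Hello World"'
```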

David Schwartz
  • ... that being said, in a hypothetical environment, excluding all external factors and ensuring that the CPU instructions are the same, you would theoretically get the same number of cycles and thus the same speed. – Zabuzard Jan 19 '20 at 11:04
  • 1
    Only if you define "external factors" as anything that can change the number of cycles the code takes, in which case it's just a tautology. Why is the OS's page activity tracking an "external factor"? Try changing the window size on a Windows program that's doing output to a text window. – David Schwartz Jan 19 '20 at 11:05
  • I fully agree. I just think that is the flaw in OP's thinking. So there is some truth behind it, if there weren't any of these millions of factors in between his code and the CPU _just_ executing it. – Zabuzard Jan 19 '20 at 11:07
  • 1
    A very common reasoning error, particularly among programmers with less than two years of experience, is thinking that the actual real-world computer they are working with will behave like a simple, idealized theoretical model of a computer. Real-world computers running modern operating systems with modern CPUs are fiendishly complex beasts. – David Schwartz Jan 19 '20 at 11:09
  • @Zabuza: Even controlling for everything else (caches hot, and sometimes even in a loop with no data loads/stores), the same loop might have two stable states for how the CPU schedules instructions. Different entry condition quirks can lead to an ongoing speed difference over millions of loop iterations. I've seen this occasionally when microbenchmarking stuff on modern Intel CPUs like Sandybridge and Skylake. It's usually not clear exactly what the two stable states are, and what exactly is causing the bottleneck, even with the help of performance counters and https://agner.org/optimize/ – Peter Cordes Jan 19 '20 at 11:29
  • e.g. @BeeOnRope and I have seen this sometimes. In one case I remember, an interrupt tended to get the loop into the efficient mode of execution. So @Bee was measuring slow cycles/iteration using RDTSC or RDPMC for a short interval, while I was measuring fast by using a really large repeat count and just using `perf stat` on the whole program (which was a static executable with just that one loop written by hand in asm). And @Bee was able to repro my results by increasing the iteration count. – Peter Cordes Jan 19 '20 at 11:32
  • Got it. Thanks. The comments were useful too. From the definition, I've wrongly assumed that stuff like memory access and other I/O is not part of CPU time. – Erndob Jan 19 '20 at 12:03

The biggest effects on modern systems include:

  • virtual memory lazily paging in code and data from disk if it's not hot in pagecache. (First run of a program tends to have a lot more startup overhead.)
  • CPU frequency isn't fixed (idle / turbo speeds; `grep MHz /proc/cpuinfo`).
  • CPU caches can be hot or not
  • (for very short intervals) an interrupt randomly happening or not in your timed region.

So even if cycles were fixed (which they very much are not), you wouldn't see equal times.
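For example, you can watch two of those effects on a Linux box (a sketch; the cpufreq sysfs path and governor names depend on your distro and driver):

```bash
# CPU frequency isn't fixed: watch the per-core clocks move around.
grep MHz /proc/cpuinfo

# Which frequency-scaling governor is active (powersave / ondemand / performance / ...).
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor

# Run the same binary twice: the first run may have to page code/data
# in from disk, the second finds it hot in the pagecache.
time /bin/echo "Hello World"
time /bin/echo "Hello World"
```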

Your assumption is not totally wrong, but it only applies to core clock cycles for individual loops, and only in cases that don't involve any memory access (e.g. data already hot in L1d cache and code already hot in L1i cache inside a CPU core), and assuming no interrupt happens while the timed loop is running.

Running a whole program is a much larger-scale operation and will involve shared resources (and possible contention for them) like access to main memory. And as @David pointed out, a `write` system call to print a string on a terminal emulator communicates with another process; that can be slow and involves waking the other process up, if your program ends up waiting for it. Redirecting to `/dev/null` or a regular file would remove that, or just closing stdout (e.g. `./hello >&-`) would make your `write` system call return `-EBADF` (on Linux).
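For example (with `./hello` standing in for whatever program you're timing):

```bash
# Output goes to the terminal emulator: the write may wait on another process.
time ./hello

# Output goes to the kernel's bit bucket: no other process involved.
time ./hello > /dev/null

# stdout closed: the write system call returns -EBADF immediately (on Linux).
time ./hello >&-
```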

Modern CPUs are very complex beasts. You presumably have an Intel or AMD x86-64 CPU with out-of-order execution, and a dozen or so buffers for incoming / outgoing cache lines, allowing it to track about that many outstanding cache misses (memory-level parallelism). And 2 levels of private cache per core, and a shared L3 cache. Good luck predicting an exact number of clock cycles for anything but the most controlled conditions.

But yes, if you do control the conditions, the same small loop will typically run at the same number of core clock cycles per iteration.
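A controlled measurement might look like this (a sketch: `./loop` is a hypothetical statically linked binary containing just the loop under test):

```bash
# Pin to one core to avoid migrations, repeat the whole run 10 times,
# and count core cycles and instructions; under these conditions the
# per-run numbers are usually very repeatable.
taskset -c 2 perf stat -r 10 -e cycles,instructions ./loop
```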

However, even that's not always the case. I've seen cases where the same loop seems to have two stable states for how the CPU schedules instructions. Different entry condition quirks can lead to an ongoing speed difference over millions of loop iterations.

I've seen this occasionally when microbenchmarking stuff on modern Intel CPUs like Sandybridge and Skylake. It's usually not clear exactly what the two stable states are, and what exactly is causing the bottleneck, even with the help of performance counters and https://agner.org/optimize

In one case I remember, an interrupt tended to get the loop into the efficient mode of execution. @BeeOnRope was measuring slow cycles/iteration using RDPMC for a short interval (or maybe RDTSC with core clock fixed = TSC reference clocks), while I was measuring it running faster by using a really large repeat count and just using `perf stat` on the whole program (which was a static executable with just that one loop written by hand in asm). And @Bee was able to repro my results by increasing the iteration count so an interrupt would happen inside the timed region; returning from the interrupt tended to get the CPU out of that non-optimal uop-scheduling pattern, whatever it was.

Peter Cordes
  • 1
    There was a reverse case too, where after an interrupt things were slower (at least after an interrupt that does XSAVE/xrestor). One example is that addressing is considered simple even if indexed, as long as the index is zero by a zeroing idiom but this does not survive xrestor. – BeeOnRope Jan 20 '20 at 02:31
  • Interrupts / context switches save GP-integer regs with push/pop (which also destroys known-zeroness). [`xrstor`](https://www.felixcloutier.com/x86/xrstor) doesn't touch GP-integer regs. Interesting; I didn't know a zeroed index register could avoid unlamination. That implies unlamination doesn't happen until after consulting the RAT. Or are you talking about the load-use latency special case for `[reg + 0..2047]` addressing modes, using the bare reg for TLB lookup? ([Is there a penalty when base+offset is in a different page than the base?](//stackoverflow.com/q/52351397)) – Peter Cordes Jan 20 '20 at 02:39