74

Consider:

#include <time.h>    // clock()
#include <unistd.h>  // usleep()
#include <iostream>
using namespace std;

const int times = 1000;  // timing-loop iterations
const int N = 100000;    // delay-loop iterations

void run() {
  for (int j = 0; j < N; j++) {  // empty delay loop; survives at -O0
  }
}

int main() {
  clock_t main_start = clock();
  for (int i = 0; i < times; i++) {
    clock_t start = clock();
    run();
    // On Linux, clock() ticks are microseconds (CLOCKS_PER_SEC == 1000000),
    // so dividing by 1000.0 converts the delta to milliseconds.
    cout << "cost: " << (clock() - start) / 1000.0 << " ms." << endl;
    //usleep(1000);  // uncomment to sleep 1 ms between timed runs
  }
  cout << "total cost: " << (clock() - main_start) / 1000.0 << " ms." << endl;
}

In the first 26 iterations of the timing loop above, each call to the run function costs about 0.4 ms, but then the cost drops to about 0.2 ms.

When the usleep is uncommented, the delay loop takes 0.4 ms on every iteration and never speeds up. Why?

The code is compiled with g++ -O0 (no optimization), so the delay loop isn't optimized away. It runs on an Intel(R) Core(TM) i3-3220 CPU @ 3.30 GHz, under kernel 3.13.0-32-generic on Ubuntu 14.04.1 LTS (Trusty Tahr).

Peter Cordes
phyxnj
  • You should probably check the result of `usleep()` as it may be interrupted or it may do nothing because your parameter isn't valid. This would make any timings unreliable. – John3136 Jul 11 '16 at 04:11
  • @John3136: The usleep is outside the timing window. He's timing a busy-loop either back to back or separated by 1ms sleeps. – Peter Cordes Jul 11 '16 at 06:00
  • 1
    For benchmarking purposes, you should compile with `gcc -O2` or (since your code is C++) `g++ -O2` at least. – Basile Starynkevitch Jul 11 '16 at 08:02
  • @BasileStarynkevitch: Problem with that in the present example: At any optimization level, the compiler would dead-strip the whole code. (Which leads to the question: Why would I want to benchmark an empty, invariant loop in the first place?). – Damon Jul 11 '16 at 15:00
  • Then update the code to compute something meaningful, as a function of some *unknown* input (e.g. `strtol(argv[1])`...) – Basile Starynkevitch Jul 11 '16 at 16:35
  • 1
    If you sleep for 1000 microseconds, I'd expect the loop to take at least 1 millisecond. How are you measuring 0.4 ms? – Adrian McCarthy Jul 11 '16 at 21:35
  • 2
    @AdrianMcCarthy: the `usleep` is [outside the timing window](http://stackoverflow.com/questions/38299023/why-does-this-delay-loop-start-to-run-faster-after-several-iterations-with-no-sl/38300014#comment64016653_38299023) – Peter Cordes Jul 12 '16 at 22:43

2 Answers

128

After 26 iterations, Linux ramps the CPU up to the maximum clock speed since your process uses its full time slice a couple of times in a row.

If you checked with performance counters instead of wall-clock time, you'd see that the core clock cycles per delay-loop stayed constant, confirming that it's just an effect of DVFS (which all modern CPUs use to run at a more energy-efficient frequency and voltage most of the time).
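
A minimal Linux-specific sketch of that measurement, assuming a kernel with perf events enabled: open a hardware cycle counter via the raw perf_event_open syscall (it has no glibc wrapper) and read it around each run() call. The per-run cycle count should stay roughly constant even while the wall-clock time drops. You may need to lower /proc/sys/kernel/perf_event_paranoid, and compile with -O0 like the OP so the loop survives.

#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <sys/ioctl.h>
#include <unistd.h>
#include <cstring>
#include <cstdint>
#include <iostream>

const int N = 100000;

void run() {                       // the OP's delay loop
  for (int j = 0; j < N; j++) {
  }
}

// Open a counter for this process's user-space core clock cycles.
static int open_cycle_counter() {
  perf_event_attr attr;
  std::memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_HARDWARE;
  attr.config = PERF_COUNT_HW_CPU_CYCLES;  // core cycles, not wall time
  attr.disabled = 1;
  attr.exclude_kernel = 1;
  attr.exclude_hv = 1;
  return syscall(__NR_perf_event_open, &attr, 0 /*this process*/,
                 -1 /*any CPU*/, -1 /*no group*/, 0);
}

int main() {
  int fd = open_cycle_counter();
  if (fd < 0) { std::cerr << "perf_event_open failed\n"; return 1; }
  for (int i = 0; i < 1000; i++) {
    ioctl(fd, PERF_EVENT_IOC_RESET, 0);
    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    run();
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
    uint64_t cycles = 0;
    if (read(fd, &cycles, sizeof(cycles)) != sizeof(cycles))
      std::cerr << "read failed\n";
    std::cout << "cycles: " << cycles << '\n';  // stays ~constant per run()
  }
}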

If you tested on a Skylake with kernel support for the new power-management mode (hardware P-states, aka Speed Shift, where the hardware takes full control of the clock speed), ramp-up would happen much faster.

If you leave it running for a while on an Intel CPU with Turbo, you'll probably see the time per iteration increase again slightly once thermal limits require the clock speed to reduce back down to the maximum sustained frequency. (See Why can't my CPU maintain peak performance in HPC for more about Turbo letting the CPU run faster than it can sustain for high-power workloads.)


Introducing a usleep prevents Linux's CPU frequency governor from ramping up the clock speed, because the process isn't generating 100% load even at minimum frequency. (I.e. the kernel's heuristic decides that the CPU is running fast enough for the workload that's running on it.)
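
To watch the governor's decision from inside the program, here's a sketch that assumes the standard cpufreq sysfs layout (present on the OP's Ubuntu 14.04): read scaling_cur_freq for the core we're currently running on between iterations (compile with -std=c++11 for std::to_string). Without the usleep, the reported frequency should climb after a couple dozen iterations; with it, it should stay near the minimum.

#include <sched.h>    // sched_getcpu() (glibc, Linux-specific)
#include <unistd.h>
#include <fstream>
#include <iostream>
#include <string>

const int N = 100000;

void run() {                       // the OP's delay loop
  for (int j = 0; j < N; j++) {
  }
}

// Current frequency (kHz) of the core we're on, via cpufreq's sysfs files.
long current_freq_khz() {
  std::string path = "/sys/devices/system/cpu/cpu" +
                     std::to_string(sched_getcpu()) +
                     "/cpufreq/scaling_cur_freq";
  std::ifstream f(path);
  long khz = -1;                   // -1 if the file doesn't exist
  f >> khz;
  return khz;
}

int main() {
  for (int i = 0; i < 50; i++) {
    run();
    std::cout << "iter " << i << ": " << current_freq_khz() << " kHz\n";
    //usleep(1000);  // uncomment: frequency should stay near the minimum
  }
}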



Comments on other theories:

Re: David's theory that a potential context switch from usleep could pollute caches: that's not a bad idea in general, but it doesn't help explain this code.

Cache / TLB pollution isn't important at all for this experiment. There's basically nothing inside the timing window that touches memory other than the end of the stack. Most of the time is spent in a tiny loop (1 line of instruction cache) that only touches one int of stack memory. Any potential cache pollution during usleep is a tiny fraction of the time for this code (real code will be different)!

In more detail for x86:

The call to clock() itself might cache-miss, but a code-fetch cache miss delays the starting-time measurement, rather than being part of what's measured. The second call to clock() will almost never be delayed, because it should still be hot in cache.

The run function may be in a different cache line from main (since gcc marks main as "cold", so it gets optimized less and placed with other cold functions/data). We can expect one or two instruction-cache misses. They're probably still in the same 4k page, though, so main will have triggered the potential TLB miss before entering the timed region of the program.

gcc -O0 compiles the OP's loop (see it on the Godbolt compiler explorer) to a compare / memory-destination add / jump sequence, keeping the loop counter in memory on the stack.

The empty loop keeps its counter in stack memory, so the loop runs at one iteration per ~6 cycles on the OP's IvyBridge CPU, thanks to the store-forwarding latency that's part of add with a memory destination (read-modify-write). 100k iterations * 6 cycles/iteration is 600k cycles, which dominates the contribution of at most a couple of cache misses (~200 cycles each for code-fetch misses, which prevent further instructions from issuing until they're resolved). 600k cycles is about 0.18 ms at the i3-3220's rated 3.3 GHz, and about 0.37 ms at a low idle frequency like 1.6 GHz, which lines up with the observed 0.2 ms and 0.4 ms per-iteration times.
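
The store-forwarding bottleneck is easy to reproduce even with optimization enabled. In the sketch below, a volatile counter forces the same load/add/store round trip that -O0 produces, while a plain counter stays in a register (the empty GNU-extension asm volatile statement just stops the compiler from deleting the otherwise-dead loop). Built with g++ -O2, the memory-destination version should run several times slower per iteration:

#include <time.h>
#include <iostream>

const int N = 100000;

// Counter forced into memory: each increment is a load/add/store, so every
// iteration waits on store-forwarding (~6 cycles on IvyBridge).
void run_memory() {
  volatile int j = 0;
  while (j < N) j = j + 1;
}

// Counter stays in a register (~1 cycle per iteration at -O2); the empty
// volatile asm keeps the compiler from optimizing the loop away.
void run_register() {
  for (int j = 0; j < N; j++)
    asm volatile("");
}

int main() {
  clock_t t0 = clock();
  for (int i = 0; i < 1000; i++) run_memory();
  clock_t t1 = clock();
  for (int i = 0; i < 1000; i++) run_register();
  clock_t t2 = clock();
  std::cout << "memory counter:   " << (t1 - t0) / 1000.0 << " ms total\n";
  std::cout << "register counter: " << (t2 - t1) / 1000.0 << " ms total\n";
}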

Out-of-order execution and store-forwarding should mostly hide the potential cache miss on accessing the stack (as part of the call instruction).

Even if the loop counter were kept in a register (one iteration per cycle), 100k cycles would still be a lot compared to a couple of cache misses.

Peter Cordes
  • I increased the value of `N` by 100x and used the `cpufreq-info` command; I found the CPU still works at the minimum frequency while the code is running. – phyxnj Jul 11 '16 at 07:17
  • @phyxnj: with `usleep` un-commented? It ramps up for me with N=10000000. (I use `grep MHz /proc/cpuinfo` since I never got around to installing cpufreq-utils on this machine). Actually I just found that `cpupower frequency-info` shows what cpufreq-info did, for one core. – Peter Cordes Jul 11 '16 at 07:30
  • @phyxnj: are you sure you're looking at all cores, and not just one core? `cpupower` seems to default to just core 0. – Peter Cordes Jul 11 '16 at 07:34
  • `grep MHz /proc/cpuinfo` does show the CPU frequency increasing. `cpufreq-info` may be monitoring a random core of the CPU. I think you are right; this is probably the cause of the problem. – phyxnj Jul 11 '16 at 07:42
  • 1
    @phyxnj: It's not random, and the core number is printed out in the output. e.g. http://www.thinkwiki.org/wiki/How_to_use_cpufrequtils. It almost certainly defaults to core 0 only. The only thing that's unpredictable is which core your process will run on. – Peter Cordes Jul 11 '16 at 08:33
3

A call to usleep may or may not result in a context switch. If it does, it will take longer than if it doesn't.
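
One way to probe this on Linux is getrusage, whose ru_nvcsw field counts the process's voluntary context switches; a sketch (whether the count goes up during a 1 ms sleep depends on the kernel's decision described above):

#include <sys/resource.h>
#include <unistd.h>
#include <iostream>

int main() {
  rusage before, after;
  getrusage(RUSAGE_SELF, &before);
  usleep(1000);                    // the sleep from the OP's commented line
  getrusage(RUSAGE_SELF, &after);
  // 1 here means the kernel scheduled away from this task during the sleep;
  // 0 would mean the sleep completed without a full context switch.
  std::cout << "voluntary context switches during usleep: "
            << (after.ru_nvcsw - before.ru_nvcsw) << '\n';
}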

David Schwartz
  • 1
    `usleep` relinquishes the CPU voluntarily, so there should be a context switch for sure (even if the system is idle), isn't it? – rakib_ Jul 11 '16 at 04:49
  • 1
    @rakib Not if there's nothing to switch contexts to or if the time interval is too short. When you're talking about less than 10ms or so, the OS may decide not to do a context switch. – David Schwartz Jul 11 '16 at 04:59
  • @rakib: There's a switch to kernel mode and back for sure. There might *not* be a switch to another process before resuming the process that called `usleep`, so pollution of caches / TLBs / branch predictors may be minimal. – Peter Cordes Jul 11 '16 at 05:52
  • @David: Linux handles timing at nanosecond precision, so milliseconds shouldn't be the issue. But consider what happens when a task relinquishes the CPU: at wake-up time it goes back into the run queue, and there's no guarantee it will be running immediately. Perhaps that might be a reason. – rakib_ Jul 11 '16 at 06:09
  • @rakib It has nothing to do with precision. It has to do with a binary decision whether or not to perform a context switch. It's a yes or no decision. – David Schwartz Jul 11 '16 at 06:10
  • @Peter: well, if there's no runnable task the CPU will remain idle. But what does the process that called `usleep` do when it doesn't switch? Spin on the CPU? And if there's no other process to run, is there any chance of cache/TLB invalidation? – rakib_ Jul 11 '16 at 06:15
  • **Cache / TLB pollution isn't important at all for this experiment**. There's nothing inside the timing window other than the call to `clock()` itself that touches memory other than the stack, and `clock()` will be hot on the second call. Besides `clock`, most of the cycles (100k * ~6 cycles per iteration (store-forwarding latency) for a loop that keeps its counter in a local on the stack, on a typical x86) are spent in a busy-loop that only touches stack memory. I think gcc -O3 would optimize away the loop entirely, but 100k * 1 cycle is still a lot compared to one miss in clock. – Peter Cordes Jul 11 '16 at 06:17
  • @rakib It's up to the OS, based on some complex decision-making involving things like power management and what hardware timers are available on the platform. It may spin. It may use a hardware timer. Above 10ms, pretty much all modern Linux platforms use hardware sleep. Below that, it's hardware specific. It may depend on what other tasks are ready to run, it may depend on how much of the timeslice is left, it may depend on dynamic scheduler priorities (what this task has done in the recent past), and so on. – David Schwartz Jul 11 '16 at 06:18
  • @rakib: The process is asleep. I think you mean "what does the CPU that was running the process that called `usleep` do?" If the sleep is short enough, yes it may just spin in a [calibrated delay-loop](https://en.wikipedia.org/wiki/BogoMips). Otherwise the CPU will set a timer to wake itself up at the right time, then run a `HLT` or something to enter a power-saving sleep if there's nothing else to run, and no kernel housekeeping tasks to do. IDK Linux internals well enough to know how much work could be done, and how much cache pollution could occur. – Peter Cordes Jul 11 '16 at 06:30
  • I took a quick look into the relevant Linux code, but didn't find any such thing beyond inserting a timer event for a future wake-up and calling schedule(). That's what I was trying to say. I know there are hardware/software-based timers and kernel config issues too. But the point of any sleep is to relinquish the CPU, and therefore a context switch. To back my point up, I wrote a program that just calls `usleep(10)` and ran `perf stat` on it; the output showed 1 context switch during its run time (just an indication). – rakib_ Jul 11 '16 at 06:36
  • 2
    @rakib Then `schedule` figures out how long until the next thing it has to do and then decides how to wait, possibly scheduling something else, possibly using a hardware timer, possibly not. – David Schwartz Jul 11 '16 at 06:42
  • @rakib: Depending on terminology, a "context switch" can mean that another user-space thread/process runs on the current CPU, or just a transition from user mode to kernel mode (and back to the same process), like for a trivial system call like `getpid` or `gettimeofday` (Or a better example would be a syscall that doesn't have a user-space implementation exported in the VDSO, but making those syscalls is still possible with asm). The transition to kernel mode and back always happens, even if the kernel just did an `iret` or `sysreturn` right away (on x86 for example). – Peter Cordes Jul 11 '16 at 07:17
  • 1
    @rakib: If nothing significant runs on the CPU before it returns to the caller of `usleep`, some people would say there was no context switch, even if the hardware went to sleep (with a `hlt` instruction) for a short time. There's definitely minimal cache / TLB pollution in that case, and IIRC no TLB invalidation. (I forget exactly how the page tables for kernel mode work, but I don't think the whole TLB has to be blown away every system call). – Peter Cordes Jul 11 '16 at 07:24
  • @Peter Typically "context switch" means a change in the current execution context, which might be running user-space or kernel-space code. To clarify whether a context switch happens at the time of calling usleep or not: I was trying to point out that yes, it does happen, rather than "may or may not happen". – rakib_ Jul 11 '16 at 07:30
  • @rakib: If your definition of context switch include a transition from user mode to kernel mode, then obviously every system call including `usleep` causes two (on entry and on return). More may happen if another user-space task gets the CPU before returning. IDK what you're trying to say with "may not happen". – Peter Cordes Jul 11 '16 at 07:36
  • @Peter No, I don't mean that; if for some reason it appears that way to you, then I can't help it. By saying "may not happen" I meant what David posted in his answer. – rakib_ Jul 11 '16 at 08:00