This simple C program first creates an array of 0xFFFFFF elements and then makes two passes over it, measuring how long each pass takes:

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define TESTSIZ 0xffffff

char testcases[TESTSIZ];

/* fill the test array with pseudo-random values in [0, 127] */
void gentestcases(void)
{
        size_t i = 0;
        while(i < TESTSIZ)
                testcases[i++] = rand()%128;

        return;
}

/* difference end - beg in nanoseconds */
long long time_elapsed(struct timespec beg, struct timespec end)
{
        if(end.tv_nsec < beg.tv_nsec) {
                end.tv_nsec += 1000000000;
                end.tv_sec--;
        }

        return 1000000000ll*(end.tv_sec-beg.tv_sec) + end.tv_nsec-beg.tv_nsec;
}

/* time one pass over the whole array; NB: func is accepted but unused,
   islower is hardcoded below (see the comment thread) */
long long test( int(*func)(int) )
{
        struct timespec beg, end;

        clock_gettime(CLOCK_MONOTONIC, &beg);

        int volatile sink;      /* volatile, so the islower() calls aren't optimized away */
        size_t i = 0;
        while(i < TESTSIZ)
                sink = islower(testcases[i++]);

        clock_gettime(CLOCK_MONOTONIC, &end);

        return time_elapsed(beg, end);
}

int main(void)
{
        gentestcases();

        printf("1st pass took %lld nsecs\n", test(islower));
        printf("2nd pass took %lld nsecs\n", test(islower));
        return 0;
}

I compile it with `gcc -O2 -std=gnu89 -o sb sillybench.c`

Usually I get the result that processing the array the second time is slower. The effect is small but noticeable (1-3 ms) and, with a single exception, repeatable:

m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13098789 nsecs
2nd pass took 13114677 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13052105 nsecs
2nd pass took 13134187 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13118069 nsecs
2nd pass took 13074199 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13038579 nsecs
2nd pass took 13079995 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13070334 nsecs
2nd pass took 13324378 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13031000 nsecs
2nd pass took 13167349 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13019961 nsecs
2nd pass took 13310211 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13041332 nsecs
2nd pass took 13311737 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13030913 nsecs
2nd pass took 13177423 nsecs
m@m-X555LJ ~/UVA/fastIO $ ./sb
1st pass took 13060570 nsecs
2nd pass took 13387024 nsecs

Why is this the case? If anything, I'd expect the first pass over the array to be slower, not the second!

If this matters:

m@m-X555LJ ~/UVA/fastIO $ gcc --version
gcc (Ubuntu 5.4.0-6ubuntu1~16.04.4) 5.4.0 20160609
Copyright (C) 2015 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.  There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

System:    Host: m-X555LJ Kernel: 4.4.0-21-generic x86_64 (64 bit gcc: 5.3.1)
           Desktop: Cinnamon 3.0.7 (Gtk 2.24.30) Distro: Linux Mint 18 Sarah

CPU:       Dual core Intel Core i5-5200U (-HT-MCP-) cache: 3072 KB
           flags: (lm nx sse sse2 sse3 sse4_1 sse4_2 ssse3 vmx) bmips: 8786
           clock speeds: max: 2700 MHz 1: 2200 MHz 2: 2202 MHz 3: 2200 MHz
           4: 2200 MHz
  • have you tried setting the random seed before execution? Also: try replacing `rand()` with a constant or `i` – Jean-François Fabre Jun 12 '17 at 15:53
  • Interesting note, I cut/pasted your code and got the same results. But when I removed the function pointer argument to `test`, since you aren't using it, I got the results you are expecting, i.e. the second run is faster. – jiveturkey Jun 12 '17 at 15:57
  • Interesting. I'm testing with Visual Studio and it's inlining almost exactly the same assembly for both passes. I would guess that the difference is due to some cache magic. – jforberg Jun 12 '17 at 15:58
  • `islower` is hardcoded in your test BTW. – Jean-François Fabre Jun 12 '17 at 15:58
  • @jiveturkey Weird, I removed the function pointer and was still getting the same results. –  Jun 12 '17 at 15:59
  • Intel Core i7 3.4GHz on Linux 4.8.0 – jiveturkey Jun 12 '17 at 16:01
  • @Jean-FrançoisFabre I’m sorry for that. I meant `sink = func(testcases[i++]);` of course, I hardcoded this `islower` by mistake. Interesting: when I replaced `sink = islower(testcases[i++]);` with `sink = func(testcases[i++]);` the whole program ran 5 times slower and the difference between the two passes ballooned to 10 ms! –  Jun 12 '17 at 16:03
  • probably because the compiler could not inline `islower` when passing it as a function pointer. – Jean-François Fabre Jun 12 '17 at 16:04
  • @Jean-FrançoisFabre Sure thing, but I'd expect it to at least inline `test` –  Jun 12 '17 at 16:05
  • Is there any difference if you compile with `-std=gnu99` or `-std=gnu11`? Have you tried a third run? – Bob__ Jun 12 '17 at 16:05
  • @Bob__ Can't see much difference between `-std=gnu89` and `-std=gnu11`. –  Jun 12 '17 at 16:07
  • @gaazkam Visual Studio inlines test() for me but not GCC. It's perhaps not the ideal candidate for inlining since it's only called twice. Inlining would allow the compiler to call islower() directly instead of through a pointer held in a register. On GCC, you could try __attribute__((always_inline)) to force it to inline test(), just to see what happens. – jforberg Jun 12 '17 at 16:07
  • @Bob__ A third pass runs even slower! And the difference between the third run and the second run is 3-4x greater than the difference between the second run and the first run! –  Jun 12 '17 at 16:09
  • @gaazkam have you tried to replace `rand()` by just `i` ? – Jean-François Fabre Jun 12 '17 at 16:10
  • @Marian Having replaced `sink = islower(testcases[i++]);` with `sink = testcases[i++] + 1;` the effect is slighter, but still noticeable. However, the third pass may sometimes run as fast as the first one then. –  Jun 12 '17 at 16:12
  • @Jean-FrançoisFabre Adding an initial `srand(time(NULL));` doesn't seem to change much. –  Jun 12 '17 at 16:15
  • `srand(0)` would be better, since `time(NULL)` is different each time... – Jean-François Fabre Jun 12 '17 at 16:15
  • but I was thinking (more radically): `testcases[i++] = i%128;` – Jean-François Fabre Jun 12 '17 at 16:16
  • @Jean-FrançoisFabre But you were asking for a random seed! –  Jun 12 '17 at 16:16
  • I meant: the same seed for `random` :) and resetting it between calls of course – Jean-François Fabre Jun 12 '17 at 16:17
  • @Jean-FrançoisFabre As for `testcases[i++] = i%128;` – this seems to make the first pass randomly run faster or slower than the second one. Well I wanted to avoid this because of this: https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array –  Jun 12 '17 at 16:18
  • @Jean-FrançoisFabre As you've asked, I tried regenerating the table after reseeding `rand` with the same seed each pass. The effect is the same. –  Jun 12 '17 at 16:21
  • The "problem" is certainly not related to `rand` usage, as the measurement (the `test` function) does not use it... It's more like a timer-resolution problem and/or an OS paging problem. – Jean-Baptiste Yunès Jun 12 '17 at 17:06
  • I could be wrong, but could it be that new processes are given a slightly higher priority (and thus more CPU time) for the first few seconds? – Daniel Jour Jun 12 '17 at 18:09
  • 1) On a busy system: scheduling. After the first loop, the process has used up time allotted to it by the OS. It is scheduled less during the execution of the second loop. 2) On an idle system: thermal effects. During the first loop, the process causes the CPU to heat up, therefore reducing the maximum possible CPU frequency during the second loop. – EOF Jun 12 '17 at 18:25
  • @EOF Thermal effects: the code executes very fast; while that could be a reasonable cause in general, I doubt it happens here, but I'm not sure. Do CPUs heat up that fast? Scheduling seems much more likely to be the problem; I tried with other processes as perturbations and was able to observe measurement problems. – Jean-Baptiste Yunès Jun 12 '17 at 18:41
  • @Jean-BaptisteYunès I actually find the thermal effects the most plausible explanation; I somewhat doubt the OP benchmarked on a busy system. And CPUs *do* heat up fast when running above their maximum nominal frequency (using something like Turbo Boost). Consider that power dissipation scales with something like the third power of frequency, and the heat capacity of silicon is quite limited. Also, both of the OP's tests together take ~26 milliseconds, which is an eternity to a modern CPU. Given that the difference is only ~1-3%, thermal throttling could be starting *during* the second loop. – EOF Jun 12 '17 at 19:09
  • @EOF Well, I was benchmarking this on the very same laptop that had Firefox with this SO question opened in another window, as well as several background processes running (it's amazing how much `ps ax` shows on a seemingly idle system...). Still, I suppose this doesn't count as a "busy system". As for your thermal explanation: I tried testing this with 5 runs. The slowest run was the third one, if I recall. The fifth one was noticeably faster than the third one. I suppose the heat should rather be accumulating? –  Jun 12 '17 at 19:19
  • @gaazkam: What kind of timescales are we talking about? As I said, a run seems to take ~26 milliseconds, which (I believe) is enough to raise the heat sufficiently to cause thermal throttling. On the other hand, starting multiple consecutive test runs will probably take longer (since you, as a human, trigger them), and may leave enough time for the CPU to cool off. Did you write a script that triggers the runs one immediately after the other? Did you write a script that runs them in parallel? – EOF Jun 12 '17 at 19:31
  • In the code I posted in the question I wrote: `printf("1st pass took %lld nsecs\n", test(islower)); printf("2nd pass took %lld nsecs\n", test(islower));` Well when I was talking about `5 runs` I meant I added `printf("3rd pass took %lld nsecs\n", test(islower)); printf("4th pass took %lld nsecs\n", test(islower)); printf("5th pass took %lld nsecs\n", test(islower));` –  Jun 12 '17 at 19:37
  • I turned up the load a notch (executed the test loop 10 times in a single pass, for a total of 400 ms per pass on my machine) and ran two batches of 10 consecutive tests: one back-to-back, and one with 3 s sleeps between runs. In the back-to-back batch, the 1st or 2nd pass would be faster at random. In the second batch (with 3 s sleeps between runs), the 2nd pass was consistently slower. This seems to confirm the thermal-effects theory; a sketch of this setup follows the thread. – shinobi Jun 12 '17 at 19:55
  • If you want to exclude thermal effects, you might be able to disable SpeedStep/Turbo Boost in the BIOS. Note that a mobile part like the one in this question is particularly susceptible to thermal throttling, and particularly likely to be able to boost significantly above nominal frequency from idle. – EOF Jun 12 '17 at 20:01
  • @EOF and gaazkam I tried launching the test in a loop (up to 20 times) and the measurements vary with no pattern. I also modified the array size; with some sizes the second run was almost always slightly longer than the first, but with other sizes it was the contrary! OK, thermal effects can lead to very strange things... – Jean-Baptiste Yunès Jun 12 '17 at 20:12
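
Following up on shinobi's methodology, here is a minimal sketch of such a driver: a replacement main() for sillybench.c, compiled the same way. PASSES and the cool-down length are arbitrary choices, and sleep() needs an extra #include <unistd.h>:

#define PASSES   10
#define COOLDOWN 3      /* seconds between passes; set to 0 for back-to-back runs */

int main(void)
{
        int p;

        gentestcases();

        for(p = 1; p <= PASSES; p++) {
                printf("pass %2d took %lld nsecs\n", p, test(islower));
                if(COOLDOWN > 0 && p < PASSES)
                        sleep(COOLDOWN);        /* let the CPU cool down and drop out of turbo */
        }

        return 0;
}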

1 Answer

This effect is most likely caused by turbo mode (Intel Turbo Boost Technology), which allows a processor core to run above its nominal clock frequency. One governing factor is temporal bursts*: commonly, for some fraction of a second, the processor can sustain its highest frequency. It is therefore highly likely that the first loop runs at a higher clock frequency than the second.
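
One way to watch this happen (a sketch, assuming a Linux cpufreq sysfs interface; the exact path and its availability vary between systems) is to sample the core's current frequency around each timed pass:

#include <stdio.h>

/* read the current frequency of CPU 0 in kHz from the cpufreq sysfs
   interface; returns -1 if the file isn't available */
long read_cur_freq_khz(void)
{
        FILE *f = fopen("/sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq", "r");
        long khz = -1;

        if(f) {
                if(fscanf(f, "%ld", &khz) != 1)
                        khz = -1;
                fclose(f);
        }

        return khz;
}

Printing read_cur_freq_khz() before and after each pass should show values above the nominal 2200000 kHz while the core is boosting; pinning the benchmark to one core (e.g. taskset -c 0 ./sb) makes the cpu0 reading meaningful.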

You can confirm that by manually pinning the clock to the nominal frequency (2.20 GHz for your processor), e.g. using cpufrequtils or cpupower. However, many systems use the intel_pstate driver, which doesn't allow the userspace governor. Here's how you can disable turbo mode for intel_pstate, or disable intel_pstate altogether.
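
For intel_pstate specifically, turbo is controlled by its no_turbo file in sysfs. A minimal sketch of toggling it from C (assumes the intel_pstate driver is active and requires root; writing 1 disables turbo, 0 re-enables it):

#include <stdio.h>

/* write 0 or 1 to intel_pstate's no_turbo knob; returns 0 on success */
int set_no_turbo(int disable)
{
        FILE *f = fopen("/sys/devices/system/cpu/intel_pstate/no_turbo", "w");

        if(!f)
                return -1;      /* no intel_pstate, or insufficient permissions */
        fprintf(f, "%d\n", disable ? 1 : 0);

        return fclose(f);
}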

Without turbo mode, performance should be uniform.

*: Temperature is another factor, but I doubt it plays a role at ~10 ms of benchmark time. To illustrate, assume the CPU exceeds its 15 W TDP and draws 20 W: even a tiny 1 g of copper would heat up by only about 0.5 K after 10 ms. I commonly see a short, distinct frequency burst (lasting tens of milliseconds to a few seconds) followed by a slow and steady decline as temperature builds up (over tens of seconds to minutes).
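
For reference, the arithmetic behind that estimate (0.385 J/(g*K) is the specific heat of copper; the 20 W draw is the assumption from the footnote):

#include <stdio.h>

int main(void)
{
        double power    = 20.0;   /* W, assumed draw above the 15 W TDP */
        double duration = 0.010;  /* s, roughly one benchmark pass */
        double mass     = 1.0;    /* g of copper */
        double c_cu     = 0.385;  /* J/(g*K), specific heat of copper */

        /* energy = power * time; dT = energy / (mass * specific heat) */
        printf("dT = %.2f K\n", power * duration / (mass * c_cu));     /* ~0.52 K */

        return 0;
}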

Note: gentestcases runs for a significant amount of time (e.g. 240 ms) before the first actual test, so it already contributes to the processor's turbo "sprint".

– Zulan