2

I used stress-ng to stress my system and see how that affects the performance of a program I wrote.

The program itself is a neural network written in C++, mainly composed of nested loops doing multiplications, and it uses about 1G of RAM overall.
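To give an idea of the shape of the code, here is a simplified, hypothetical sketch of the kind of loop nest I mean (not my actual program); each layer is basically a matrix-vector product followed by a nonlinearity:

// Hypothetical sketch only: the real program has more layers and more data,
// but the hot code is nested multiply-accumulate loops like this.
#include <cmath>
#include <cstddef>
#include <vector>

void forward(const std::vector<float>& W,    // weights, rows x cols, row-major
             const std::vector<float>& in,   // input activations, size cols
             std::vector<float>& out,        // output activations, size rows
             std::size_t rows, std::size_t cols) {
    for (std::size_t r = 0; r < rows; ++r) {
        float acc = 0.0f;
        for (std::size_t c = 0; c < cols; ++c)
            acc += W[r * cols + c] * in[c];  // multiply-accumulate over one row
        out[r] = std::tanh(acc);             // nonlinearity
    }
}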

I imposed some memory stress on the system using:

stress-ng --vm 4 --vm-bytes 2G -t 100s

which creates 4 workers spinning on mmap, each allocating 2G of RAM. This slows down the execution of my program significantly (from about 150ms to 250ms), but the reason for the slowdown doesn't seem to be a lack of memory or memory bandwidth: the CPU frequency drops from 3.4GHz (without stress-ng) to 2.8GHz (with stress-ng), while CPU utilization stays about the same (99%), as expected.

I measured the CPU frequency using

sudo perf stat -B ./my_program

Does anybody know why memory stress slows down the CPU?

My CPU is an Intel(R) Core(TM) i5-8250U and my OS is Ubuntu 18.04.

Kind regards, lpolari

L Polari
  • 33
  • 5
  • Looking at the Intel page, 3.4GHz is your boost clock, so if you spawn more processes and the CPU throttles down due to temperatures, then that would explain it, no? – Borgleader Aug 13 '20 at 16:50
  • It's not clear to me: when you say "slows down", compared to what? Also, how do you know that the core frequency is the only reason, or the biggest reason, for the performance degradation? What's the execution time in terms of core clock cycles? – Hadi Brais Aug 13 '20 at 18:56

3 Answers

6

Skylake-derived CPUs do lower their core clock speed when bottlenecked on load / stores, at energy vs. performance settings that favour more powersaving. Surprisingly, you can construct artificial cases where this downclocking happens even with stores that all hit in L1d cache, or loads from uninitialized memory (still CoW mapped to the same zero pages).

Skylake introduced full hardware control of CPU frequency (hardware P-state = HWP). See https://unix.stackexchange.com/questions/439340/what-are-the-implications-of-setting-the-cpu-governor-to-performance for background. The frequency decision can take into account internal performance monitoring, which can notice things like spending most cycles stalled, or what it's stalled on. I don't know exactly what heuristic Skylake uses.

You can reproduce this (footnote 1) by looping over a large array without making any system calls. If it's large (or you stride through cache lines in an artificial test), perf stat ./a.out will show that the average clock speed is lower than for normal CPU-bound loops.
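A rough C++ sketch of that kind of loop (buffer size and iteration count are arbitrary; the actual asm test I used is in footnote 1):

// Store one int per cache line, wrapping around a large buffer, with no system
// calls in the timed loop. Build with e.g. g++ -O2 and run under perf stat.
#include <cstddef>
#include <vector>

int main() {
    constexpr std::size_t line = 64;               // cache line size in bytes
    std::vector<char> buf(512 * 1024 * 1024);      // large enough to miss in cache

    std::size_t i = 0;
    for (long n = 0; n < 400'000'000L; ++n) {
        // volatile store so the compiler can't optimize the buffer away
        *reinterpret_cast<volatile int*>(&buf[i]) = static_cast<int>(n);
        i += line;
        if (i >= buf.size()) i = 0;                // wrap back to the start
    }
    return 0;
}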


In theory, if memory is totally not keeping up with the CPU, lowering the core clock speed (and holding memory controller constant) shouldn't hurt performance much. In practice, lowering the clock speed also lowers the uncore clock speed (ring bus + L3 cache), somewhat worsening memory latency and bandwidth as well.

Part of the latency of a cache miss is getting the request from the CPU core to the memory controller, and single-core bandwidth is limited by max concurrency (the number of outstanding requests one core can track) / latency. See: Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?
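To put rough numbers on that, here's a back-of-the-envelope calculation (the concurrency and latency figures below are assumptions for illustration, not measurements; Skylake-client cores can track on the order of 10-12 outstanding L1d misses):

// Little's-law style estimate: max single-core DRAM bandwidth
//   ~= outstanding requests * line size / latency
#include <cstdio>

int main() {
    const double line_bytes   = 64.0;  // bytes per cache line
    const double max_requests = 12.0;  // assumed line-fill buffers per core
    const double latency_ns   = 80.0;  // assumed load latency to DRAM

    std::printf("~%.1f GB/s per core at %.0f ns latency\n",
                max_requests * line_bytes / latency_ns, latency_ns);
    // If downclocking the uncore pushes latency to ~90 ns, the ceiling drops
    // proportionally even though DRAM itself didn't change:
    std::printf("~%.1f GB/s per core at 90 ns latency\n",
                max_requests * line_bytes / 90.0);
}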

e.g. my i7-6700k drops from 3.9GHz to 2.7GHz when running a microbenchmark that only bottlenecks on DRAM, at default boot-up settings. (Also, it only goes up to 3.9GHz instead of the 4.0GHz all-core or 4.2GHz 1-or-2-core turbo configured in the BIOS, with the default balance_power EPP setting on boot or with balance_performance.)

This default doesn't seem very good: it's too conservative for "client" chips, where a single core can nearly saturate DRAM bandwidth, but only at full clock speed. Or too aggressive about powersaving, if you look at it from the other POV, especially for chips like my desktop with a high TDP (95W) that can sustain full clock speed indefinitely, even when running power-hungry stuff like x265 video encoding that makes heavy use of AVX2.

It might make more sense with a ULV 15W chip like your i5-8250U to try to leave more thermal / power headroom for when the CPU is doing something more interesting.


This behaviour is governed by the Energy / Performance Preference (EPP) setting. It happens fairly strongly at the default balance_power setting. It doesn't happen at all at full performance, and some quick benchmarks indicate that balance_performance also avoids this powersaving slowdown. I use balance_performance on my desktop.

"Client" (non-Xeon) chips before Ice Lake have all cores locked together so they run at the same clock speed (and will all run higher if even one of them is running something not memory bound, like a while(1) { _mm_pause(); } loop). But there's still an EPP setting for every logical core. I've always just changed the settings for all cores to keep them the same:

On Linux, reading the settings:

$ grep . /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference
/sys/devices/system/cpu/cpufreq/policy0/energy_performance_preference:balance_performance
/sys/devices/system/cpu/cpufreq/policy1/energy_performance_preference:balance_performance
...
/sys/devices/system/cpu/cpufreq/policy7/energy_performance_preference:balance_performance

Writing the settings:

sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;
 do echo balance_performance > "$i"; done'
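(As an aside, the while(1) { _mm_pause(); } trick mentioned above is just an ALU-bound spinner pinned to an otherwise-idle core; a minimal sketch, assuming an x86 compiler with <immintrin.h>:)

// Spin forever executing pause. Being ALU/latency-bound, it keeps the shared
// core clock high on pre-Icelake client chips while you benchmark something
// memory-bound on another core. Pin it to a spare core, e.g.:
//   g++ -O2 spin.cpp -o spin && taskset -c 1 ./spin &
#include <immintrin.h>

int main() {
    for (;;)
        _mm_pause();   // hint that this is a spin-wait loop
}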



Footnote 1: experimental example:

Store 1 dword per cache line, advancing through contiguous cache lines until end of buffer, then wrapping the pointer back to the start. Repeat for a fixed number of stores, regardless of buffer size.

;; t=testloop; nasm -felf64 "$t.asm" && ld "$t.o" -o "$t" && taskset -c 3 perf stat -d -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread ./"$t"

;; nasm -felf64 testloop.asm
;; ld -o testloop testloop.o
;; taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop

; or idq.mite_uops 

default rel
%ifdef __YASM_VER__
;    CPU intelnop
;    CPU Conroe AMD
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:

    lea        rdi, [buf]
    lea        rsi, [endbuf]
;    mov        rsi, qword endbuf           ; large buffer.  NASM / YASM can't actually handle a huge BSS and hit a failed assert (NASM) or make a binary that doesn't reserve enough BSS space.

    mov     ebp, 1000000000

align 64
.loop:
%if 0
      mov  eax, [rdi]              ; LOAD
      mov  eax, [rdi+64]
%else
      mov  [rdi], eax              ; STORE
      mov  [rdi+64], eax
%endif
    add  rdi, 128
    cmp  rdi, rsi
    jae  .wrap_ptr        ; normally falls through, total loop = 4 fused-domain uops
 .back:

    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

.wrap_ptr:
   lea  rdi, [buf]
   jmp  .back


section .bss
align 4096
;buf:    resb 2048*1024*1024 - 1024*1024     ; just under 2GiB so RIP-rel still works
buf:    resb 1024*1024 / 64     ; 16kiB = half of L1d

endbuf:
  resb 4096        ; spare space to allow overshoot

Test system: Arch GNU/Linux, kernel 5.7.6-arch1-1. (And NASM 2.14.02, ld from GNU Binutils 2.34.0).

  • CPU: i7-6700k Skylake
  • motherboard: Asus Z170 Pro Gaming, configured in the BIOS for 1 or 2 core turbo = 4.2GHz, 3 or 4 core = 4.0GHz. But the default EPP setting on boot is balance_power, which only ever goes up to 3.9GHz. My boot script changes it to balance_performance, which still only goes to 3.9GHz so the fans stay quiet, but is less conservative.
  • DRAM: DDR4-2666 (irrelevant for this small test with no cache misses).

Hyperthreading is enabled, but the system is idle and the kernel won't schedule anything on the other logical core (the sibling of the one I pinned it to), so it has a physical core to itself.

However, this means perf is unwilling to use more programmable perf counters for one thread, so using perf stat -d to monitor L1d loads and replacements, and L3 hits / misses, would mean less accurate measurement of cycles and so on. It's negligible anyway, something like 424k L1-dcache-loads (probably in kernel page-fault handlers, interrupt handlers, and other overhead, because the loop itself has no loads). L1-dcache-load-misses is actually L1D.REPLACEMENT and is even lower, around 48k.

I used a few perf events, including exe_activity.bound_on_stores ("Cycles where the Store Buffer was full and no outstanding load"). (See perf list for descriptions, and/or Intel's manuals for more.)

EPP: balance_power: 2.7GHz downclock out of 3.9GHz

EPP setting: balance_power with sudo sh -c 'for i in /sys/devices/system/cpu/cpufreq/policy[0-9]*/energy_performance_preference;do echo balance_power > "$i";done'

The CPU throttles based on what the code is doing: with a pause loop on another core keeping clocks high, this code would run faster, as it would with different instructions in the loop.

# sudo ... balance_power
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 

 Performance counter stats for './testloop':

            779.56 msec task-clock:u              #    1.000 CPUs utilized          
            779.56 msec task-clock                #    1.000 CPUs utilized          
                 3      context-switches          #    0.004 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 6      page-faults               #    0.008 K/sec                  
     2,104,778,670      cycles                    #    2.700 GHz                    
     2,008,110,142      branches                  # 2575.962 M/sec                  
     7,017,137,958      instructions              #    3.33  insn per cycle         
     5,217,161,206      uops_issued.any           # 6692.465 M/sec                  
     7,191,265,987      uops_executed.thread      # 9224.805 M/sec                  
       613,076,394      exe_activity.bound_on_stores #  786.442 M/sec                  

       0.779907034 seconds time elapsed

       0.779451000 seconds user
       0.000000000 seconds sys

By chance, this run happened to land at exactly 2.7GHz. Usually there's some noise or startup overhead and it's a little lower. Note that 5,217,161,206 front-end uops / 2,104,778,670 cycles = ~2.48 average uops issued per cycle, out of a pipeline width of 4, so this is not low-throughput code. The instruction count is higher than the uop count because the compare/branch and dec/branch pairs macro-fuse. (I could have unrolled more so that even more of the instructions were stores and fewer were add and branch, but I didn't.)

(I re-ran the perf stat command a couple times so the CPU wasn't just waking from low-power sleep at the start of the timed interval. There are still page faults in the interval, but 6 page faults are negligible over a 3/4 second benchmark.)

balance_performance: full 3.9GHz, top speed for this EPP

No throttling based on what the code is doing.

# sudo ... balance_performance
$ taskset -c 3 perf stat -etask-clock:u,task-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,exe_activity.bound_on_stores -r1 ./"$t" 

 Performance counter stats for './testloop':

            539.83 msec task-clock:u              #    0.999 CPUs utilized          
            539.83 msec task-clock                #    0.999 CPUs utilized          
                 3      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 6      page-faults               #    0.011 K/sec                  
     2,105,328,671      cycles                    #    3.900 GHz                    
     2,008,030,096      branches                  # 3719.713 M/sec                  
     7,016,729,050      instructions              #    3.33  insn per cycle         
     5,217,686,004      uops_issued.any           # 9665.340 M/sec                  
     7,192,389,444      uops_executed.thread      # 13323.318 M/sec                 
       626,115,041      exe_activity.bound_on_stores # 1159.827 M/sec                  

       0.540108507 seconds time elapsed

       0.539877000 seconds user
       0.000000000 seconds sys

About the same on a clock-for-clock basis, although slightly more total cycles where the store buffer was full. (That's between the core and L1d cache, not off core, so we'd expect about the same for the loop itself. Using -r10 to repeat 10 times, that number is stable +- 0.01% across runs.)

performance: 4.2GHz, full turbo to the highest configured freq

No throttling based on what the code is doing.

# sudo ... performance
taskset -c 3 perf stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread -r1 ./testloop

 Performance counter stats for './testloop':

            500.95 msec task-clock:u              #    1.000 CPUs utilized          
            500.95 msec task-clock                #    1.000 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                 7      page-faults               #    0.014 K/sec                  
     2,098,112,999      cycles                    #    4.188 GHz                    
     2,007,994,492      branches                  # 4008.380 M/sec                  
     7,016,551,461      instructions              #    3.34  insn per cycle         
     5,217,839,192      uops_issued.any           # 10415.906 M/sec                 
     7,192,116,174      uops_executed.thread      # 14356.978 M/sec                 
       624,662,664      exe_activity.bound_on_stores # 1246.958 M/sec                  

       0.501151045 seconds time elapsed

       0.501042000 seconds user
       0.000000000 seconds sys

Overall performance scales linearly with clock speed, so this is a ~1.56x speedup vs. balance_power (and 1.44x for balance_performance, which tops out at the same 3.9GHz that balance_power allows at most).

With buffers large enough to cause L1d or L2 cache misses, there's still a difference in core clock cycles.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • When the CPU does this kind of slowdown, doesn't something similar to the AVX* frequency license happen? I.e. the program triggering the slowdown is not affected, but since power transitions are slow relative to code execution (including context switches), other programs may be affected (and power management also has some form of hysteresis). That may be what's happening to the OP's neural network: its CPU-bound code is affected by the lower frequency. Nice answer BTW, I didn't know of this behavior. – Margaret Bloom Aug 14 '20 at 19:59
  • @MargaretBloom: ALU/latency-bound code on one core will still keep all the cores pegged at max frequency even if they're running memory-bound code. At least on a pre-Icelake "client" chip where all cores share a frequency. (I've only tested with one single-threaded memory-bound process and another single-threaded `pause` loop, not *all* other cores running memory bound code, though.) Unlike AVX turbo licences, it's purely a power-saving heuristic, not an upper limit on how fast a core is willing to let itself run in a situation. – Peter Cordes Aug 14 '20 at 20:04
  • 1
    "However, this means perf is unwilling to use more programmable perf counters for one thread" - I'm pretty sure perf is not at fault here: if HT is enabled in the BIOS, there are only 4 counters available per hardware thread, AFAIK enforced by the CPU, regardless of whether a second thread is running at the moment or anything like that. It's one of the few resources you actually lose if HT is enabled rather than simply not running at the moment. – BeeOnRope Aug 22 '20 at 22:38
  • 1
    Your first example running at 2.48 uops/cycle, yet still downclocking, is quite interesting. It's a bit surprising it downclocks then: I thought the heuristic they used was something along the lines of "stall cycles with requests outstanding" but here that should be basically zero as the IPC is high. Maybe there is an additional heuristic based on the store buffer occupancy or something? Kind of backfires when the stores are all hitting in L1 since this scales 100% with frequency. – BeeOnRope Aug 22 '20 at 23:16
  • @BeeOnRope: Yeah, I was expecting to come up with examples that showed it running fast with a small buffer, and only downclocking with a large buffer. This seems like a CPU performance bug in the choice of heuristics for downclocking. I think `exe_activity.bound_on_stores` being a lot lower than cycles shows that the store buffer is sometimes full, but only for a fraction of the total cycles, so it's really aggressive downclocking. – Peter Cordes Aug 22 '20 at 23:29
2

It's important to remember that modern CPUs, especially those made by Intel, have variable clock frequencies. The CPU will run slowly when lightly loaded to conserve power, which extends battery life, but can ramp up under load.

The limiting factor is thermals: the CPU will only be allowed to get so hot before the frequency is trimmed to reduce power consumption and, by extension, heat generation.

On a chip with more than one core, a single core can be run very quickly without hitting thermal throttling. Two cores must run slower, since they produce effectively twice the heat, and when all four cores are in use, each has to share a smaller slice of the overall thermal budget.

It's worth checking your CPU temperature as the tests are running, since it will likely be hitting some kind of cap.
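One rough way to watch it on Linux, besides tools like lm-sensors or turbostat, is to poll the sysfs thermal zones; a sketch, assuming your platform exposes the CPU package as one of them (check each zone's "type" file):

// Print every sysfs thermal zone once per second; values are millidegrees C.
#include <chrono>
#include <fstream>
#include <iostream>
#include <string>
#include <thread>

int main() {
    for (;;) {
        for (int zone = 0; ; ++zone) {
            std::string base = "/sys/class/thermal/thermal_zone" + std::to_string(zone);
            std::ifstream type(base + "/type"), temp(base + "/temp");
            if (!type || !temp) break;   // ran out of zones
            std::string name;
            long milli_c = 0;
            type >> name;
            temp >> milli_c;
            std::cout << name << ": " << milli_c / 1000.0 << " C\n";
        }
        std::cout << "----\n";
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
}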

tadman
  • 208,517
  • 23
  • 234
  • 262
  • I doubt this is thermal throttling; more likely it's Skylake's intentional slowing down on memory-intensive workloads at conservative EPP settings, like the default. – Peter Cordes Aug 13 '20 at 19:13
  • The first paragraph is potentially misleading because it seems to suggest that when the number of active cores is smaller, the core frequency is also reduced. The number of active cores is only one factor that affects core frequency. Regarding the thermal limit, while you could be right, it's hard to say with high probability that this is the reason in this case. There can be many reasons for core frequency throttling. The i5-8250U with proper cooling shouldn't hit a thermal limit within 250ms even if all 4 cores are active. We need to see the output of `turbostat`. – Hadi Brais Aug 13 '20 at 19:14
  • @HadiBrais That's why I'm suggesting monitoring the temperature will provide additional insight. – tadman Aug 13 '20 at 19:14
  • 1
    But the second paragraph confidently says that "the limiting factor is thermals" and that's it. I'm saying that this could be the reason, but not necessarily. Checking the CPU temperature is not a bad idea, but it's better to see the output of `turbostat`, which would directly tell us why core frequency throttling happened. – Hadi Brais Aug 13 '20 at 19:21
2

The last time I looked at this, it was enabling the "energy-efficient Turbo" setting that allowed the processor to do this. Roughly speaking, the hardware monitors the Instructions Per Cycle and refrains from continuing to increase the Turbo frequency if increased frequency does not result in adequate increased throughput. For the STREAM benchmark, the frequency typically dropped a few bins, but the performance was within 1% of the asymptotic performance.

I don't know if Intel has documented how the "Energy Efficient Turbo" setting interacts with all of the various flavors of "Energy-Performance Preference". In our production systems "Energy Efficient Turbo" is disabled in the BIOS, but it is sometimes enabled by default....

John D McCalpin
  • 2,106
  • 16
  • 19
  • This is on Xeon processors, right? Do they keep the uncore clock high when a core clock drops? On "client" chips, I think the uncore drops, too (unless you have another thread keeping all cores + uncore clocked high). IIRC, performance drops for a pure-load scan through memory (with an asm loop) were worse than 1% on i7-6700k Skylake (with hardware P-state). I forget exactly what I benchmarked, though, whether it was AVX, or strided scalar loads, or what. – Peter Cordes Aug 15 '20 at 16:55
  • Updated my answer with NASM test code, and results from i7-6700k (SKL client). An artificial test-case can reproduce the effect even when all stores hit in L1d cache, looping over a 16k buffer! So SKL isn't just checking IPC, because this happens at 3.33 IPC (2.48 uops / clock). Also, hardware P-states isn't just turbo, it's lowering the clock below the normal "stock" speed. – Peter Cordes Aug 15 '20 at 21:39
  • @PeterCordes My observations on "Energy Efficient Turbo" are from Xeon E5 processors (starting with v3). For high-bandwidth workloads the uncore frequency was automatically kept at the max, even if the cores slowed down. This is the right behavior for everything except single-threaded latency tests -- they need high frequency, but got low uncore frequency because the uncore traffic was so low. – John D McCalpin Aug 16 '20 at 19:57