
Consider a Node.js CPU-bound program that generates prime numbers:

// generatePrimes.js
// long running / CPU-bound calculation

function generatePrimes(start, range) {
  
  const primes = []
  let isPrime = true
  let end = start + range
  
  for (let i = start; i < end; i++) {
    for (let j = start; j < Math.sqrt(end); j++) {
      if (i !== j && i%j === 0) {
        isPrime = false
        break
      }
    }
    if (isPrime) {
      primes.push(i)
    }
    isPrime = true
  }

  return primes
}


function main() {

  const min = 2
  const max = 1e7

  console.log( generatePrimes(min, max) )

}  


if (require.main === module) 
  main()

module.exports = { generatePrimes }

My HW/OS configuration: a laptop running the Linux Ubuntu 20.04.2 LTS desktop environment, with 8 logical CPUs:

$ inxi -C -M
Machine:   Type: Laptop System: HP product: HP Laptop 17-by1xxx v: Type1ProductConfigId serial: <superuser/root required> 
           Mobo: HP model: 8531 v: 17.16 serial: <superuser/root required> UEFI: Insyde v: F.32 date: 12/14/2018 
CPU:       Topology: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 700 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 700 2: 700 3: 700 4: 700 5: 700 6: 700 7: 700 8: 700     
$ echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
CPU threads: 8
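
Note that grep -c processor /proc/cpuinfo counts logical processors, not physical cores. As a cross-check (a hedged aside, assuming the lscpu utility from util-linux is available), the topology can be inspected with:

$ lscpu | grep -E 'Thread|Core|Socket'
$ lscpu -e=CPU,CORE

On an i7-8565U the first command should report 1 socket, 4 cores per socket and 2 threads per core; the second lists which pairs of logical CPUs share a physical core. This distinction becomes relevant in the comments below.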

Now, let's measure the elapsed time:

$ /usr/bin/time -f "%e" node generatePrimes.js 
[
    2,   3,   5,   7,  11,  13,  17,  19,  23,  29,  31,  37,
   41,  43,  47,  53,  59,  61,  67,  71,  73,  79,  83,  89,
   97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
  157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223,
  227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
  283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359,
  367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
  439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
  509, 521, 523, 541,
  ... 664479 more items
]
7.99

OK, running the program once, the elapsed time is ~8 seconds.


Now consider the bash script testGeneratePrimes.sh, which measures elapsed times when running the same program 6 times, first in sequence and then in parallel:

#!/bin/bash 

# Get number of runs, as command line argument
if [ $# -eq 0 ]
  then
    echo
    echo -e "  run n processes, in sequence and in parallel."
    echo
    echo -e "  usage:"
    echo -e "    $0 <number of runs>"
    echo
    echo -e "  examples:"
    echo -e "    $0 6"
    echo -e "    run 6 times"
    echo
    exit 1
fi

numRuns=$1

# run a single instance of the process 'node generatePrimes'
runProcess() {
  /usr/bin/time -f "%e" node generatePrimes.js > /dev/null
}

echo
echo "SEQUENCE TEST: running generatePrimes, $numRuns successive sequential times" 
echo

for i in $(seq $numRuns); do
  runProcess
done

echo
echo "PARALLEL TEST: running generatePrimes, $numRuns in parallel (background processes)" 
echo

for i in $(seq $numRuns); do
  runProcess &
done

wait < <(jobs -p)

Running the script (6 processes):

$ ./testGeneratePrimes.sh 6

SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.16
9.09
11.44
11.57
12.93
12.00

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

25.99
26.16
30.51
30.64
31.60
31.60

I see that:

  • in the sequence test, the elapsed time increases with each run, from ~8 seconds up to ~12 seconds?!
  • in the parallel test, the elapsed times increase from ~25 seconds to ~31 seconds?!

That's insane.

I do not understand why! Maybe it's a Linux scheduler limitation? A CPU hardware limitation/issue?

I also tried the commands:

  • nice -20 /usr/bin/time -f "%e" node generatePrimes.js
  • taskset -c 0-7 /usr/bin/time -f "%e" node generatePrimes.js

but without any significant difference in the described behavior.

Questions:

  1. Why do the elapsed times vary so much?
  2. Is there any way to configure Linux so that it does not limit per-process CPU usage?

BTW, a related question: running multiple Node.js worker threads: why such a large overhead/latency?

UPDATE (MORE TESTS)

Following Nate Eldredge's suggestion (see comments), here is some info gathered, and some settings applied, using cpupower and sensors:

$ sudo cpupower -c all info 
analyzing CPU 0:
perf-bias: 6
analyzing CPU 1:
perf-bias: 6
analyzing CPU 2:
perf-bias: 6
analyzing CPU 3:
perf-bias: 6
analyzing CPU 4:
perf-bias: 6
analyzing CPU 5:
perf-bias: 6
analyzing CPU 6:
perf-bias: 6
analyzing CPU 7:
perf-bias: 6

# set perf-bias to max performance
$ sudo cpupower set -b 0

$ sudo cpupower -c all info -b
analyzing CPU 0:
perf-bias: 0
analyzing CPU 1:
perf-bias: 0
analyzing CPU 2:
perf-bias: 0
analyzing CPU 3:
perf-bias: 0
analyzing CPU 4:
perf-bias: 0
analyzing CPU 5:
perf-bias: 0
analyzing CPU 6:
perf-bias: 0
analyzing CPU 7:
perf-bias: 0
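
A further knob, not tested here (a hedged suggestion: with the intel_pstate driver only the powersave and performance governors are available): the cpufreq scaling governor can also limit sustained clock speed, and can be forced to performance with:

# set the cpufreq scaling governor to "performance", then verify
$ sudo cpupower frequency-set -g performance
$ cpupower frequency-info | grep -i governor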
$ sudo cpupower monitor
    | Nehalem                   || Mperf              || Idle_Stats                                                   
 CPU| C3   | C6   | PC3  | PC6   || C0   | Cx   | Freq  || POLL | C1   | C1E  | C3   | C6   | C7s  | C8   | C9   | C10   
   0|  0,04|  2,53|  0,00|  0,00||  4,23| 95,77|   688||  0,00|  0,00|  0,08|  0,10|  2,38|  0,00| 24,65|  0,78| 67,90
   4|  0,04|  2,53|  0,00|  0,00||  3,68| 96,32|   675||  0,00|  0,00|  0,02|  0,06|  0,76|  0,00| 10,83|  0,03| 84,75
   1|  0,03|  1,94|  0,00|  0,00||  6,39| 93,61|   656||  0,00|  0,00|  0,04|  0,06|  1,88|  0,00| 16,19|  0,00| 75,63
   5|  0,04|  1,94|  0,00|  0,00||  1,35| 98,65|   689||  0,00|  0,02|  1,19|  0,04|  0,40|  0,33|  3,89|  0,82| 92,02
   2|  0,56| 25,49|  0,00|  0,00|| 12,88| 87,12|   673||  0,00|  0,00|  0,84|  0,74| 28,61|  0,03| 34,48|  3,44| 19,81
   6|  0,56| 25,48|  0,00|  0,00||  4,30| 95,70|   676||  0,00|  0,00|  0,03|  0,09|  1,48|  0,00| 22,66|  1,11| 70,52
   3|  0,19|  3,61|  0,00|  0,00||  3,67| 96,33|   658||  0,00|  0,00|  0,02|  0,07|  1,36|  0,00| 14,85|  0,03| 80,16
   7|  0,19|  3,60|  0,00|  0,00||  6,21| 93,79|   679||  0,00|  0,00|  0,28|  0,19|  3,48|  0,76| 31,10|  1,50| 56,75

$ sudo cpupower monitor ./testGeneratePrimes.sh 6
[sudo] password for giorgio: 

SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.18
9.06
11.66
11.29
11.30
11.21

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

20.83
20.95
28.42
28.42
28.47
28.52
./testGeneratePrimes.sh took 91,26958 seconds and exited with status 0
    | Nehalem                   || Mperf              || Idle_Stats                                                   
 CPU| C3   | C6   | PC3  | PC6   || C0   | Cx   | Freq  || POLL | C1   | C1E  | C3   | C6   | C7s  | C8   | C9   | C10   
   0|  0,20|  1,98|  0,00|  0,00|| 33,88| 66,12|  2008||  0,00|  0,04|  0,18|  0,18|  1,83|  0,02| 16,67|  0,22| 47,04
   4|  0,20|  1,98|  0,00|  0,00|| 10,46| 89,54|  1787||  0,00|  0,09|  0,32|  0,37|  4,01|  0,03| 25,21|  0,19| 59,37
   1|  0,26|  2,40|  0,00|  0,00|| 24,52| 75,48|  1669||  0,00|  0,06|  0,18|  0,20|  2,17|  0,00| 14,90|  0,17| 57,84
   5|  0,26|  2,40|  0,00|  0,00|| 32,07| 67,93|  1662||  0,00|  0,07|  0,19|  0,14|  1,40|  0,02|  9,31|  0,53| 56,33
   2|  0,93| 13,33|  0,00|  0,00|| 31,31| 68,69|  2025||  0,00|  0,05|  0,43|  1,00| 18,21|  0,01| 26,18|  1,74| 21,23
   6|  0,93| 13,33|  0,00|  0,00|| 11,98| 88,02|  1711||  0,00|  0,19|  0,31|  0,22|  2,63|  0,03| 18,87|  0,76| 65,04
   3|  0,15|  0,98|  0,00|  0,00|| 47,38| 52,62|  2627||  0,00|  0,07|  0,17|  0,13|  1,35|  0,01|  7,88|  0,10| 42,80
   7|  0,15|  0,98|  0,00|  0,00|| 59,25| 40,75|  2235||  0,00|  0,06|  0,18|  0,18|  1,58|  0,00|  9,31|  0,63| 28,91


$ sensors && ./testGeneratePrimes.sh 6 && sensors
coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +46.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +43.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +45.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +44.0°C  (high = +100.0°C, crit = +100.0°C)

BAT0-acpi-0
Adapter: ACPI interface
in0:          12.89 V  
curr1:         0.00 A  

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       65.49 V  
edge:        +511.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:        1.07 kW (cap =  30.00 W)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +42.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +47.0°C  (crit = +120.0°C)
temp2:        +53.0°C  (crit = +127.0°C)


SEQUENCE TEST: running generatePrimes, 6 successive sequential times

8.36
9.76
11.35
11.38
11.22
11.24

PARALLEL TEST: running generatePrimes, 6 in parallel (background processes)

21.06
21.14
28.50
28.55
28.62
28.65

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +54.0°C  (high = +100.0°C, crit = +100.0°C)
Core 0:        +51.0°C  (high = +100.0°C, crit = +100.0°C)
Core 1:        +50.0°C  (high = +100.0°C, crit = +100.0°C)
Core 2:        +54.0°C  (high = +100.0°C, crit = +100.0°C)
Core 3:        +50.0°C  (high = +100.0°C, crit = +100.0°C)

BAT0-acpi-0
Adapter: ACPI interface
in0:          12.89 V  
curr1:         0.00 A  

amdgpu-pci-0100
Adapter: PCI adapter
vddgfx:       65.49 V  
edge:        +511.0°C  (crit = +104000.0°C, hyst = -273.1°C)
power1:        1.07 kW (cap =  30.00 W)

pch_cannonlake-virtual-0
Adapter: Virtual device
temp1:        +46.0°C  

acpitz-acpi-0
Adapter: ACPI interface
temp1:        +55.0°C  (crit = +120.0°C)
temp2:        +57.0°C  (crit = +127.0°C)

  • I can't replicate those results (unless I switch away from that window and start using the browser, for example): 10.64 10.68 10.42 10.51 10.59 10.46 10.48 10.65 10.49 10.47 The variance in this run is IMHO marginal and can be easily explained by a multi-user, multi-tasking OS ... Mind you, I also redirected the output of the numbers to `/dev/null`, so another possibility is that the terminal window/scroll-buffer handling makes the difference. – tink Apr 22 '21 at 16:17
  • Yes, your values are reasonable/expected. In my case, instead, I'm starting to think the problem is not in the software or the operating system, but is maybe related to hardware: I fear it is a CPU/BIOS "power saving" setting that does not enable full core usage. Does that make sense? Unfortunately, entering the BIOS config, I can't tune any related parameter. – Giorgio Robino Apr 22 '21 at 16:35
  • Just out of curiosity: what does the timing on your machine do if you **don't** actually output the numbers to screen? `/usr/bin/time -f "%e" nodejs generatePrimes.js >/dev/null` – tink Apr 22 '21 at 16:38
  • Tried. The difference is minimal/imperceptible, e.g. 8.35 with redirection to /dev/null and 8.30 when keeping the stdout print. I think stdio doesn't make a real difference. – Giorgio Robino Apr 22 '21 at 17:45
  • Even if you do it for all instances in the script? For a single execution screen handling/scroll-back buffers might be negligible ... – tink Apr 22 '21 at 18:08
  • No problem. I modified the script to avoid stdout. I reported the updated test results. – Giorgio Robino Apr 22 '21 at 18:30
  • Cool - and pity that my suggestion didn't have the desired impact. Back to the drawing board. But I still think it's not Linux per se doing this. – tink Apr 22 '21 at 18:45
  • Yes, I tentatively agree that it is not a Linux problem. I just tested the script on a virtual machine (with 2 vCPUs) and I got pretty much the same CONSTANT elapsed times (~15 secs) in all cases! That makes me think of a HW limitation of my current HP laptop ( :( ), probably due to some power-saving setting; unfortunately I can't tune it in the BIOS... – Giorgio Robino Apr 22 '21 at 18:48
  • Uh-huh ... funny little aside; I just ran the test on my workstation at work, too, and the results are almost identical to the one at home. `10.94 10.93 10.93 10.91 10.90 10.91 10.92 10.91 10.91 10.92` ... despite completely different CPU types. At home I have an 8-core `Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz`; the work machine is a 6-core `Intel(R) Xeon(R) W-2133 CPU @ 3.60GHz` – tink Apr 22 '21 at 18:55
  • On my end, everything runs at around 7s: ```$ sudo inxi -C -M Machine: Type: Desktop System: Gigabyte product: Z370P D3 v: N/A serial: N/A Mobo: Gigabyte model: Z370P D3-CF v: x.x serial: N/A UEFI [Legacy]: American Megatrends v: F14 date: 08/13/2019 CPU: Info: 8-Core model: Intel Core i7-9700K bits: 64 type: MCP L2 cache: 12 MiB Speed: 800 MHz min/max: 800/4900 MHz Core speeds (MHz): 1: 800 2: 800 3: 800 4: 800 5: 800 6: 800 7: 800 8: 800 node --version v12.18.3 ``` – Luis Lobo Borobia Apr 22 '21 at 19:01
  • Thx Luis, tink. You are both running the last tests on desktop machines, right? Your tests confirm the evidence that the issue is probably related to some CPU "power saving" setting of my laptop. But I can't see how to avoid this power saving :( – Giorgio Robino Apr 22 '21 at 19:07
  • Can you log CPU frequency and temperature several times a second while the test runs? One possibility is that after the first few seconds of running, the CPU gets too hot and has to reduce the clock speed to cool down. – Nate Eldredge Apr 22 '21 at 19:07
  • Hmmm, that makes sense. Do you know how I can measure, on the Linux CLI, CPU frequency/temperature while the program is running? (see the sampling sketch after this thread) Anyway, if your thesis is true, I guess there is no solution. But that's weird: I have 8 cores but I can't really use them... – Giorgio Robino Apr 22 '21 at 19:09
  • @GiorgioRobino: For temperature, the `lm-sensors` package. For CPU frequency, `/proc/cpuinfo` or `cpupower`. – Nate Eldredge Apr 22 '21 at 21:02
  • @NateEldredge thx. I updated the question, adding some tests using the cpupower and sensors commands. Using cpupower I set max performance; nevertheless, I didn't get the expected improvement :( – Giorgio Robino Apr 23 '21 at 08:01
  • Oh! Your machine has not 8 cores but 4. So presumably you have 8 logical processors due to [hyperthreading](https://stackoverflow.com/questions/680684/multi-cpu-multi-core-and-hyper-thread). This would explain the parallel numbers pretty well: two of the jobs get their own cores so they are fast. The other four are run on the other two cores, two to a core, and are slower because they have to share their resources. Doesn't help explain the sequential test though. – Nate Eldredge Apr 23 '21 at 13:57
  • As far as I know, from the programming perspective, we count the cores with, for example: `echo "CPU threads: $(grep -c processor /proc/cpuinfo)"` So in my case I have 8 cores, and the reason I run 6 processes (in parallel) is just to avoid "saturating" the cores, hopefully avoiding any OS queueing/saturation limit. A common-sense/safe rule is to limit parallel multithreading to num_cores - 1 (reasonable, but maybe not a golden rule). Am I wrong? – Giorgio Robino Apr 23 '21 at 16:05
  • The point of hyperthreading is that you *can* program as if you truly have 8 physical cores, but you won't get the same performance as if they were all truly independent cores. It's a compromise - as you can see, the jobs that are shared on a physical core run slower, but not twice as slow (which is what you would see without hyperthreading if they had to be scheduled in alternation). – Nate Eldredge Apr 23 '21 at 16:15
  • `/proc/cpuinfo` lists *logical* processors; you can use the other fields to see how they are distributed across physical cores. See https://unix.stackexchange.com/questions/57920/how-do-i-know-which-processors-are-physical-cores – Nate Eldredge Apr 23 '21 at 16:15
  • Ok, I have 4 real cores. Now, if in `testGeneratePrimes.sh` I limit the number of processes (threads) to 4, avoiding any hyperthreading ambiguity, I still get: 1. sequential: 8.15 9.15 11.52 11.36 2. parallel: 20.06 20.12 20.14 20.21 I think nothing changed. I suspect a "CPU power saving" mechanism inhibits full core usage. – Giorgio Robino Apr 23 '21 at 16:55
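
Following up on Nate Eldredge's suggestion above, here is a minimal shell sketch (an untested illustration; it assumes the lm-sensors package used in the earlier runs is installed) to sample the highest core frequency and the package temperature twice a second while the test runs in another terminal:

# poll CPU frequency and package temperature every 0.5 s (Ctrl-C to stop)
$ while sleep 0.5; do
    grep 'cpu MHz' /proc/cpuinfo | sort -t: -k2 -rn | head -1
    sensors | grep 'Package id 0'
  done

If the MHz value collapses after the first few seconds while the temperature climbs, thermal or power-limit throttling is the likely culprit.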
