
I'm experimenting with Node.js worker threads, and I'm seeing weird elapsed times.

  • I have a main.js that spawns a rapid sequence of worker threads.
  • Each worker.js executes a CPU-bound computation (generating prime numbers).

BTW, the generatePrimes() JavaScript function is just a demo example of a CPU-bound calculation. In my real case, the worker thread is a Node.js program that binds a C++ library (doing speech recognition, taking about half a second of elapsed time at 100% CPU).

  • my laptop runs an Ubuntu 20.04.2 LTS desktop environment and has 8 logical cores:

    $ inxi -C -M
    Machine:   Type: Laptop System: HP product: HP Laptop 17-by1xxx v: Type1ProductConfigId serial: <superuser/root required> 
               Mobo: HP model: 8531 v: 17.16 serial: <superuser/root required> UEFI: Insyde v: F.32 date: 12/14/2018 
    CPU:       Topology: Quad Core model: Intel Core i7-8565U bits: 64 type: MT MCP L2 cache: 8192 KiB Speed: 700 MHz min/max: 400/4600 MHz Core speeds (MHz): 1: 700 2: 700 3: 700 4: 700 5: 700 6: 700 7: 700 8: 700     
    
    $ echo "CPU threads: $(grep -c processor /proc/cpuinfo)"
    CPU threads: 8
    
  • I measured a total elapsed time of ~8 seconds when worker.js runs independently (single thread), calling generatePrimes(2, 1e7)

  • QUESTION

    When I spawn multiple threads, e.g. 6 threads, almost in parallel (see code below), I expected a total elapsed time of ~8 seconds again (maybe with a small overhead), independently of the number of spawned threads (don't they run in parallel, since there are enough CPU cores?). Instead, the total elapsed time is much bigger than the foreseen ~8 seconds: I measure more than ~20 seconds?! Why?

Below are the source files and some elapsed-time measurements using time/pidstat:


main.js

// main.js
const { Worker } = require('worker_threads')

function runThread(workerData) {

  return new Promise((resolve, reject) => {

    const worker = new Worker('./worker.js', { workerData })

    worker.on('message', resolve)
    worker.on('error', reject)
    worker.on('exit', (code) => {
      if (code !== 0)
        reject(new Error(`Worker stopped with exit code ${code}`))
    })

  })

}


async function main() {

  const numThreads = + process.argv[2]

  if ( !numThreads || numThreads < 1 ) {
    console.error(`usage: ${process.argv[1]} number_of_threads`)
    process.exit()
  }

  const min = 2
  const max = 1e7

  //
  // run multiple threads, in "parallel":
  //
  // It simulates a rapid spawn ("parallel") of a specific number of worker-thread computations.
  // The main thread runs the same worker thread numThreads times.
  //
  // The data result of each thread's elaboration is just "done".
  //
  for (let i = 0; i < numThreads; i++)
    setImmediate( async () => { console.log( await runThread({min, max}) ) } )
}

if (require.main === module)
  main()

module.exports = { runThread }
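As a side note (my own sketch, not part of the original measurements): the loop above fires setImmediate() callbacks without awaiting them, so the total elapsed time can only be read from /usr/bin/time outside the process. A small helper like the following awaits all jobs together and reports the total from inside main.js; in the code above, `runAllTimed(runThread, numThreads, { min, max })` could replace the loop.

```javascript
// A promise-returning job runner: instead of fire-and-forget
// setImmediate() calls, spawn all jobs up front, await them together,
// and measure the total elapsed time in-process.
async function runAllTimed(runJob, numJobs, payload) {
  const t0 = process.hrtime.bigint()

  // start all jobs at once, then wait for every one to finish
  const results = await Promise.all(
    Array.from({ length: numJobs }, () => runJob(payload))
  )

  const elapsedMs = Number(process.hrtime.bigint() - t0) / 1e6
  console.log(`${numJobs} jobs done in ${elapsedMs.toFixed(0)} ms`)
  return results
}

module.exports = { runAllTimed }
```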
    

worker.js

// worker.js
const { threadId, workerData, parentPort } = require('worker_threads')
const { generatePrimes } = require('./generatePrimes')

// take parameters from the main/parent thread
const { min, max } = workerData

// synchronous long-running CPU-bound computation
const primes = generatePrimes(min, max)

// communicate the result to the main thread;
// to avoid any suspicion that elapsed times depend on a large amount of data exchange (the primes array, in this case),
// the returned data is just a short string.
parentPort.postMessage( `Done. Thread id: ${threadId}` )

generatePrimes.js

// generatePrimes.js
// long running / CPU-bound calculation

function generatePrimes(start, range) {
  
  const primes = []
  let isPrime = true
  let end = start + range
  
  for (let i = start; i < end; i++) {
    for (let j = start; j < Math.sqrt(end); j++) {
      if (i !== j && i%j === 0) {
        isPrime = false
        break
      }
    }
    if (isPrime) {
      primes.push(i)
    }
    isPrime = true
  }

  return primes
}


function main() {

  const min = 2
  const max = 1e7

  console.log( generatePrimes(min, max) )

}  


if (require.main === module) 
  main()

module.exports = { generatePrimes }
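A back-of-the-envelope estimate of mine (not from the original post) of why a single run takes seconds: the inner loop in generatePrimes performs at most √end trial divisions for each of the `range` candidates (composites break out earlier), so the total work is bounded by range · √end.

```javascript
// Rough upper bound on the work done by generatePrimes(2, 1e7):
// for each of `range` candidates the inner loop performs at most
// sqrt(end) trial divisions (composites break out earlier).
const start = 2
const range = 1e7
const end = start + range

const innerBound = Math.sqrt(end)          // ~3163 divisions per candidate, worst case
const upperBoundOps = range * innerBound   // on the order of 1e10 divisions

console.log(upperBoundOps.toExponential(1))
```

Tens of billions of modulo operations in the worst case is consistent with a multi-second single-threaded run, so the single-thread baseline of ~8 seconds is plausible on its own.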

TESTS

  • TEST 1: WITHOUT WORKER THREADS -> elapsed: ~8 seconds
  • TEST 2: SPAWN NR. 1 THREAD -> elapsed: ~8 seconds
  • TEST 3: SPAWN NR. 6 THREADS -> elapsed: ~21 seconds

TEST 1: WITHOUT WORKER THREADS

generatePrimes.js standalone -> elapsed: ~8 seconds

$ /usr/bin/time -f "%E" pidstat 1 -u -e node generatePrimes
Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)

09:19:05      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
09:19:06     1000    247776   98,02    0,00    0,00    0,00   98,02     5  node
09:19:07     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:08     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:09     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:10     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:11     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:12     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
09:19:13     1000    247776  100,00    0,00    0,00    0,00  100,00     5  node
[
    2,   3,   5,   7,  11,  13,  17,  19,  23,  29,  31,  37,
   41,  43,  47,  53,  59,  61,  67,  71,  73,  79,  83,  89,
   97, 101, 103, 107, 109, 113, 127, 131, 137, 139, 149, 151,
  157, 163, 167, 173, 179, 181, 191, 193, 197, 199, 211, 223,
  227, 229, 233, 239, 241, 251, 257, 263, 269, 271, 277, 281,
  283, 293, 307, 311, 313, 317, 331, 337, 347, 349, 353, 359,
  367, 373, 379, 383, 389, 397, 401, 409, 419, 421, 431, 433,
  439, 443, 449, 457, 461, 463, 467, 479, 487, 491, 499, 503,
  509, 521, 523, 541,
  ... 664479 more items
]

Average:     1000    247776   99,75    0,00    0,00    0,00   99,75     -  node
0:08.60

TEST 2: SPAWN NR. 1 THREAD

main.js spawns 1 thread -> elapsed: ~8 seconds (again)

$ /usr/bin/time -f "%E" pidstat 1 -u -e node main 1
Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)

your machine has 8 cores.


09:21:01      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
09:21:02     1000    247867   95,00    2,00    0,00    0,00   97,00     3  node
09:21:03     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:04     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:05     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:06     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:07     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:08     1000    247867  100,00    0,00    0,00    0,00  100,00     3  node
09:21:09     1000    247867  100,00    1,00    0,00    0,00  101,00     3  node
Done. Thread id: 1

Average:     1000    247867   99,38    0,38    0,00    0,00   99,75     -  node
0:08.50

TEST 3: SPAWN NR. 6 THREADS

main.js spawns 6 threads -> elapsed: ~21 seconds

$ /usr/bin/time -f "%E" pidstat 1 -u -e node main 6
Linux 5.8.0-50-generic (giorgio-HP-Laptop-17-by1xxx)    22/04/2021  _x86_64_    (8 CPU)

your machine has 8 cores.


09:23:38      UID       PID    %usr %system  %guest   %wait    %CPU   CPU  Command
09:23:39     1000    247946  554,00    1,00    0,00    0,00  555,00     0  node
09:23:40     1000    247946  599,00    1,00    0,00    0,00  600,00     0  node
09:23:41     1000    247946  600,00    1,00    0,00    0,00  601,00     0  node
09:23:42     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
09:23:43     1000    247946  599,00    1,00    0,00    0,00  600,00     0  node
09:23:44     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
09:23:45     1000    247946  600,00    0,00    0,00    0,00  600,00     0  node
09:23:46     1000    247946  599,00    2,00    0,00    0,00  601,00     0  node
09:23:47     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
09:23:48     1000    247946  599,00    0,00    0,00    0,00  599,00     0  node
09:23:49     1000    247946  600,00    1,00    0,00    0,00  601,00     0  node
09:23:50     1000    247946  598,00    1,00    0,00    0,00  599,00     0  node
09:23:51     1000    247946  599,00    2,00    0,00    0,00  601,00     0  node
Done. Thread id: 1
Done. Thread id: 4
09:23:52     1000    247946  430,00    0,00    0,00    0,00  430,00     0  node
09:23:53     1000    247946  398,00    0,00    0,00    0,00  398,00     0  node
09:23:54     1000    247946  399,00    1,00    0,00    0,00  400,00     0  node
09:23:55     1000    247946  398,00    0,00    0,00    0,00  398,00     0  node
09:23:56     1000    247946  399,00    0,00    0,00    0,00  399,00     0  node
09:23:57     1000    247946  396,00    3,00    0,00    0,00  399,00     0  node
09:23:58     1000    247946  399,00    0,00    0,00    0,00  399,00     0  node
Done. Thread id: 5
Done. Thread id: 6
09:23:59     1000    247946  399,00    1,00    0,00    0,00  400,00     7  node
Done. Thread id: 2
Done. Thread id: 3

Average:     1000    247946  522,00    0,71    0,00    0,00  522,71     -  node
0:21.05

Why do I get ~21 seconds instead of the expected ~8 seconds? Where am I wrong?

UPDATE

  • I moved the CPU-bound function generatePrimes into a separate module, just for clarity.

  • I added more elapsed-time tests, incrementing the number of threads from 1 to 9. The tests show that the elapsed time increases with the number of spawned threads. That makes no sense to me :(

$ /usr/bin/time -f "%E" node main 1

your machine has 8 cores.

Done. Thread id: 1
0:08.86
$ /usr/bin/time -f "%E" node main 2

your machine has 8 cores.

Done. Thread id: 2
Done. Thread id: 1
0:13.96
$ /usr/bin/time -f "%E" node main 3

your machine has 8 cores.

Done. Thread id: 2
Done. Thread id: 1
Done. Thread id: 3
0:16.71
$ /usr/bin/time -f "%E" node main 4

your machine has 8 cores.

Done. Thread id: 3
Done. Thread id: 2
Done. Thread id: 4
Done. Thread id: 1
0:21.87
$ /usr/bin/time -f "%E" node main 5

your machine has 8 cores.

Done. Thread id: 3
Done. Thread id: 2
Done. Thread id: 5
Done. Thread id: 1
Done. Thread id: 4
0:22.20
$ /usr/bin/time -f "%E" node main 6

your machine has 8 cores.

Done. Thread id: 3
Done. Thread id: 4
Done. Thread id: 6
Done. Thread id: 2
Done. Thread id: 5
Done. Thread id: 1
0:23.74
$ /usr/bin/time -f "%E" node main 7

your machine has 8 cores.

Done. Thread id: 3
Done. Thread id: 4
Done. Thread id: 7
Done. Thread id: 2
Done. Thread id: 5
Done. Thread id: 1
Done. Thread id: 6
0:32.00
$ /usr/bin/time -f "%E" node main 8

your machine has 8 cores.

Done. Thread id: 6
Done. Thread id: 3
Done. Thread id: 2
Done. Thread id: 5
Done. Thread id: 1
Done. Thread id: 8
Done. Thread id: 7
Done. Thread id: 4
0:35.92
$ /usr/bin/time -f "%E" node main 9

your machine has 8 cores.

warning: number of requested threads (9) is higher than number of available cores (8)
Done. Thread id: 8
Done. Thread id: 4
Done. Thread id: 6
Done. Thread id: 9
Done. Thread id: 2
Done. Thread id: 3
Done. Thread id: 7
Done. Thread id: 5
Done. Thread id: 1
0:40.27

BTW, related question: Why program execution time differs running the same program multiple times?

  • Those 4-core Intel chips only have four real cores. Intel marketroids claim 8 cores because each core can rapidly context-switch between two different threads. But your CPU-intensive workload can't exploit that context-switch capability very well. Plus, if your signal-processing workload is multithreaded, running multiple instances of it will definitely use up your CPU resources. – O. Jones Apr 24 '21 at 12:11
  • Right, but even if I run just 4 threads, considering the master thread + 3 child threads ( `/usr/bin/time -f "%E" node main 3`), the overall elapsed time is >11 seconds, whereas if I run a single thread I get ~8 seconds. This seems to demonstrate that the issue is not related to hyper-threading. Isn't it? – Giorgio Robino Apr 24 '21 at 12:18
  • You are correct about the hyperthreading. – O. Jones Apr 24 '21 at 13:00
  • Please don't create new tags that don't add important information to your question and will possibly be used for off-topic questions. All the relevant information about your hardware is already in the question. This tag is useless – Tomerikoo Apr 26 '21 at 09:03
  • @Tomerikoo I partially agree, because IMO the "laptop" tag you removed is probably the key to the solution of the described problem. That is to be solved/verified in possible answers. – Giorgio Robino Apr 26 '21 at 10:24
  • What if you had the exact same specification on a desktop? The fact that you're using a laptop is not really relevant to the question and as a tag it simply has no place here and will only invite many off-topic questions (like "why is my laptop not turning on?") – Tomerikoo Apr 26 '21 at 10:27
  • I disagree, because I guess the problem may be related to laptop motherboard CPU power-saving firmware logic. As soon as possible I'll test on a desktop/virtual machine and report back. – Giorgio Robino Apr 26 '21 at 10:42
  • @GiorgioRobino I'm facing the same issue, did you figure out the reason for this? – Youssef Mohamed Jan 21 '22 at 03:17

1 Answer


You're running up against a practical limit known as Amdahl's law. If you have twice the processors, you ordinarily don't get twice the computing throughput. Why not?

Several reasons, typically very hard to tease out and measure separately, including:

  1. It takes time to spin up and tear down parallel processes (Javascript workers in this case).
  2. The processors have to coordinate among each other, which takes context-switching and interprocessor communication.
  3. The processors share resources such as bus, RAM, and SSD/hard drive access. One processor must sometimes wait while another uses those resources.
  4. (I'm not completely sure about this): it takes the V8 JavaScript engine some time in each worker to detect and then just-in-time compile the "hot path" in the code. While that's happening, the worker's operations are memory-intensive. Afterward, code like yours can probably run mostly in processor registers, which does come close to thwarting Amdahl's law.
  5. There's a lot of stuff running behind the scenes on a laptop.

And laptops don't have the power-hungry RAM and bus structures of monster multicore servers, so the contention is worse. They're designed more for desktop use cases, where various user-interface processes share the cores and hyperthreads.

If you have one worker thread per real core, your main nodejs process has to share with your threads too. As does your Xorg server, your file system, and all those dozens of really useful daemon processes on Focal Fossa.

If this is a critical issue for your capacity planning, spend a few tens of euros / dollars and rent a big fat 24- or 32-core server on one of the cloud vendors to run your experiments. With your real signal-processing workload. That's a more useful test. If you rent the most cores they offer, you'll likely get the whole machine and not share it with other customers.

Don't waste your time trying to understand the money- and power- saving shortcuts and sleazy hardware hacks in your laptop motherboard.

(This old-timer once worked in field software support for a computer company. I had to explain Amdahl's law to the sales people, over and over, so they wouldn't oversell the company's ridiculously expensive new parallel-processing products. They still did oversell them. It took some big customers demanding their money back to teach them.)

O. Jones
  • thx @O.Jones for the detailed answer. 2. and 3.: true, but please note that the thread elaboration (prime-number calculation) is "pure" CPU-bound computation. No I/O except the returned big array, but I verified (see above) that the data impact is minimal in terms of elapsed times. 1. and 5.: true, I have to investigate how much time V8 requires to spin up, but these times can NOT be many seconds. – Giorgio Robino Apr 24 '21 at 15:31
  • I agree with your suggestion to run on a virtual server or bare-metal host! I tested on a VM with 2 cores, and measured pretty CONSTANT elapsed times! See also my related question: https://stackoverflow.com/questions/67211241/why-program-execution-time-differs-running-the-same-program-multiple-times My suspicion is that the issue arises from my sad HP laptop motherboard's CPU power-saving settings, which seem to prevent temperature rising or something (no BIOS setting possible either). That behavior, if confirmed, is really bad for me, because, on demand, I want to use all my laptop's CPU power! – Giorgio Robino Apr 24 '21 at 15:36
  • Yeah, well, laptop vendors exaggerate the capabilities of their products all. the. time. That's why gamers buy ridiculous desktops with water-cooling and all that sort of thing. If you can load-balance this stuff between machines, a few cheap Raspberry Pi machines will probably be very fast with one worker each. We could buy 20,000 of them for the refund we gave one oversold customer back in the day. – O. Jones Apr 24 '21 at 15:46
  • Well, my final application needs to use Node.js worker threads to wrap a CPU-bound speech-recognition engine transcription, to build a multiuser architecture. The target host will preferably be a multicore Linux server virtual machine. Nevertheless, what makes me crazy is that I see the described behavior on my "8-core"... laptop :( – Giorgio Robino Apr 24 '21 at 16:04