I have a question about the architecture and performance of Node.js.
I've done a bunch of reading on the topic (including on Stack Overflow), and I still have a couple of questions. I'd like to do 2 things:
- Summarize, semi-concisely, what I've learned from crawling many different sources, to see if my conclusions are correct.
- Ask a couple of questions about the threading and performance of Node that I haven't been able to pin down exact answers to in my research.
Node has a Single-Threaded, Asynchronous Event-Handling Architecture
Single-Threaded - There is a single event thread that dispatches asynchronous work (typically I/O, but it can be computation) and executes callbacks (i.e. handles the results of async work).
The event thread runs in an infinite "event loop" doing the 2 jobs above: a) handling requests by dispatching async work, and b) noticing that previously dispatched async work has completed and executing a callback to process the results.
The common analogy here is of the restaurant order taker: the event thread is a super-fast waiter that takes orders (services requests) from the dining room and delivers the orders to the kitchen to be prepared (dispatches async work), but also notices when food is ready (async results) and delivers it back to the table (callback execution).
The waiter doesn't cook any food; his job is to go back and forth between the dining room and the kitchen as quickly as possible. If he gets bogged down taking an order in the dining room, or if he is forced to go back into the kitchen to prepare one of the meals, the system becomes inefficient and throughput suffers.
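To make sure I have that part right, here's a tiny sketch of what I understand the "waiter stuck cooking" failure mode to look like (the 2-second busy loop is just an arbitrary stand-in for CPU-heavy work done on the event thread):

```js
// While this synchronous loop runs, the single event thread can't service
// anything else: no new requests, no completed-I/O callbacks, no timers.
setTimeout(() => console.log('timer callback'), 0);

const start = Date.now();
while (Date.now() - start < 2000) {
  // busy-wait: CPU-heavy "kitchen work" done on the event thread itself
}
console.log('blocking work done'); // prints first
// only after the synchronous code finishes does the 0 ms timer's callback run
```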
Asynchronous - The asynchronous workflow resulting from a request (e.g. a web request) is logically a chain, e.g.:
FIRST [ASYNC: read a file, figure out what to get from the database] THEN
[ASYNC: query the database] THEN
[format and return the result].
The work labeled "ASYNC" above is "kitchen work", and the "FIRST []" and "THEN []" represent the waiter's involvement in dispatching the work and initiating a callback when it completes.
Chains like this are represented programmatically in 3 common ways:
nested functions/callbacks
promises chained with .then()
async functions that await async results.
All three coding approaches are essentially equivalent, although async/await appears to be the cleanest and makes reasoning about asynchronous code easier.
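To make the chain concrete, here's roughly the same workflow written all three ways (just a sketch to check my understanding; `queryDb`, `queryDbCb`, `formatResult`, `sendResponse`, and `handleError` are stand-ins I made up, and `config.json` is an arbitrary file name):

```js
const fs = require('fs');
const fsp = fs.promises;

// Made-up stand-ins for the database and response-formatting steps
const queryDbCb = (q, cb) => setImmediate(() => cb(null, [{ query: q, rows: 3 }])); // callback style
const queryDb = (q) => Promise.resolve([{ query: q, rows: 3 }]);                    // promise style
const formatResult = (rows) => JSON.stringify(rows);
const sendResponse = (body) => console.log(body);
const handleError = (err) => console.error(err);

// 1) nested functions/callbacks
fs.readFile('config.json', 'utf8', (err, text) => {
  if (err) return handleError(err);
  queryDbCb(text.trim(), (err, rows) => {
    if (err) return handleError(err);
    sendResponse(formatResult(rows));
  });
});

// 2) promises chained with .then()
fsp.readFile('config.json', 'utf8')
  .then((text) => queryDb(text.trim()))
  .then((rows) => sendResponse(formatResult(rows)))
  .catch(handleError);

// 3) async/await
async function handleRequest() {
  try {
    const text = await fsp.readFile('config.json', 'utf8');
    const rows = await queryDb(text.trim());
    sendResponse(formatResult(rows));
  } catch (err) {
    handleError(err);
  }
}
handleRequest();
```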
This is my mental picture of what's going on...is it correct? Comments very much appreciated!
Questions
My questions concern the use of OS-supported asynchronous operations, who actually does the asynchronous work, and the ways in which this architecture is more performant than the "spawn a thread per request" (i.e. multiple cooks) architecture:
Node libraries have been designed to be asynchronous by making use of the cross-platform async library libuv, correct? Is the idea here that libuv presents Node (on all platforms) with a consistent async I/O interface, but then uses platform-dependent async I/O operations under the hood? In the case where the I/O request goes "all the way down" to an OS-supported async operation, who is "doing the work" of waiting for the I/O to return and triggering Node? Is it the kernel, using a kernel thread? If not, who? In any case, how many requests can this entity handle?
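For instance (just a sketch; the port number is arbitrary), my understanding is that a server like this never ties up a thread per socket, because each socket is just a handle the event loop watches via the kernel's readiness mechanism:

```js
const net = require('net');

// Each connection is a handle registered with libuv's event loop.
// No thread blocks per socket: the kernel-level mechanism (epoll, kqueue,
// IOCP, etc.) tells the loop which handles are ready, and the loop then
// runs the corresponding callbacks on the single event thread.
const server = net.createServer((socket) => {
  socket.on('data', (chunk) => socket.write(chunk)); // simple echo
});

server.listen(3000, () => console.log('echo server listening on :3000'));
```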
I've read that libuv also makes use of a thread pool (typically pthreads, one per core?) internally. Is this to "wrap" operations that do not "go all the way down" as async, so that a thread can sit and wait on the synchronous operation while libuv still presents an async API?
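Concretely, I believe these are the kinds of calls that end up on that pool (a sketch; my understanding is that the default pool size is 4 and can be changed with the UV_THREADPOOL_SIZE environment variable):

```js
const fs = require('fs');
const crypto = require('crypto');

// As far as I understand, these async APIs are backed by libuv's worker
// thread pool, because the underlying work is blocking or CPU-bound:
fs.readFile(__filename, () => console.log('file read done'));   // file I/O
crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512',
  (err, key) => console.log('pbkdf2 done'));                    // CPU-heavy hashing

// By contrast, plain sockets (net/http) are not run on the pool; they go
// through the event loop's readiness notifications as in the sketch above.
```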
With regard to performance, the usual illustration given to explain the performance boost a Node-like architecture can provide is this: picture the (presumably slower and fatter) thread-per-request approach. There's latency, CPU, and memory overhead to spawning a bunch of threads that just sit around waiting for I/O to complete (even if they're not busy-waiting) and then tearing them down, and Node largely makes this go away because it uses a single long-lived event thread to dispatch async I/O to the OS/kernel, right? But at the end of the day, SOMETHING is sleeping on a mutex and getting woken up when the I/O is ready... is the idea that it's far more efficient for the kernel to do this than for a userland thread to do it? And finally, what about the case where the request is handled by libuv's thread pool? That seems similar to the thread-per-request approach, except for the efficiency of using a pool (avoiding the bring-up and tear-down), but what happens when there are many requests and the pool has a backlog? Latency increases, and now you're doing worse than thread-per-request, right?
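To illustrate the backlog scenario I'm asking about (a rough sketch; the job count and iteration count are arbitrary): with the default pool of 4 threads, kicking off 8 heavy pbkdf2 jobs at once should finish in two "waves", with jobs 5-8 queued behind jobs 1-4 so their latency roughly doubles.

```js
const crypto = require('crypto');
const start = Date.now();

// With the default UV_THREADPOOL_SIZE of 4, these 8 jobs complete in two
// "waves": jobs 5-8 sit in libuv's queue until a pool thread frees up.
// (Running with UV_THREADPOOL_SIZE=8 should collapse them into one wave.)
for (let i = 1; i <= 8; i++) {
  crypto.pbkdf2('secret', 'salt', 500000, 64, 'sha512', () => {
    console.log(`job ${i} finished after ${Date.now() - start} ms`);
  });
}
```

Is that the right mental model for where the backlog (and the extra latency) shows up?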