I have a question about the architecture and performance of Node.js.
I've done a bunch of reading on the topic (including on Stack Overflow), and I still have a couple of questions. I'd like to do 2 things:
- Summarize, semi-concisely, what I've learned from crawling many different sources, to see if my conclusions are correct.
- Ask a couple of questions about the threading and performance of Node that I haven't been able to pin down exact answers to in my research.
Node has a Single-Threaded, Asynchronous Event-Handling Architecture
Single-Threaded - There is a single event thread that dispatches asynchronous work (typically I/O, but it can be computation) and executes callbacks (i.e. handles the results of async work).
The event thread runs in an infinite "event loop" doing the 2 jobs above: a) handling requests by dispatching async work, and b) noticing that previously dispatched async work has completed and executing a callback to process the results.
The common analogy here is of the restaurant order taker: the event thread is a super-fast waiter that takes orders (services requests) from the dining room and delivers the orders to the kitchen to be prepared (dispatches async work), but also notices when food is ready (async results) and delivers it back to the table (callback execution).
The waiter doesn't cook any food; his job is to go back and forth between the dining room and the kitchen as quickly as possible. If he gets bogged down taking an order in the dining room, or if he is forced to go back into the kitchen to prepare one of the meals, the system becomes inefficient and throughput suffers.
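To make sure I have that part right, here's a tiny sketch of what I understand the "waiter stuck cooking" failure mode to look like (the 2-second busy loop is just an arbitrary stand-in for CPU-heavy work done on the event thread):

```js
// While this synchronous loop runs, the single event thread can't service
// anything else: no new requests, no completed-I/O callbacks, no timers.
setTimeout(() => console.log('timer callback'), 0);

const start = Date.now();
while (Date.now() - start < 2000) {
  // busy-wait: CPU-heavy "kitchen work" done on the event thread itself
}
console.log('blocking work done'); // prints first
// only after the synchronous code finishes does the 0 ms timer's callback run
```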
Asynchronous - The asynchronous workflow resulting from a request (e.g. a web request) is logically a chain, e.g.:
FIRST [ASYNC: read a file, figure out what to get from the database] THEN
[ASYNC: query the database] THEN
[format and return the result].
The work labeled "ASYNC" above is "kitchen work", and the "FIRST []" and "THEN []" represent the waiter's involvement in dispatching the work and initiating a callback when it completes.
Chains like this are represented programmatically in 3 common ways:
nested functions/callbacks
promises chained with .then()
async functions that await async results.
All three coding approaches are essentially equivalent, although async/await appears to be the cleanest and makes reasoning about asynchronous code easier.
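To make the chain concrete, here's roughly the same workflow written all three ways (just a sketch to check my understanding; `queryDb`, `queryDbCb`, `formatResult`, `sendResponse`, and `handleError` are stand-ins I made up, and `config.json` is an arbitrary file name):

```js
const fs = require('fs');
const fsp = fs.promises;

// Made-up stand-ins for the database and response-formatting steps
const queryDbCb = (q, cb) => setImmediate(() => cb(null, [{ query: q, rows: 3 }])); // callback style
const queryDb = (q) => Promise.resolve([{ query: q, rows: 3 }]);                    // promise style
const formatResult = (rows) => JSON.stringify(rows);
const sendResponse = (body) => console.log(body);
const handleError = (err) => console.error(err);

// 1) nested functions/callbacks
fs.readFile('config.json', 'utf8', (err, text) => {
  if (err) return handleError(err);
  queryDbCb(text.trim(), (err, rows) => {
    if (err) return handleError(err);
    sendResponse(formatResult(rows));
  });
});

// 2) promises chained with .then()
fsp.readFile('config.json', 'utf8')
  .then((text) => queryDb(text.trim()))
  .then((rows) => sendResponse(formatResult(rows)))
  .catch(handleError);

// 3) async/await
async function handleRequest() {
  try {
    const text = await fsp.readFile('config.json', 'utf8');
    const rows = await queryDb(text.trim());
    sendResponse(formatResult(rows));
  } catch (err) {
    handleError(err);
  }
}
handleRequest();
```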
This is my mental picture of what's going on...is it correct? Comments very much appreciated!
Questions
My questions concern the use of OS-supported asynchronous operations, who actually does the asynchronous work, and the ways in which this architecture is more performant than the "spawn a thread per request" (i.e. multiple cooks) architecture:
Node libraries have been designed to be asynchronous by making use of the cross-platform async library libuv, correct? Is the idea here that libuv presents Node (on all platforms) with a consistent async I/O interface, but then uses platform-dependent async I/O operations under the hood? In the case where the I/O request goes "all the way down" to an OS-supported async operation, who is "doing the work" of waiting for the I/O to return and triggering Node? Is it the kernel, using a kernel thread? If not, who? In any case, how many requests can this entity handle?
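For instance (just a sketch; the port number is arbitrary), my understanding is that a server like this never ties up a thread per socket, because each socket is just a handle the event loop watches via the kernel's readiness mechanism:

```js
const net = require('net');

// Each connection is a handle registered with libuv's event loop.
// No thread blocks per socket: the kernel-level mechanism (epoll, kqueue,
// IOCP, etc.) tells the loop which handles are ready, and the loop then
// runs the corresponding callbacks on the single event thread.
const server = net.createServer((socket) => {
  socket.on('data', (chunk) => socket.write(chunk)); // simple echo
});

server.listen(3000, () => console.log('echo server listening on :3000'));
```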
I've read that libuv also makes use of a thread pool (typically pthreads, one per core?) internally. Is this to "wrap" operations that do not "go all the way down" as async, so that a thread can sit and wait on the synchronous operation while libuv still presents an async API?
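Concretely, I believe these are the kinds of calls that end up on that pool (a sketch; my understanding is that the default pool size is 4 and can be changed with the UV_THREADPOOL_SIZE environment variable):

```js
const fs = require('fs');
const crypto = require('crypto');

// As far as I understand, these async APIs are backed by libuv's worker
// thread pool, because the underlying work is blocking or CPU-bound:
fs.readFile(__filename, () => console.log('file read done'));   // file I/O
crypto.pbkdf2('secret', 'salt', 100000, 64, 'sha512',
  (err, key) => console.log('pbkdf2 done'));                    // CPU-heavy hashing

// By contrast, plain sockets (net/http) are not run on the pool; they go
// through the event loop's readiness notifications as in the sketch above.
```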
With regard to performance, the usual illustration given to explain the performance boost a Node-like architecture can provide is this: picture the (presumably slower and fatter) thread-per-request approach. There's latency, CPU, and memory overhead to spawning a bunch of threads that just sit around waiting for I/O to complete (even if they're not busy-waiting) and then tearing them down, and Node largely makes this go away because it uses a single long-lived event thread to dispatch async I/O to the OS/kernel, right? But at the end of the day, SOMETHING is sleeping on a mutex and getting woken up when the I/O is ready... is the idea that it's far more efficient for the kernel to do this than for a userland thread to do it? And finally, what about the case where the request is handled by libuv's thread pool? That seems similar to the thread-per-request approach, except for the efficiency of using a pool (avoiding the bring-up and tear-down), but what happens when there are many requests and the pool has a backlog? Latency increases, and now you're doing worse than thread-per-request, right?
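To illustrate the backlog scenario I'm asking about (a rough sketch; the job count and iteration count are arbitrary): with the default pool of 4 threads, kicking off 8 heavy pbkdf2 jobs at once should finish in two "waves", with jobs 5-8 queued behind jobs 1-4 so their latency roughly doubles.

```js
const crypto = require('crypto');
const start = Date.now();

// With the default UV_THREADPOOL_SIZE of 4, these 8 jobs complete in two
// "waves": jobs 5-8 sit in libuv's queue until a pool thread frees up.
// (Running with UV_THREADPOOL_SIZE=8 should collapse them into one wave.)
for (let i = 1; i <= 8; i++) {
  crypto.pbkdf2('secret', 'salt', 500000, 64, 'sha512', () => {
    console.log(`job ${i} finished after ${Date.now() - start} ms`);
  });
}
```

Is that the right mental model for where the backlog (and the extra latency) shows up?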