236

I've started tinkering with the Node.js HTTP server and I really like writing server-side JavaScript, but something is keeping me from using Node.js for my web application.

I understand the whole async I/O concept, but I'm somewhat concerned about the edge cases where procedural code is very CPU intensive, such as image manipulation or sorting large data sets.

As I understand it, the server will be very fast for simple web page requests such as viewing a listing of users or viewing a blog post. However, if I want to write very CPU intensive code (in the admin back end, for example) that generates graphics or resizes thousands of images, the request will be very slow (a few seconds). Since this code is not async, every request coming to the server during those few seconds will be blocked until my slow request is done.
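To make the concern concrete, here is a minimal sketch of what worries me (the URL, port and busy-loop are made up; the loop just stands in for real image work):

    var http = require('http');

    http.createServer(function (req, res) {
      if (req.url === '/slow') {
        // Synchronous CPU-bound work: the event loop is stuck here,
        // so nothing else is served until this loop finishes.
        var sum = 0;
        for (var i = 0; i < 1e9; i++) sum += i;
        res.end('done: ' + sum + '\n');
      } else {
        // Normally instant, but queued behind /slow while the loop runs.
        res.end('hello\n');
      }
    }).listen(8000);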

One suggestion was to use Web Workers for CPU intensive tasks. However, I'm afraid Web Workers will make it hard to write clean code, since they work by including a separate JS file. What if the CPU intensive code is located in an object's method? It kind of sucks to write a JS file for every method that is CPU intensive.

Another suggestion was to spawn a child process, but that makes the code even less maintainable.
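For reference, here is roughly what the child-process suggestion looks like with a generic worker file (file names are made up, and fib() is just a stand-in for real CPU work; ideally the heavy function would live in a shared module that both files require):

    // worker.js -- generic worker entry point
    function fib(n) { return n < 2 ? n : fib(n - 1) + fib(n - 2); }

    process.on('message', function (msg) {
      // Runs in its own process, so the server's event loop stays free.
      process.send({ id: msg.id, result: fib(msg.n) });
    });

    // in the server: fork the worker and keep serving while it crunches
    var fork = require('child_process').fork;

    var worker = fork(__dirname + '/worker.js');
    worker.on('message', function (msg) {
      console.log('job', msg.id, 'done:', msg.result);
    });
    worker.send({ id: 1, n: 40 });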

Any suggestions to overcome this (perceived) obstacle? How do you write clean object oriented code with Node.js while making sure CPU heavy tasks are executed async?

Olivier Lalonde
  • Olivier, you asked the identical question I had in mind (new to Node), specifically with regard to processing images. In Java I can use a fixed-thread ExecutorService, pass it all the resize jobs and wait on them to finish from all the connections; in Node, I haven't figured out how to shuffle off work to an external module that limits (let's say) the maximum number of simultaneous operations to 2 at a time. Did you find an elegant way of doing this? – Riyad Kalla Sep 02 '11 at 23:13

5 Answers

306

This is a misunderstanding of what a web server is -- it should only be used to "talk" with clients. Heavy load tasks should be delegated to standalone programs (which of course can also be written in JS).
You'd probably say that this is dirty, but I assure you that a web server process stuck resizing images is worse (even for, let's say, Apache, where it does not block other queries). Still, you can use a common library to avoid code redundancy.
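For illustration, a minimal sketch of the delegation idea; resize.js stands for any standalone program (which, as said, can also be written in JS):

    var http = require('http');
    var spawn = require('child_process').spawn;

    http.createServer(function (req, res) {
      if (req.url === '/resize') {
        // Hand the job to a standalone program and answer immediately;
        // the web server never does the heavy work itself.
        var job = spawn('node', ['resize.js', 'photo.png']);
        job.on('exit', function (code) {
          console.log('resize finished with code ' + code);
        });
        res.end('resize scheduled\n');
      } else {
        res.end('ok\n');
      }
    }).listen(8000);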

EDIT: I have come up with an analogy; a web application should be like a restaurant. You have waiters (the web server) and cooks (the workers). Waiters are in contact with clients and do simple tasks like handing out menus or explaining whether a dish is vegetarian. On the other hand, they delegate the harder tasks to the kitchen. Because the waiters are doing only simple things, they respond quickly, and the cooks can concentrate on their job.

Node.js here would be a single but very talented waiter that can process many requests at a time, while Apache would be a gang of dumb waiters that each process just one request. If this one Node.js waiter began to cook, it would be an immediate catastrophe. Still, cooking could also exhaust even a large supply of Apache waiters, not to mention the chaos in the kitchen and the progressive decrease in responsiveness.

mbq
  • Well, in an environment where web servers are multi-threaded or multi-process and can handle more than one concurrent request, it is very common to spend a couple of seconds on a single request. People have come to expect that. I'd say that the misunderstanding is that node.js is a "regular" web server. Using node.js you have to adjust your programming model a bit, and that includes pushing "long-running" work out to some asynchronous worker. – Thilo Aug 16 '10 at 09:39
  • @Thilo Right, but it is still a bad practice. While Node just makes bad practices work badly, I think that is an advantage of its approach. – mbq Aug 16 '10 at 09:43
  • That was my original plan when discovering Node.js, but it seems the general consensus is to keep everything in one process while trying to make everything non-blocking. Perhaps I should not blindly follow the general consensus and spawn a child process for every request? – Olivier Lalonde Aug 16 '10 at 09:46
  • Don't spawn a child process for every request (that defeats the purpose of node.js). Spawn workers from inside your heavy requests only. Or route your heavy background work to something other than node.js. – Thilo Aug 16 '10 at 09:49
  • Great analogy. I have kids, and it brought to mind the culminating scene in the movie Ratatouille where the guy was the lone waiter racing around on roller skates and the kitchen was staffed by an army of small worker processes (rats). :D – Paul Sep 14 '11 at 19:10
  • Ha, I really like that. "Node.js: making bad practices work badly" – ethan Sep 27 '11 at 00:07
  • @mbq I like the analogy but it could use some work. The traditional multithreaded model would be a person who is both waiter and cook. Once the order is taken, that person has to go back and cook the meal before being able to handle another order. The node.js model has the nodes as waiters, and webworkers as cooks. The waiters handle fetching/resolving the requests while the workers manage the more time-intensive tasks. If you need to scale larger you just make the main server a node cluster and reverse proxy the CPU intensive tasks to other servers built for multi-threaded processing. – Evan Plaice Jan 24 '12 at 21:53
  • @EvanPlaice Well, my main point is that an http server shouldn't do any "cooking" at all (even if scalability is not an issue, it is a less secure and overcomplicated solution, not to mention that it is in fact much harder to maintain); that's why it was equated with the waiter(s) only. – mbq Jan 25 '12 at 00:10
  • @mbq I think we're saying the same thing. Waiters (IO, dispatching) == Node.js and cooks (CPU bound, time intensive) == something else like a webworker, nginx server, etc... I.e. if scalability is an issue, use the right tool for the job. – Evan Plaice Jan 25 '12 at 00:13
  • +1 this is one of the best analogies for evented I/O that I have ever seen. – Brandon Mar 11 '13 at 17:07
  • _Heavy load tasks should be delegated to standalone programs_ -- so you're saying that I should send those commands to another program which resides on another computer? But that takes time too. Can you please explain? If I have very intensive work, what should I do? – Royi Namir Apr 14 '13 at 08:45
  • @RoyiNamir I haven't said anything about another computer; both may reside on one. The point is that a web server is the last application that should care about lengthy jobs being done -- it should only schedule jobs and deliver their results. You need some other tool to queue and possibly distribute jobs, execute them, handle their results and failures, and watch for resources. – mbq Apr 14 '13 at 10:53
  • It seems to me that if you have to jump through so many hoops to get around the design of node.js, maybe node.js isn't the best solution to the CPU intensive code problem. – stu Feb 08 '15 at 13:17
  • @stu Yes, NodeJS should not be used for CPU intensive jobs. It's built for scalability, and scalability can be achieved only if you don't block the single thread of Node. Yes, Node is single threaded, so blocking it with a CPU intensive task will kill its benefits. As mentioned by many others on this thread, NodeJS is good as a web server, which just takes a request, delegates it and finally gives back the response. Here is a good write-up on this topic, CPU intensive tasks in Node: http://neilk.net/blog/2013/04/30/why-you-should-use-nodejs-for-CPU-bound-tasks/ – Gaurav Dhiman Mar 27 '15 at 10:19
  • @mbq If it doesn't create new threads or processes for each new task that arrives, what exactly does it do with them? Even if they don't block other tasks from getting started, these tasks still run somehow using the same fixed CPU and memory of the machine. This machine has to do all the work somehow regardless of the software used. If I understand it right, the difference is that with events the entire CPU time for a task is chopped into smaller pieces which can be interleaved with other task time so that easy tasks finish sooner on average and harder ones later on average. Is that right? – Dan Cancro Sep 13 '15 at 00:34
  • Yup; the idea is that a network application mostly waits for the OS and hardware to perform communication, and these chunks of waiting are the points where the single Node execution process jumps between tasks (from the code's point of view, those are the times between callback installation and execution). – mbq Sep 14 '15 at 12:02
  • This answer is BS. Who are you to determine what a web server is? Web servers can certainly do heavy-weight processing in response to requests. It's just that in most web servers, you don't have to worry about it, because each request is coming in on a separate thread. – Jez Sep 08 '16 at 10:59
  • Loved the analogy!! – Rachit Kyte.One Jan 04 '18 at 12:59
  • Can we use Node.js Cluster to implement the cooks in Node.js? – Indika K Dec 09 '19 at 04:20
  • I will probably catch some flak for this, but I think you probably shouldn't be using Node. There are many languages that allow you to more gracefully handle concurrent operations in one request. Obviously you don't want to spawn off too much load in one request, but it's common to want to make operations happen concurrently. Go, Java, Rust are some examples of languages that would handle this type of problem better. – pwaterz Nov 04 '21 at 14:54
63

What you need is a task queue! Moving your long-running tasks out of the web server is a GOOD thing. Keeping each task in a separate JS file promotes modularity and code reuse. It forces you to think about how to structure your program in a way that will make it easier to debug and maintain in the long run. Another benefit of a task queue is that the workers can be written in a different language. Just pop a task, do the work, and write the response back.

Something like this: https://github.com/resque/resque

Here is an article from GitHub about why they built it: http://github.com/blog/542-introducing-resque
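Resque itself is Ruby and Redis-backed, but the core idea is small; here is a minimal in-process sketch of a queue with a fixed number of simultaneous jobs (the limit of 2 is arbitrary, and a real task queue would also move the workers into separate processes):

    function TaskQueue(concurrency) {
      this.concurrency = concurrency; // max jobs in flight at once
      this.running = 0;
      this.tasks = [];
    }

    TaskQueue.prototype.push = function (task) {
      this.tasks.push(task);
      this._next();
    };

    TaskQueue.prototype._next = function () {
      var self = this;
      while (this.running < this.concurrency && this.tasks.length) {
        var task = this.tasks.shift();
        this.running++;
        task(function done() {      // each task calls done() when finished
          self.running--;
          self._next();
        });
      }
    };

    // Usage: at most two resize jobs run at any moment.
    var queue = new TaskQueue(2);
    queue.push(function (done) { /* resize an image, then */ done(); });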

oleksii
Tim
  • Why are you linking to Ruby libraries in a question specifically grounded in the node world? – Jonathan Dumaine Dec 03 '11 at 22:21
  • @JonathanDumaine It's a good implementation of a task queue. Read the Ruby code and rewrite it in JavaScript. PROFIT! – Simon Stender Boisen Jan 12 '13 at 22:12
  • I'm a big fan of Gearman for this; Gearman workers don't poll a Gearman server for new jobs -- new jobs are instantly pushed to the workers. Very responsive. – Casey Flynn Mar 14 '13 at 15:43
  • In fact, someone has ported it to the node world: https://github.com/technoweenie/coffee-resque – FrontierPsycho Mar 11 '15 at 10:20
  • @pacerier, why do you say that? What do you propose? – luis.espinal Jun 21 '17 at 09:59
  • If you were looking for an approach that stays in Node, then I recommend reading http://neilk.net/blog/2013/04/30/why-you-should-use-nodejs-for-CPU-bound-tasks/ And if you want to skip the read, and get straight to using threads as workers, check out https://github.com/xk/node-threads-a-gogo – Brendan Weinstein Oct 17 '17 at 02:11
  • Also see https://github.com/audreyt/node-webworker-threads and https://github.com/avoidwork/tiny-worker which appear to be more actively maintained – Brendan Weinstein Oct 17 '17 at 06:20
  • bottleneck (https://github.com/SGrondin/bottleneck) has been a good module for easily creating queue rules in javascript/node. Though I have also been exploring rxjs as well. – omencat Dec 06 '17 at 19:00
  • For anyone wondering, it now has native node support https://nodejs.org/api/worker_threads.html – BonisTech Dec 21 '22 at 21:44
27

You don't want your CPU-intensive code to execute async, you want it to execute in parallel. You need to get the processing work out of the thread that's serving HTTP requests; that's the only way to solve this problem. With Node.js the answer is the cluster module (https://nodejs.org/api/cluster.html), for spawning child processes to do the heavy lifting. (AFAIK Node doesn't have any concept of threads/shared memory; it's processes or nothing.)

You have two options for how you structure your application. You can get the 80/20 solution by spawning 8 HTTP servers and handling compute-intensive tasks synchronously on the child processes. Doing that is fairly simple: you could take an hour to read about it in the cluster docs, and if you just rip off the example code at the top you will get yourself 95% of the way there.
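A sketch along the lines of the example at the top of the cluster docs (the port is arbitrary):

    var cluster = require('cluster');
    var http = require('http');
    var numCPUs = require('os').cpus().length;

    if (cluster.isMaster) {
      // One worker per core; the master only supervises.
      for (var i = 0; i < numCPUs; i++) cluster.fork();
      cluster.on('exit', function (worker) {
        console.log('worker ' + worker.process.pid + ' died');
      });
    } else {
      // Workers share the listening socket. A compute-heavy request
      // stalls only this process; the other workers keep serving.
      http.createServer(function (req, res) {
        res.end('handled by ' + process.pid + '\n');
      }).listen(8000);
    }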

The other way to structure this is to set up a job queue and send big compute tasks over the queue. Note that there is a lot of overhead associated with the IPC for a job queue, so this is only useful when the tasks are appreciably larger than the overhead.

I'm surprised that none of these other answers even mention cluster.

Background: Asynchronous code is code that suspends until something happens somewhere else, at which point the code wakes up and continues execution. One very common case where something slow must happen somewhere else is I/O.

Asynchronous code isn't useful if it's your processor that is responsible for doing the work. That is precisely the case with "compute intensive" tasks.

Now, it might seem that asynchronous code is niche, but in fact it's very common. It just happens not to be useful for compute intensive tasks.

Waiting on I/O is a pattern that always happens in web servers, for example. Every client who connects to your server gets a socket. Most of the time the sockets are empty. You don't want to do anything until a socket receives some data, at which point you want to handle the request. Under the hood an HTTP server like Node is using an eventing library (libev) to keep track of the thousands of open sockets. The OS notifies libev, and then libev notifies NodeJS when one of the sockets gets data, and then NodeJS puts an event on the event queue, and your HTTP code kicks in at this point and handles the events one after the other. Events don't get put on the queue until the socket has some data, so events are never waiting on data - it's already there for them.

Single-threaded, event-based web servers make sense as a paradigm when the bottleneck is waiting on a bunch of mostly empty socket connections: you don't want a whole thread or process for every idle connection, and you don't want to poll your 250k sockets to find the next one that has data on it.
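To make that concrete, here is a minimal sketch of the pattern (an echo server on an arbitrary port). The callback fires only once a socket actually has data, so thousands of idle connections cost almost nothing while the process waits inside the event library:

    var net = require('net');

    net.createServer(function (socket) {
      socket.on('data', function (chunk) {
        // By the time this runs, the data is already there;
        // nothing in this program ever blocks waiting for it.
        socket.write('echo: ' + chunk);
      });
    }).listen(8000);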

masonk
  • This should be the correct answer.... As for the solution where you spawn 8 clusters, you'd need 8 cores, right? Or a load balancer with multiple servers. – Muhammad Umer Dec 30 '17 at 05:17
  • Also, what is a good way to learn about the 2nd solution, setting up a queue? The concept of a queue is pretty simple, but it's the messaging part between the processes and the queue that's foreign. – Muhammad Umer Dec 30 '17 at 05:18
  • That's right. You need to get the work onto another core, somehow. For that, you need another core. – masonk Jan 04 '18 at 18:43
  • Re: queues. The practical answer is to use a job queue. There are some available for node. I've never used any of them so I can't make a recommendation. The curiosity answer is that worker processes and queue processes are ultimately going to communicate over sockets. – masonk Jan 04 '18 at 18:45
  • 'Asynchronous code isn't useful if it's your processor that is responsible for doing the work. That is precisely the case with "compute intensive" tasks.' This summed it all! – KJ Sudarshan Jul 04 '21 at 14:54
7

There are a couple of approaches you can use.

As @Tim notes, you can create an asynchronous task that sits outside of or parallel to your main serving logic. It depends on your exact requirements, but even cron can act as a queueing mechanism.

WebWorkers can work for your async processes but they are currently not supported by node.js. There are a couple of extensions that provide support, for example: http://github.com/cramforce/node-worker

You can still reuse modules and code through the standard "require" mechanism. You just need to ensure that the initial dispatch to the worker passes all the information needed to process the results.
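As a comment on another answer points out, worker threads later landed in Node core as worker_threads; a minimal sketch of that route (file names and the doResize() stand-in are made up):

    // main.js -- dispatch a heavy job to a thread
    var Worker = require('worker_threads').Worker;

    var worker = new Worker('./resize-worker.js', {
      workerData: { src: 'photo.png', width: 200 } // everything the job needs
    });
    worker.on('message', function (result) {
      console.log('resized:', result);
    });

    // resize-worker.js -- runs on its own thread; require() works as usual,
    // so the heavy code can stay in whatever module it already lives in.
    var wt = require('worker_threads');

    function doResize(opts) {
      return opts.src + '@' + opts.width; // stand-in for real CPU-heavy work
    }

    wt.parentPort.postMessage(doResize(wt.workerData));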

Toby Hede
0

Using child_process is one solution, but each spawned child process may consume a lot of memory compared to Go goroutines.

You can also use a queue-based solution such as kue.
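A minimal sketch of the kue API (job type and data are made up; kue stores jobs in Redis):

    var kue = require('kue');
    var queue = kue.createQueue();

    // Producer (web process): enqueue and return immediately.
    queue.create('resize', { src: 'photo.png', width: 200 }).save();

    // Consumer (separate worker process): at most 2 jobs in parallel.
    queue.process('resize', 2, function (job, done) {
      // ...do the CPU-heavy resize using job.data...
      done();
    });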

neo