I have a Socket.IO app whose purpose is to search for data on different sites, something like a crawler. The main problem is that the search takes too long, and while it runs my app gets stuck. For example, if one user starts a search, a second user has to wait until the first search completes.

Each site that needs to be searched is represented as a separate class, so I do something like:

selected_sites.forEach(function(site_name) {
    var site = new sites[site_name]();

    site.on('found', function(data) {
        socket.emit('found', data);
    });

    site.on('not_found', function() {
        socket.emit('not_found', 'Nothing found at ' + site.getSiteName());
    });

    site.search(socket_data.params);
});

Is it possible to move the class body (the search process) somewhere else, such as a new thread, so that the event loop is not blocked while the search is in progress?

Kin
  • How is `site.search()` implemented? I find it hard to believe that it would be scraping a site *synchronously*. – mscdex Apr 08 '16 at 17:26
  • No, you can't run a new thread in Node.js; it's single threaded. You can use the `cluster` module to run a Node process on each CPU core, or move complex functions to an external Node app and make async calls to it from the main app. – alexmac Apr 08 '16 at 17:27
  • @mscdex there are 11 sites which provide from 300 to 500 results each, not including pagination. The `search` method makes a few requests whose callbacks make further requests, and so on until it reaches the item. – Kin Apr 08 '16 at 17:28
  • I agree with @mscdex, `site.search` is probably asynchronous. What happens if you put `console.log(site_name)` before calling `site.search`? It'll probably output all 11 site names before it even starts crawling the first item. – goenning Apr 08 '16 at 17:35
  • @goenning, but as I said before, the search method makes a lot of `request` calls, around 500-600 for each site. In general, this is the situation described here: http://zef.me/blog/4561/node-js-and-the-case-of-the-blocked-event-loop – Kin Apr 08 '16 at 18:37
  • `request` is an I/O operation (asynchronous for node), unlike `JSON.parse()` from that blog example, which depends on CPU time (synchronous). Every time you fire a new `request`, the event loop is free again. The problem I see here is what you're doing with the response from that request. If it's CPU intensive, then you may need to follow the suggestions from the other answers (clustering). – goenning Apr 08 '16 at 19:06
  • @goenning in the callback there is parsing with `cheerio`. What would you suggest putting in the cluster: the full site search or only the parsing? – Kin Apr 08 '16 at 19:09
  • I have a similar project that's crawling 20+ sites, but it all happens in the background; there is no user interaction like yours. Searching 11 sites * 400 requests each will take a long, long time, be it threaded or not. In my project, a cron task is spawned every minute and my node script is executed. It takes a single website and starts crawling it. The next minute another node process is started and another website is taken for crawling. – goenning Apr 08 '16 at 19:29

4 Answers

Node.js does not allow you to run multiple threads of JavaScript execution at the same time. A single node.js process only runs one JavaScript thread of execution at a time. Because of asynchronous I/O, multiple JavaScript operations may be "in flight" at any given time, but only one is actually running at any given time (while the others may be waiting for I/O operations to complete).

The usual way to solve a problem where you want some longer running and/or CPU intensive operation to run in the background while your server stays free to handle incoming requests is to move the time consuming operation into its own node.js process (often using the child process module) and then allow those two processes to share information as required, either via a database or via some interprocess communication like sockets.
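A minimal sketch of that approach, not a definitive implementation. The `searchInChild` helper and the busy loop standing in for CPU-heavy parsing are assumptions made for illustration. To keep the sketch in a single file, the child's source is passed inline with `node -e` and an IPC channel is requested through the `stdio` option; in a real app you would more likely put the child code in its own file and start it with `child_process.fork()`.

```javascript
const { spawn } = require('child_process');

// Source of the child process, inlined so the sketch is a single file.
// The loop is a hypothetical stand-in for CPU-heavy work such as parsing.
const childSource = `
  process.on('message', (msg) => {
    let checksum = 0;
    for (let i = 0; i < 1e7; i++) checksum = (checksum + i) % 1000;
    // Reply to the parent, then close the IPC channel so the child can exit.
    process.send({ site: msg.site, checksum }, () => process.disconnect());
  });
`;

// Hypothetical helper: runs one search in a separate Node.js process,
// so the parent's event loop stays free while the child is busy.
function searchInChild(site) {
  return new Promise((resolve, reject) => {
    const child = spawn(process.execPath, ['-e', childSource], {
      // The 'ipc' entry enables child.send() and the 'message' event.
      stdio: ['ignore', 'inherit', 'inherit', 'ipc'],
    });
    child.once('message', resolve);
    child.once('error', reject);
    child.once('exit', (code) => {
      if (code !== 0) reject(new Error(`Child exited with code ${code}`));
    });
    child.send({ site });
  });
}
```

In the question's setup, the parent could call `searchInChild(site_name)` for each selected site and forward each result to the client with `socket.emit('found', result)`.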

If you have multiple CPU intensive operations, you can fire up multiple secondary processes or you can use the node.js clustering module in order to take maximum advantage of all CPUs in the host computer.

You should know that if most of your code is just networking or file I/O, then that can all be done with asynchronous operations and your node.js server will scale quite well to doing many different things in parallel. If you have CPU intensive operations (lots of parsing or calculations), then you will want to start up multiple processes in order to more effectively utilize multiple CPUs and let the system time slice the work for you.

Update in 2020: Node.js now has threading. You can use Worker Threads. This would not be needed to parallelize I/O operations, but could be useful for parallelizing CPU-heavy operations and taking advantage of multiple CPU cores.
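As a rough sketch of what that can look like with the built-in `worker_threads` module: the `countMatchesInWorker` helper and the substring match standing in for real parsing are assumptions, not part of the original question. The worker source is inlined with `eval: true` only to keep the example in one file; normally it would live in its own module.

```javascript
const { Worker } = require('worker_threads');

// Worker source, inlined (eval: true) so the sketch is a single file.
// The substring check is a hypothetical stand-in for CPU-heavy parsing.
const workerSource = `
  const { parentPort, workerData } = require('worker_threads');
  let matches = 0;
  for (const page of workerData.pages) {
    if (page.includes(workerData.term)) matches++;
  }
  parentPort.postMessage(matches);
`;

// Hypothetical helper: counts the pages containing `term` off the main thread.
function countMatchesInWorker(pages, term) {
  return new Promise((resolve, reject) => {
    const worker = new Worker(workerSource, {
      eval: true,
      workerData: { pages, term },
    });
    worker.once('message', resolve);
    worker.once('error', reject);
    worker.once('exit', (code) => {
      if (code !== 0) reject(new Error(`Worker exited with code ${code}`));
    });
  });
}
```

The main thread only awaits the promise, so it can keep serving Socket.IO events while the worker does the parsing.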

jfriend00
  • Are there any tutorials on how to `start up multiple processes`? That is exactly what I need to do. – Kin Apr 08 '16 at 18:43
  • @Kin - Both the links in my answer (the clustering link and the child process link) show code examples. And, if you search for nodejs clustering or nodejs child process, there are thousands of articles. You probably want to make one master nodejs process for your crawler and then start up a pool of child processes that each crawl a site and report back results. – jfriend00 Apr 08 '16 at 19:53

NodeJS is single threaded, but you are able to create clusters. I recommend reading: http://www.sitepoint.com/how-to-create-a-node-js-cluster-for-speeding-up-your-apps/

With this, you are able to share server handles and use Inter-process communication to communicate with the parent Node process.

Hard Tacos

So you have a few options here. Depending on what exactly the search function does, one of these options would work the best:

  1. Node.js child processes

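  2. Writing the search method asynchronously. If it is implemented in JavaScript, then this should be possible using process.nextTick (see this question), although `setImmediate` is usually the better fit for breaking up long work, because `process.nextTick` callbacks run before pending I/O is handled. If it is a C/C++ implementation, it's more complicated, and child processes would probably be the way to go.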
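The second option can be sketched as processing the work in small batches and yielding to the event loop between batches. This is only an illustration: `processInChunks`, `handleItem`, and `chunkSize` are hypothetical names, and the sketch uses `setImmediate` so that pending I/O gets a turn between batches.

```javascript
// Hypothetical helper: applies `handleItem` to every item in batches of
// `chunkSize`, yielding to the event loop between batches.
function processInChunks(items, handleItem, chunkSize = 100) {
  return new Promise((resolve, reject) => {
    const results = [];
    let index = 0;

    function runChunk() {
      try {
        const end = Math.min(index + chunkSize, items.length);
        for (; index < end; index++) {
          results.push(handleItem(items[index]));
        }
      } catch (err) {
        reject(err);
        return;
      }
      if (index < items.length) {
        // Yield so that pending I/O callbacks can run before the next batch.
        setImmediate(runChunk);
      } else {
        resolve(results);
      }
    }

    runChunk();
  });
}
```

This keeps the server responsive but does not make the work itself faster, since everything still runs on one thread. For heavy parsing, a child process is still likely the better choice.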

Chandler Freeman

Since this question is 2 years old now, I thought I'd give an update.

Most answers here are based on the claim that NodeJS is single threaded, which is only partly true.
NodeJS is event driven with a single-threaded event loop. While this is still the case, NodeJS was recently extended with multi-threading support (since NodeJS v10.5.0) in the form of so-called Worker Threads.

Those features are still experimental, so it is probably better to stick to Child Processes for now.
I just wanted to give an update on that, since NodeJS is now considered multithreaded.

NullDev