I have a Node app, built on Express, that uses a web scraper to load and parse data.

I've read a lot about Node.js's scalability and its ability to handle a heap of concurrent connections, but when you're running a web scraper (sending off 1000+ concurrent requests) things start to crumble a bit.

While the scraper is running, my server is unresponsive to other API requests, and running several instances at once slows things down to a snail's pace.

I can't find any documentation on what the limits are, what they should be, how many requests I should be pooling together and so on.

Should I be limiting my scraper's requests to 10 per second? 100 per second? 1000 per second? Or should I instead increase the amount of CPU/memory allocated to my Node process on my VPS?

EDIT: For those voting to close because this question is too opinion-based, here's concretely what I'm asking:

  1. How many HTTP requests can an Express app perform simultaneously before it starts to hit performance problems?
  2. Does increasing the memory/CPU available to the app help in any way?
JVG
  • When we say Node.js can handle 1000+ concurrent requests, those are essentially non-blocking requests, i.e. not very CPU-intensive tasks. If the web scraper is a very CPU-intensive task, it's better to use a cluster of Node servers with a load balancer on top of them (see the sketch after these comments). – Aman Gupta Jan 20 '16 at 20:06
  • @AmanGupta Awesome, these are terms I've not heard before. Can you suggest any resources for learning more about load balancing and working in clusters? – JVG Jan 20 '16 at 20:11
  • You can start with this: http://www.sitepoint.com/how-to-create-a-node-js-cluster-for-speeding-up-your-apps/ – Aman Gupta Jan 20 '16 at 20:13
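
For reference, the standard pattern with Node's built-in cluster module looks roughly like this (a minimal sketch; the port and response body are placeholders):

var cluster = require('cluster');
var http = require('http');
var numCPUs = require('os').cpus().length;

if (cluster.isMaster) {
  // fork one worker per CPU core; the master distributes
  // incoming connections among the workers
  for (var i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  cluster.on('exit', function(worker) {
    console.log('worker ' + worker.process.pid + ' died');
  });
} else {
  // each worker runs its own server; they share the listening port
  http.createServer(function(req, res) {
    res.end('handled by worker ' + process.pid);
  }).listen(8000);
}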

1 Answer

There are a lot of different ways to assess Node's performance. Node is usually recommended for I/O bound workloads as opposed to CPU bound workloads, although the V8 engine it runs on continues to improve.

An important aspect of getting Node to perform is coding in a way that enables its "non-blocking" execution model. This means using callback functions and/or promises for control flow, instead of traditional synchronous methods. Node will block if you do not write asynchronous code because the event loop will hang up on code that needs any non-trivial amount of time to complete.
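
As a minimal illustration of the difference, using the built-in fs module (data.txt is a placeholder file name):

var fs = require('fs');

// blocking: nothing else can run until the whole file has been read
var text = fs.readFileSync('data.txt', 'utf8');
console.log(text);

// non-blocking: the read is handed off to the system, and the callback
// fires later; the event loop is free to service other work meanwhile
fs.readFile('data.txt', 'utf8', function(err, contents) {
  if (err) throw err;
  console.log(contents);
});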

I/O can (and should) be made asynchronous with Node, but CPU-heavy activities (like parsing .xml after you scrape it) cannot (or not to the same degree), so the event loop will end up hanging up on each long CPU task.
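
You can see this for yourself with a toy example: a synchronous, CPU-heavy loop delays everything else queued on the event loop, so the 100ms timer below fires far later than scheduled:

var start = Date.now();

// a timer that should fire after roughly 100ms
setTimeout(function() {
  console.log('timer fired after ' + (Date.now() - start) + 'ms');
}, 100);

// a CPU-heavy synchronous loop; the event loop cannot run the timer
// callback until this finishes
var total = 0;
for (var i = 0; i < 1e9; i++) {
  total += i;
}
console.log('loop finished after ' + (Date.now() - start) + 'ms');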

To apply this to your specific use case and address performance issues, it may be helpful if you posted some of your scraper's request code.

Note: I apologize in advance if you already understand these concepts and this is below your skill level.

I've included a snippet of code that starts a series of requests for a range of .xml resources and prints the responses to the console. If you run this code, you will notice that the printing often occurs "out of order", since each request can take a different amount of time to complete. The advantage of giving the http.request() method a callback, rather than waiting on each request to finish before starting the next, is that once a request starts, your application can continue to run and accept new requests. The work is completed incrementally with each turn of the Node event loop.

This code snippet can be greatly simplified by using a library that specializes in making requests. A well-known one is called request (aptly named), and it can make your code more succinct; see the sketch after the snippet below.

As a side note, using console.log() a lot in your project can itself cause performance issues.

var http = require('http');

function getData(index) {
  var options = {
    'hostname' : 'example.com',
    'path' : '/data' + index + '.xml',
    'method' : 'GET'
  };
  var req = http.request(options, function(response) {
    var fullText = "";
    // listen for incoming chunks of data and accumulate them
    response.on('data', function(more) {
      fullText += more;
    });
    // when the response is complete, print it
    // (the 'end' event passes no arguments to its listener)
    response.on('end', function() {
      console.log(fullText);
    });
  });
  // do not fail silently; show error details
  req.on('error', function(e) {
    console.error(e);
  });
  req.end();
}

for (var i = 0; i < 1000; ++i) {
  getData(i);
}
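
For comparison, here is a sketch of roughly the same loop using the request library (install it with npm install request; the URL is a placeholder):

var request = require('request');

for (var i = 0; i < 1000; ++i) {
  // request() takes a URL and a callback that receives the full body
  request('http://example.com/data' + i + '.xml', function(error, response, body) {
    if (error) {
      return console.error(error);
    }
    console.log(body);
  });
}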
Will R.
  • This is awesome mate, and no need to apologise as this is exactly the sort of thing I was hoping for. While I'm getting a lot better at understanding blocking/non-blocking code in Node, it's difficult to find simple explanations of these concepts. Definitely didn't know that about `console.log()` either! – JVG Jan 20 '16 at 20:40
  • One final question, if I implement a queuing system for `request`, any suggestions on how many requests I should batch together and how often to send them? – JVG Jan 20 '16 at 21:47
  • http://stackoverflow.com/a/19101225/3602796 has a good explanation of how the request module actually does queuing for you when using a pooling mode. – Will R. Jan 20 '16 at 22:07
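
Following up on that last comment, here is a minimal sketch of what that pooling looks like with request's pool option (maxSockets: 10 is an arbitrary starting point, not a recommendation; tune it for your workload):

var request = require('request');

// cap the number of concurrent sockets per host; further requests are
// queued by the agent until a socket frees up
var pooledRequest = request.defaults({ pool: { maxSockets: 10 } });

for (var i = 0; i < 1000; ++i) {
  pooledRequest('http://example.com/data' + i + '.xml', function(error, response, body) {
    if (error) {
      return console.error(error);
    }
    console.log(body);
  });
}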