There are many ways to assess Node's performance. Node is usually recommended for I/O-bound rather than CPU-bound workloads, although the V8 engine it runs on continues to improve.
An important part of getting good performance out of Node is writing code that fits its "non-blocking" execution model. This means using callback functions and/or promises for control flow instead of traditional synchronous calls. If you do not write asynchronous code, Node will block, because the event loop stalls on any code that needs a non-trivial amount of time to complete.
I/O can (and should) be made asynchronous with Node, but CPU-heavy work (like parsing the XML after you scrape it) cannot, or at least not to the same degree, so the event loop will stall on each long CPU task.
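To make the stalling concrete, here is a rough sketch. The busyWork() function is a made-up stand-in for CPU-heavy work such as XML parsing (it just spins for a given number of milliseconds), and measureTimerDelay() is a hypothetical helper that reports how late a 0 ms timer fires when the event loop is busy.

```javascript
// Hypothetical stand-in for CPU-heavy work such as parsing a large XML document.
// It spins synchronously for roughly `ms` milliseconds and returns the time spent.
function busyWork(ms) {
    var start = Date.now();
    while (Date.now() - start < ms) {
        // nothing else can run on the event loop while this loop executes
    }
    return Date.now() - start;
}

// Schedule a 0 ms timer, then block the event loop for `blockMs` milliseconds.
// The callback receives how long the timer actually had to wait.
function measureTimerDelay(blockMs, callback) {
    var scheduled = Date.now();
    setTimeout(function () {
        callback(Date.now() - scheduled);
    }, 0);
    busyWork(blockMs);
}

measureTimerDelay(200, function (delay) {
    console.log('0 ms timer fired after ' + delay + ' ms');
});
```

The timer asks to run immediately, but it cannot fire until busyWork() returns, so the reported delay is at least 200 ms. Pending I/O callbacks in a scraper would be held up in the same way.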
To apply this to your specific use case and address the performance issues, it would help if you posted some of your scraper's request code.
Note: I apologize in advance if you already understand these concepts and this is below your skill level.
I've included a snippet of code that starts a series of requests for a range of .xml resources and prints the responses to the console. If you run this code, you will notice that the printing often occurs "out of order", since each request can take a different amount of time. The advantage of giving the http.request() method a callback, rather than waiting for each response before doing anything else, is that once a request has been sent your application keeps running and can start further requests. The work is completed incrementally, as the event loop handles each response when it arrives.
This code snippet can be greatly simplified by using a library that specializes in HTTP requests. A well-known one is request (aptly named), which can make your code more succinct.
As a side note, using console.log() a lot in your project can cause performance issues.
var http = require('http');

// Request one .xml resource and print its body once it has fully arrived
function getData(index) {
    var options = {
        'hostname' : 'example.com',
        'path' : '/data' + index + '.xml',
        'method' : 'GET'
    };
    var req = http.request(options, function(response) {
        var fullText = "";
        // decode incoming chunks as UTF-8 strings instead of raw Buffers
        response.setEncoding('utf8');
        // listen for incoming data and add it to existing data
        response.on('data', function(more) {
            fullText += more;
        });
        // when the response is complete, print it ('end' passes no arguments)
        response.on('end', function() {
            console.log(fullText);
        });
    });
    // Do not fail silently, show error details
    req.on('error', function(e) {
        console.error(e);
    });
    // finish sending the request
    req.end();
}

// start 1000 requests without waiting for any of them to finish
for(var i = 0; i < 1000; ++i) {
    getData(i);
}
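Since promises were mentioned as an alternative to callbacks, here is a rough sketch of the same idea with the request wrapped in a promise. The names fetchText() and demo() are my own, and the tiny local server is only an assumption so the example runs without touching a real site; in your scraper you would pass your real hostname and paths instead.

```javascript
var http = require('http');

// Wrap a single GET request in a promise that resolves with the full body.
function fetchText(options) {
    return new Promise(function (resolve, reject) {
        var req = http.request(options, function (response) {
            var fullText = '';
            response.setEncoding('utf8');
            response.on('data', function (chunk) { fullText += chunk; });
            response.on('end', function () { resolve(fullText); });
        });
        req.on('error', reject);
        req.end();
    });
}

// Start five requests at once and collect all the bodies.
function demo() {
    // Hypothetical local server standing in for the real site (assumption).
    var server = http.createServer(function (req, res) {
        res.end('<data>' + req.url + '</data>');
    });
    return new Promise(function (resolve) {
        // port 0 lets the operating system pick a free port
        server.listen(0, '127.0.0.1', resolve);
    }).then(function () {
        var port = server.address().port;
        var jobs = [];
        for (var i = 0; i < 5; ++i) {
            jobs.push(fetchText({
                hostname: '127.0.0.1',
                port: port,
                path: '/data' + i + '.xml',
                method: 'GET'
            }));
        }
        // resolves once every request has finished
        return Promise.all(jobs);
    }).finally(function () {
        server.close();
    });
}

demo().then(function (bodies) {
    console.log(bodies.join('\n'));
}).catch(function (e) {
    console.error(e);
});
```

The requests still complete in whatever order the responses arrive, but Promise.all() hands the bodies back in the order the requests were started, which avoids the "out of order" printing if that matters to you.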