Scraping fully rendered webpage with nodejs

Question

I am trying to get Amazon pricing information with Node.js.

Here's the target url: http://aws.amazon.com/ec2/pricing/

But the pricing tables I read in Node.js are not fully rendered: the response contains only the JavaScript that builds them, not the resulting content.

So far I have tried jsdom, jquerygo and phantom, without success. Even setting timeouts does not help. Can anyone please provide a working solution for this specific case?

Thanks and best regards.

have you looked at http://stackoverflow.com/questions/9966826/save-and-render-a-webpage-with-phantomjs-and-node-js?rq=1? — bbill, Jun 23 '15 at 13:44
I recently started using [node-horseman](https://github.com/johntitus/node-horseman), and its abstraction is pretty great. — danemacmillan, Jun 23 '15 at 14:04

score 0 · Answer 1 · answered Jun 23 '15 at 13:54

There are several ways to scrape a web page with Node.js.

This example is based on SpookyJS:

    var Spooky = require('spooky');

    var spooky = new Spooky({
        child: {
            transport: 'http'
        },
        casper: {
            logLevel: 'debug',
            verbose: true
        }
    }, function (err) {
        if (err) {
            // Declare e locally instead of leaking an implicit global.
            var e = new Error('Failed to initialize SpookyJS');
            e.details = err;
            throw e;
        }

        spooky.start(
            'http://en.wikipedia.org/wiki/Spooky_the_Tuff_Little_Ghost');
        spooky.then(function () {
            // Runs in the CasperJS context; evaluate() runs inside the page.
            this.emit('hello', 'Hello, from ' + this.evaluate(function () {
                return document.title;
            }));
        });
        spooky.run();
    });

    spooky.on('error', function (e, stack) {
        console.error(e);

        if (stack) {
            console.log(stack);
        }
    });

    // Forward console output to the Node console.
    spooky.on('console', function (line) {
        console.log(line);
    });

    // Receives the custom 'hello' event emitted above.
    spooky.on('hello', function (greeting) {
        console.log(greeting);
    });

    spooky.on('log', function (log) {
        if (log.space === 'remote') {
            console.log(log.message.replace(/ \- .*/, ''));
        }
    });

Note: SpookyJS gives you the flexibility to run CasperJS and PhantomJS from Node.js.

Thank you for the answer. I didn't try this solution because I managed to read the content using the phantom module and PhantomJS. I also posted that as an answer. But anyway, thank you again. — Saro, Jun 27 '15 at 12:46

score 0 · Accepted Answer · answered Jun 27 '15 at 12:43

This solved my issue:

I noticed that when I installed the phantom module in Node, it complained about the installed PhantomJS version (version 2) and downloaded version 1.9.8 to a temporary location.

So I installed version 1.9.8 instead and pointed the PATH variable at it, and it worked! Note also that inside the page.open(...) callback you must call setTimeout with quite a long delay (about 35 seconds in my case) so that the whole page is fully loaded and rendered.
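A fixed delay works, but a polling loop can return as soon as the content appears instead of always waiting the full time. Here is a minimal sketch of that alternative. It is not what I actually ran: `waitUntil` and the `check` function passed to it are hypothetical names, and neither is part of the phantom module.

```javascript
// Hypothetical helper: poll `check` until it reports a truthy value,
// an error occurs, or `timeoutMs` milliseconds have elapsed.
// `check` takes a Node-style callback(err, value).
function waitUntil(check, timeoutMs, intervalMs, done) {
    var started = Date.now();

    (function poll() {
        check(function (err, value) {
            if (err) {
                return done(err);
            }
            if (value) {
                // Content is ready: stop polling and hand the value back.
                return done(null, value);
            }
            if (Date.now() - started >= timeoutMs) {
                return done(new Error('Timed out after ' + timeoutMs + ' ms'));
            }
            // Not ready yet: try again after a short pause.
            setTimeout(poll, intervalMs);
        });
    })();
}
```

Inside page.open(...), `check` could wrap a page.evaluate call that reports whether the pricing table has been rendered. How page.evaluate delivers its result depends on the version of the phantom module, so that wrapper is left out here.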
