I'm pretty new to Node.js.
I'm trying to download some HTML from a website in order to parse it and present some information for debugging.
I succeeded with the http module (see this post), but when I print each chunk this way:

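var http = require("http");

// `options` holds the host, path and method of the target page.
var req = http.request(options, function(res) {
    res.setEncoding("utf8");
    res.on("data", function (chunk) {
        console.log(chunk);
    });
});
req.end(); // send the request; without this call it is never dispatched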

I don't get the HTML that is loaded dynamically, for instance with Ajax:

<div class="container">
  ::before
      <div class="row">
        ::before
....
</div>

Is there any other module that can help me with this goal?

Thanks!

update

I would like to share with you my success (thanks to @oKonyk).

  • npm install phantomjs
  • create your script
  • use the same code suggested by @oKonyk

Note that if you're running your script locally, you need to set this option:

var options = { 'web-security': 'no' };
phantom.create({parameters: options}, function() {});
Luca Davanzo

1 Answer

To capture dynamically built pages, you have to render them in a browser. There are several options for doing that with Node.js.

I would suggest using PhantomJS, which is a so-called headless browser.

As a proof of concept, you can install it globally with npm install phantomjs -g. Create a test script 'google.js' with the following content:

var page = require('webpage').create();
console.log('The default user agent is ' + page.settings.userAgent);
page.settings.userAgent = 'SpecialAgent';
page.open('http://www.google.org', function(status) {
  if (status !== 'success') {
    console.log('Unable to access network');
  } else {
    var html = page.evaluate(function() {
      return document.getElementsByTagName('html')[0].innerHTML;
    });
    console.log(html);
  }
  phantom.exit();
});

Then run it as phantomjs google.js

You will get the whole DOM of the page printed (at least everything within the <html> tags), which is different from the raw response that you get with the http module.

Later you can use phantom within your node project (more info here).
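
One way to call the script from Node is to start phantomjs as a child process and read what it prints. The following is only a minimal sketch, assuming the phantomjs binary is on your PATH; buildPhantomArgs and renderWithPhantom are hypothetical helper names, not part of any library. Options such as web-security are passed as command-line switches before the script path:

```javascript
var execFile = require('child_process').execFile;

// Hypothetical helper: turn an options object such as
// { 'web-security': 'no' } into PhantomJS command-line switches.
function buildPhantomArgs(scriptPath, options) {
  var args = Object.keys(options || {}).map(function (name) {
    return '--' + name + '=' + options[name];
  });
  // PhantomJS expects its own switches before the script path.
  return args.concat([scriptPath]);
}

// Hypothetical helper: run a PhantomJS script and hand its stdout
// (the rendered HTML printed by the script) to the callback.
// Assumes the `phantomjs` executable is available on the PATH.
function renderWithPhantom(scriptPath, options, callback) {
  var args = buildPhantomArgs(scriptPath, options);
  // Raise maxBuffer because a full page can exceed the default limit.
  execFile('phantomjs', args, { maxBuffer: 10 * 1024 * 1024 }, function (err, stdout) {
    callback(err, stdout);
  });
}

// Example usage (not executed here):
// renderWithPhantom('google.js', { 'web-security': 'no' }, function (err, html) {
//   if (err) { return console.error(err); }
//   console.log(html);
// });
```

Keeping the argument building separate from the process call makes it easy to check which switches are actually sent to PhantomJS when an option does not seem to take effect.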

oKonyk
  • Thanks for answer @oKonyk. With this solution I get "phantom stdout: NETWORK_ERR: XMLHttpRequest Exception 101: A network error occurred in synchronous requests." I'm running script from localhost, so I suppose that I have to setup "--web-security=false", isn't it? Where can I set this option? I'm running my script with node. – Luca Davanzo Feb 02 '16 at 12:01
  • I've created phantom.create({'web-security':'no'}, function (ph) {}); but errors still appear! – Luca Davanzo Feb 02 '16 at 12:09
  • So the URL that you are trying to access is `localhost`? If it's just any public URL, I could try to access it from my system... When you are saying that you run your script with node, do you mean you have your `phantomjs` process as child process on your main node script? – oKonyk Feb 02 '16 at 17:46